This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: context[2] stuck: (null)
- From: Arkady <arkady dot miasnikov at gmail dot com>
- To: systemtap at sourceware dot org
- Date: Tue, 11 Jul 2017 09:46:59 +0300
- Subject: Re: context[2] stuck: (null)
- References: <CANA-60q25-tnw72LjrtgMaavsE=VCee4-66fkOHbKBAHWSNqDA@mail.gmail.com> <CANA-60pAhrCXkzkNS5sY_Cypd-_ky=zfHmxHi-dj06QFkrOQAg@mail.gmail.com>
Update: some of the system calls I make in the begin probe are
blocking. As I understand it, blocking inside a probe handler will
break things on multicore systems. Am I right?
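
A minimal user-space sketch of how a blocking call could leave a
context marked busy. This is a model, not SystemTap's runtime code. It
assumes that each CPU has its own context, that the put side releases a
context only when it still belongs to the CPU the task is currently on,
and that the patched runtime records the put line before that check.
The names struct context, get_context, put_context, wait_contexts,
run_probe and current_cpu are hypothetical stand-ins.

```c
#include <stdio.h>

#define NCPUS 2

/* Hypothetical stand-in for a per-CPU probe context. */
struct context {
    int busy;      /* non-zero while a probe handler owns this context */
    int line_get;  /* call site of the last get (as in the patched runtime) */
    int line_put;  /* call site of the last put */
};

static struct context contexts[NCPUS];
static int current_cpu;  /* simulated: the CPU the task is running on */

static struct context *get_context(int line)
{
    struct context *c = &contexts[current_cpu];
    if (c->busy)
        return NULL;     /* reentrancy: the probe would be skipped */
    c->busy = 1;
    c->line_get = line;
    return c;
}

static void put_context(struct context *c, int line)
{
    if (!c)
        return;
    c->line_put = line;
    /* Assumption: release only if c still belongs to the current CPU. */
    if (c == &contexts[current_cpu])
        c->busy = 0;
}

/* Reports every context that is still busy and returns how many. */
static int wait_contexts(void)
{
    int stuck = 0;
    for (int i = 0; i < NCPUS; i++) {
        if (contexts[i].busy) {
            printf("context[%d] stuck: line_get=%d, line_put=%d\n",
                   i, contexts[i].line_get, contexts[i].line_put);
            stuck++;
        }
    }
    return stuck;
}

/* Probe body. If blocks is non-zero, the task "sleeps" between get and
 * put and is modeled as waking up on the next CPU. */
static void run_probe(int blocks)
{
    struct context *c = get_context(__LINE__);
    if (blocks)
        current_cpu = (current_cpu + 1) % NCPUS;
    put_context(c, __LINE__);
}
```

In this model, setting current_cpu = 1 and calling run_probe(0) leaves
nothing busy, while run_probe(1) from the same CPU leaves context[1]
busy with both line_get and line_put recorded, because the put runs on
CPU 0 and skips the release. On a single CPU the migration cannot
happen, so the model never gets stuck there.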
On Tue, Jul 11, 2017 at 9:24 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> Update. The failure happens consistently in the same context
>
> "context[1] stuck: (null), line_get=36848, line_put=36914
> last_err=(null) last_stmt=identifier 'probe_begin'"
> where line_get and line_put are line numbers in enter_be_probe():
>
> 36843 #endif
> 36844 goto probe_epilogue;
> 36845 }
> 36846 if (atomic_read (session_state()) != stp->state)
> 36847 goto probe_epilogue;
> 36848 c = _stp_runtime_entryfn_get_context(__LINE__);
> 36849 if (!c) {
> 36850 #if !INTERRUPTIBLE
> 36851 atomic_inc (skipped_count());
> 36852 #endif
> 36853 #ifdef STP_TIMING
> 36854 atomic_inc (skipped_count_reentrant());
> 36855 #endif
> 36856 goto probe_epilogue;
> 36857 }
> ..................
> 36907 }
> 36908 }
> 36909 probe_epilogue:
> 36910 if (unlikely (atomic_read (skipped_count()) > MAXSKIPPED)) {
> 36911 if (unlikely (pseudo_atomic_cmpxchg(session_state(), STAP_SESSION_RUNNING, STAP_SESSION_ERROR) == STAP_SESSION_RUNNING))
> 36912 _stp_error ("Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.");
> 36913 }
> 36914 _stp_runtime_entryfn_put_context(c, __LINE__);
> 36915 #if !INTERRUPTIBLE
> 36916 local_irq_restore (flags);
> 36917 #endif
> 36918 #endif // STP_ALIBI
>
> On Mon, Jul 10, 2017 at 6:16 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
>> Hi,
>>
>> I am getting a "context[2] stuck: (null)" error. The cause is
>> likely the "unmanaged" code I have added to the driver. Specifically,
>> I have shared memory (mmap) in the driver. The failure happens
>> randomly, once every 50-200 module restarts. It happens only on
>> multicore CPUs, or at least only there often enough to be caught.
>>
>> I tried to force the wait function with
>> STAP_OVERRIDE_STUCK_CONTEXT; the kernel then panics in one of the
>> (probably random) probes.
>>
>> While debugging the issue I patched the SystemTap source code: I
>> added an argument to _stp_runtime_entryfn_get_context(int), as in
>> this commit:
>> https://github.com/larytet/SystemTap/commit/61a284732893fa6f201e07f9f12f5e1820e7c26f
>> In the function _stp_runtime_context_wait() I print the line in the
>> source code which called _stp_runtime_entryfn_get_context().
>>
>> The "bad" context is enter_be_probe(). I checked the source code of
>> enter_be_probe() and there is not much there.
>>
>> I have been struggling with this problem for some time and would
>> greatly appreciate any tips.
>>
>> Thank you, Arkady.