This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: context[2] stuck: (null)


Update: some of the system calls I make in the begin probe are
blocking. As I understand it, that will break things on multicore
systems. Am I right?
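
To make the question concrete, here is a minimal user-space sketch of the suspected mechanism. It assumes the runtime marks a per-CPU context as busy between the get and put calls; all names (struct ctx, get_context, put_context, first_stuck_context, NCPUS) are hypothetical stand-ins, not the SystemTap runtime API. If a handler blocks between get and put, the context stays busy, later probes on that CPU are skipped, and a shutdown check would report the context as stuck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPUS 4  /* hypothetical CPU count for the model */

/* Hypothetical stand-in for a per-CPU probe context. */
struct ctx {
    atomic_int busy;  /* 1 while a probe handler owns the context */
};

static struct ctx contexts[NCPUS];

/* Claim the context of `cpu`; returns NULL if it is already owned. */
static struct ctx *get_context(int cpu)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&contexts[cpu].busy, &expected, 1))
        return &contexts[cpu];
    return NULL;  /* a real runtime would count this as a skipped probe */
}

/* Release a context obtained from get_context(). NULL is ignored. */
static void put_context(struct ctx *c)
{
    if (!c)
        return;
    assert(atomic_load(&c->busy) == 1);
    atomic_store(&c->busy, 0);
}

/* Shutdown-time check: index of the first busy context, or -1 if none. */
static int first_stuck_context(void)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (atomic_load(&contexts[cpu].busy))
            return cpu;
    return -1;
}
```

In this model, calling get_context(1) and then blocking before put_context() makes first_stuck_context() return 1 until the handler resumes and releases the context.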

On Tue, Jul 11, 2017 at 9:24 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> Update: the failure happens consistently in the same context:
>
> "context[1] stuck: (null), line_get=36848, line_put=36914
> last_err=(null) last_stmt=identifier 'probe_begin'"
> where line_get and line_put are line numbers in enter_be_probe():
>
> 36843     #endif
> 36844     goto probe_epilogue;
> 36845   }
> 36846   if (atomic_read (session_state()) != stp->state)
> 36847     goto probe_epilogue;
> 36848   c = _stp_runtime_entryfn_get_context(__LINE__);
> 36849   if (!c) {
> 36850     #if !INTERRUPTIBLE
> 36851     atomic_inc (skipped_count());
> 36852     #endif
> 36853     #ifdef STP_TIMING
> 36854     atomic_inc (skipped_count_reentrant());
> 36855     #endif
> 36856     goto probe_epilogue;
> 36857   }
> ..................
> 36907     }
> 36908   }
> 36909 probe_epilogue:
> 36910   if (unlikely (atomic_read (skipped_count()) > MAXSKIPPED)) {
> 36911     if (unlikely (pseudo_atomic_cmpxchg(session_state(), STAP_SESSION_RUNNING, STAP_SESSION_ERROR) == STAP_SESSION_RUNNING))
> 36912     _stp_error ("Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.");
> 36913   }
> 36914   _stp_runtime_entryfn_put_context(c, __LINE__);
> 36915   #if !INTERRUPTIBLE
> 36916   local_irq_restore (flags);
> 36917   #endif
> 36918   #endif // STP_ALIBI
>
> On Mon, Jul 10, 2017 at 6:16 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
>> Hi,
>>
>> I am getting a "context[2] stuck: (null)" error. The likely cause is
>> the "unmanaged" code I have added to the driver; specifically, I have
>> a shared memory region (mmap) in the driver. The failure happens
>> randomly, once every 50-200 module restarts. It happens only on
>> multicore CPUs, or at least only there often enough to be caught.
>>
>> I tried to force the wait function with
>> STAP_OVERRIDE_STUCK_CONTEXT, but the kernel panics in one of the
>> (probably random) probes.
>>
>> While debugging the issue I patched the SystemTap source code: I added
>> an argument to _stp_runtime_entryfn_get_context(int), as in this
>> commit https://github.com/larytet/SystemTap/commit/61a284732893fa6f201e07f9f12f5e1820e7c26f
>> In _stp_runtime_context_wait() I print the source line that called
>> _stp_runtime_entryfn_get_context().
>>
>> The "bad" context is enter_be_probe(). I checked the source code of
>> enter_be_probe() and there is not much there.
>>
>> I have been struggling with this problem for some time and would
>> greatly appreciate any tips.
>>
>> Thank you, Arkady.
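
For reference, the line-tagging idea described in the original message can be sketched in user space roughly as follows. This is an illustration under assumptions, not the code from the linked commit: struct ctx, get_context_at, put_context_at, the GET_CONTEXT/PUT_CONTEXT macros and describe_context are hypothetical names, and a single global context stands in for the per-CPU ones. The report format only imitates the "context[N] stuck" message quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical context that remembers which source lines touched it. */
struct ctx {
    int busy;      /* non-zero while a handler owns the context */
    int line_get;  /* line of the most recent successful get */
    int line_put;  /* line of the most recent put */
};

static struct ctx the_ctx;  /* single context for the sketch */

/* Claim the context and record the caller's line; NULL if already busy. */
static struct ctx *get_context_at(int line)
{
    if (the_ctx.busy)
        return NULL;
    the_ctx.busy = 1;
    the_ctx.line_get = line;
    return &the_ctx;
}

/* Release the context and record the caller's line. NULL is ignored. */
static void put_context_at(struct ctx *c, int line)
{
    if (!c)
        return;
    assert(c->busy);
    c->line_put = line;
    c->busy = 0;
}

/* Call-site wrappers: __LINE__ expands where the macro is used. */
#define GET_CONTEXT()   get_context_at(__LINE__)
#define PUT_CONTEXT(c)  put_context_at((c), __LINE__)

/* Format a one-line report for a waiter that finds the context busy. */
static int describe_context(const struct ctx *c, int idx,
                            char *buf, size_t len)
{
    return snprintf(buf, len, "context[%d] %s: line_get=%d, line_put=%d",
                    idx, c->busy ? "stuck" : "idle",
                    c->line_get, c->line_put);
}
```

A waiter that sees busy set can call describe_context() to learn which call site last obtained the context, which is how the line_get and line_put values in the report above are meant to be read.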

