This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: exercising current aarch64 kprobe support with systemtap


On 06/27/2016 10:18 AM, Pratyush Anand wrote:
> Hi Will,
> 
> On 23/06/2016:03:22:44 PM, William Cohen wrote:
>> On 06/23/2016 02:26 PM, David Long wrote:
>>> On 06/23/2016 11:49 AM, William Cohen wrote:
>>>> On 06/22/2016 11:18 PM, David Long wrote:
>>>>> On 06/22/2016 04:24 PM, William Cohen wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> When running the current systemtap checked out from the git repository
>>>>>> and a locally built kernel with the kprobes64-v13 patches (the
>>>>>> test_upstream_arm64_devel branch of
>>>>>> https://github.com/pratyushanand/linux) on a Fedora 23 machine, one of
>>>>>> the kprobes_onthefly.exp tests is causing the machine to get in a
>>>>>> state that requires rebooting to fix.  This can be triggered by running a
>>>>>> portion of the systemtap tests with:
>>>>>>
>>>>>>    make installcheck RUNTESTFLAGS="--debug systemtap.onthefly/kprobes_onthefly.exp"
>>>>>>
>>>>>> When it gets to the kprobes_onthefly - otf_stress_max_iter_5000 test the
>>>>>> console starts spewing the following and needs to be rebooted:
>>>>>>
>>>>>> [23394.036860] Unexpected kernel single-step exception at EL1
>>>>>> [23394.042434] Unexpected kernel single-step exception at EL1
>>>>>> [23394.048008] Unexpected kernel single-step exception at EL1
>>>>>> [23394.053541] Unexpected kernel single-step exception at EL1
>>>>>> [23394.059053] Unexpected kernel single-step exception at EL1
>>>>>> [23394.064545] Unexpected kernel single-step exception at EL1
>>>>>>
>>>>>> Sorry, I don't have the start of the failure; it scrolled off the screen very quickly.
>>>>>>
>>>>>> -Will
>>>>>>
>>>>>>
>>>>>
>>>>> I'll take a look and see what I can figure out.
>>>>>
>>>>> In the meantime I did just push a v14 branch.  I'm doubtful that it will address the above problem even though it contains a few bug fixes.
>>>>>
>>>>> -dl
>>>>>
>>>>
>>>> Hi Dave and Pratyush,
>>>>
>>>> I tried the kprobes64-v13 kernel and it also seems to work, so it looks like the problem might be in the
>>>> test_upstream_arm64_devel branch of https://github.com/pratyushanand/linux .
>>>>
>>>> -Will
>>>>
>>>
>>> I'm going to interpret that as meaning you know of no problem in the kprobes v14 patch that would give me pause to email it upstream.  Do you disagree?
>>>
>>> -dl
>>>
>>
>> Hi Dave,
>>
>> Yes, the problem only seems to be in that other kernel from https://github.com/pratyushanand/linux with the kprobe and uprobe patches, so the arm64 patches do not appear to be the problem.  I don't know what is causing the problem; maybe something went wrong in porting the patches to that kernel, or it comes from the other patches included there (uprobes/kexec).
> 
> Just to update:
> 
> I confirm that the problem arises only after the uprobe patches are applied,
> but I am not yet sure that the actual culprit is the uprobe code.
> 
> I can see that kprobes_onthefly.exp also exercises uprobes in the test. It
> seems that when the problem happens, there is a kprobe at print_worker_info().
> 
> Most likely a re-entrant kprobe is hit when a kprobe is placed at
> print_worker_info(). I guessed it could be show_regs() from the arm64/kprobe
> code, but commenting out show_regs() did not make any difference. Even
> blacklisting print_worker_info() did not resolve it; the problem reproduced in
> a different way after blacklisting.
> 
> So it is still vague, and debugging continues.
> If I can clearly understand the systemtap test code, then it will probably be
> easier to debug. I mean, if I can get the names of the kernel and user space
> symbols where this test is placing probes, that would help a lot to narrow it
> down.
> 
> ~Pratyush
> 

Hi Pratyush,

My understanding is that the systemtap onthefly support enables/disables the probe as mentioned in the following systemtap bugzilla entry (and the ones that it depends on): https://sourceware.org/bugzilla/show_bug.cgi?id=10995.  It would be handy to get things pared down to the systemtap script that triggers the problem.  After putting in some diagnostic puts, it looks like the script that triggers the problem is something like the attached onthefly_trigger.stp (that was gathered on an x86_64 machine, so it might not be exactly what is causing the problem on aarch64).  David Smith, any suggestions on how to debug this, based on your experiences from https://sourceware.org/bugzilla/show_bug.cgi?id=17126, where ppc64 had a similar issue with onthefly testing?
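
For listing the kernel and user space symbols that a script such as onthefly_trigger.stp probes, something along these lines might help. This is only a rough Python sketch: the regular expression handles just the simple kernel.function("...") and process("...").function("...") forms, and SAMPLE is a hypothetical stand-in that I made up, not the contents of the attached script.

```python
import re

# Matches probe points of the form kernel.function("name") or
# process("path").function("name"). Assumption: no wildcards, statement
# probes, or return probes need to be handled here.
PROBE_RE = re.compile(
    r'(?:(kernel)|process\("([^"]*)"\))\.function\("([^"]+)"\)'
)


def probe_targets(script_text):
    """Return (kind, target, function) tuples for each probe point found."""
    targets = []
    for m in PROBE_RE.finditer(script_text):
        if m.group(1):
            # Kernel probe: there is no executable path to report.
            targets.append(("kernel", None, m.group(3)))
        else:
            # User space probe: group 2 is the executable path.
            targets.append(("process", m.group(2), m.group(3)))
    return targets


# Hypothetical sample script text, NOT the attached onthefly_trigger.stp.
SAMPLE = '''
global enabled = 0
probe kernel.function("print_worker_info") if (enabled) { }
probe process("/bin/true").function("main") if (enabled) { }
probe timer.ms(100) { enabled = !enabled }
'''

if __name__ == "__main__":
    for kind, target, func in probe_targets(SAMPLE):
        print(kind, target, func)
```

Running it over the real script text instead of SAMPLE should give the list of functions to look at first; the timer probe is skipped because it has no function target.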

The "Unexpected kernel single-step exception at EL1" message reminds me of the times when kprobes couldn't find a handler.  Maybe there is some situation where the kprobe is being removed but the breakpoint is still around.  Did you get a backtrace with the insertion of a "BUG()" where that message is printed out?  I wonder if it might be triggered by (thread_flags & _TIF_UPROBE) somehow being true and the aarch64 do_notify_resume starting to run.

-Will

Attachment: onthefly_trigger.stp
Description: Text document

