This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: Looking for recommendation for using SystemTap
- From: Tony Reix <tony dot reix at bull dot net>
- To: "Frank Ch. Eigler" <fche at redhat dot com>
- Cc: "systemtap at sourceware dot org" <systemtap at sourceware dot org>
- Date: Mon, 02 Oct 2006 15:42:07 +0200
- Subject: Re: Looking for recommendation for using SystemTap
- References: <1159534951.28410.54.camel@frecb000687.frec.bull.fr> <y0mzmcid2vu.fsf@ton.toronto.redhat.com>
On Friday, 29 September 2006 at 13:00 -0400, Frank Ch. Eigler wrote:
> Tony Reix <tony.reix@bull.net> writes:
>
> > [...]
> > The analysis of the Oopses clearly shows that "someone" writes strings
> > (like "ata" or "ejbo") randomly in memory and destroys links in
> > structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
> > ulp->proc_list used by lookup_undo in ipc/sem.c.
> > [...]
> > Do you think SystemTap can help me find the culprit?
> > [...]
>
> Perhaps. Does the memory corruption occur in predictable places?
> Imagine a probe that runs periodically (via a frequently triggered
> timer, or a breakpoint at a code point under suspicion). That probe
> could look through selected places that are corrupted, and check for
> something suspicious.
Up to now, each run (3 of them) has produced an Oops in a different
place (in a different linked list).
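Since the corruption shows up as broken list links, one kind of check a periodic probe could run is a bounded walk over a list that verifies every link. The following is only a minimal user-space C sketch of that idea, not kernel or SystemTap code. The names `struct node` and `check_list`, the assumption that all valid nodes live in one contiguous pool, and the return codes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical singly linked node; a real kernel structure would differ. */
struct node {
    struct node *next;
};

/*
 * Walk the list starting at head and verify every link.
 * Assumption: all valid nodes live in the array pool[0..pool_len).
 * Returns  0 if the list is sane,
 *         -1 if a pointer falls outside the pool or is misaligned,
 *         -2 if more than max_nodes are visited (possible cycle).
 */
static int check_list(const struct node *head, const struct node *pool,
                      size_t pool_len, size_t max_nodes)
{
    const struct node *n = head;
    size_t seen = 0;
    uintptr_t lo, hi;

    assert(pool != NULL);
    lo = (uintptr_t)pool;
    hi = (uintptr_t)(pool + pool_len);

    while (n != NULL) {
        uintptr_t a = (uintptr_t)n;

        /* Check the address before dereferencing the node. */
        if (a < lo || a >= hi || (a - lo) % sizeof(struct node) != 0)
            return -1;
        if (++seen > max_nodes)
            return -2;
        n = n->next;
    }
    return 0;
}
```

The address is validated before the node is dereferenced, so a link overwritten with string bytes is reported instead of followed. The node limit keeps the walk bounded even if the corruption creates a cycle.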
Enabling more options in .config now leads to a crash at the moment the
memory is corrupted. It seems the code I'm trying to test is the culprit!
A suggestion: add some basic SystemTap code to the kernel when these
options are used (memory leak debug, compile kernel with frame ...,
write protect kernel read-only data ...), so that it helps to understand
which code is writing in the wrong places.
> For example:
>
> #! stap -g
> probe kernel.function("after_your_function") { if (checkstuff ()) log ("bug") }
> function checkstuff () /* .... */
>
> What checkstuff() does depends on how a program may be able to assess
> corruption. If it's ASCII strings showing up within known regions of
> valid memory, something like this naive search could do it. (Such a
> function could be encapsulated into the systemtap tapset library).
>
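> function checkstuff () %{
>   /* Placeholder bounds: replace with the region to be scanned. */
>   char *begin = (char *) 0xdeadbeef;
>   char *end = (char *) 0xdeadf00d;
>   int found = 0;
>   char *p;
>   /* Naive scan for the 3-byte sequence "ata" within [begin, end). */
>   for (p = begin; p+3 <= end; p++)
>     if (p[0] == 'a' && p[1] == 't' && p[2] == 'a') found=1;
>   THIS->__retval = found;
> %}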
>
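The naive search quoted above can be tried outside the kernel first. What follows is a minimal sketch in plain user-space C, assuming the suspect region is an ordinary byte buffer. The helper name `find_pattern` and its offset-or-minus-one return convention are hypothetical and are not part of SystemTap or its tapset library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: search [begin, end) for the bytes of pattern.
 * Returns the offset of the first match from begin, or -1 if none.
 */
static long find_pattern(const unsigned char *begin, const unsigned char *end,
                         const char *pattern)
{
    size_t len = strlen(pattern);
    const unsigned char *p;

    assert(begin <= end);

    /* An empty pattern or a region shorter than the pattern cannot match. */
    if (len == 0 || (size_t)(end - begin) < len)
        return -1;

    /* Stop while a full pattern still fits before end. */
    for (p = begin; p + len <= end; p++) {
        if (memcmp(p, pattern, len) == 0)
            return (long)(p - begin);
    }
    return -1;
}
```

Returning the offset rather than a flag shows where in the region the stray string landed, which may help identify the structure that was overwritten.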
> Later, we will have hardware-assisted watchpoint probes that hit when
> a designated area of memory is read and/or written. That could narrow
> the culprits down even further. This might look something like:
>
> probe kernel.watch.from(0xdeadbeef).to(0xdeadf00d).string("ata")
> { log ("bug") }
>
>
> Anyway, this all depends on being able to characterize the corruption
> well enough that a routine could be written to safely check for it.
I think I now have this very important information, obtained by recompiling
the kernel with the options I mentioned earlier (Kernel Hacking).
> If you don't have even that much information, very drastic measures
> may be necessary (such as running the kernel under a simulator or
> debugger).
Yes. We've talked about that with colleagues ... UML, ...
They had to fix bugs in the tools before being able to find their
problem ...
Regards,
Tony