This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Re: Looking for recommendation for using SystemTap

From: Tony Reix <tony dot reix at bull dot net>
To: "Frank Ch. Eigler" <fche at redhat dot com>
Cc: "systemtap at sourceware dot org" <systemtap at sourceware dot org>
Date: Mon, 02 Oct 2006 15:42:07 +0200
Subject: Re: Looking for recommendation for using SystemTap
References: <1159534951.28410.54.camel@frecb000687.frec.bull.fr> <y0mzmcid2vu.fsf@ton.toronto.redhat.com>

On Friday, 29 September 2006 at 13:00 -0400, Frank Ch. Eigler wrote:
> Tony Reix <tony.reix@bull.net> writes:
> 
> > [...]
> > Analysis of the Oopses clearly shows that "someone" writes strings
> > (like "ata" or "ejbo") randomly in memory and destroys links in
> > structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
> > ulp->proc_list used by loop_undo in ipc/sem.c.
> > [...]
> > Do you think SystemTap can help me find the culprit?
> > [...]
> 
> Perhaps.  Does the memory corruption occur in predictable places?
> Imagine a probe that runs periodically (via a frequently triggered
> timer, or a breakpoint at a code point under suspicion).  That probe
> could look through selected places that are corrupted, and check for
> something suspicious.

Up to now, each run (3 of them) has produced an Oops in a different
place (in a different linked list).

Enabling more debugging options in .config now makes the kernel crash at
the moment the memory is corrupted. It seems the code I'm trying to test
is the culprit!
A suggestion: have the kernel include some basic SystemTap code when these
options are enabled (memory leak debug, compile kernel with frame ...,
write protect kernel read-only data ...), to help identify which code is
writing in the wrong places.


> For example:
> 
>   #! stap -g
>   probe kernel.function("after_your_function") { if (checkstuff ()) log ("bug") }
>   function checkstuff () /* .... */
> 
> What checkstuff() does depends on how a program may be able to assess
> corruption.  If it's ASCII strings showing up within known regions of
> valid memory, something like this naive search could do it.  (Such a
> function could be encapsulated into the systemtap tapset library).
> 
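>   function checkstuff () %{
>     /* Placeholder bounds of the region to scan; substitute real addresses. */
>     char *begin = (char *) 0xdeadbeef;
>     char *end = (char *) 0xdeadf00d;   /* exclusive upper bound */
>     int found = 0;
>     char *p;
>     /* p + 2 < end keeps all three bytes read inside [begin, end). */
>     for (p = begin; p + 2 < end; p++)
>       if (p[0] == 'a' && p[1] == 't' && p[2] == 'a') { found = 1; break; }
>     THIS->__retval = found;
>   %}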
> 
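[Editorial sketch, not part of the original exchange.] The naive search
above can be generalized to any byte pattern, such as "ata" or "ejbo".
The following is a minimal user-space C sketch of that scan, under the
assumption that the region [begin, end) is readable. The name
find_pattern and its signature are hypothetical and are not part of
SystemTap or its tapset library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: scan the region [begin, end) for the first
 * occurrence of the len-byte pattern pat.
 * Returns the offset of the match from begin, or -1 if there is none.
 */
long find_pattern(const char *begin, const char *end,
                  const char *pat, size_t len)
{
    const char *p;

    /* Reject empty patterns, bad bounds, and regions shorter than pat. */
    if (begin == NULL || end == NULL || pat == NULL || len == 0)
        return -1;
    if (end < begin || (size_t)(end - begin) < len)
        return -1;

    /* p + len <= end guarantees memcmp never reads past the region. */
    for (p = begin; p + len <= end; p++)
        if (memcmp(p, pat, len) == 0)
            return (long)(p - begin);

    return -1;
}
```

Returning the offset rather than a flag shows where the stray string
landed, which may help match the corruption to a particular structure.
Inside a probe, the same loop would have to be restricted to memory that
is known to be mapped, since reading arbitrary addresses from embedded C
is not checked.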
> Later, we will have hardware-assisted watchpoint probes that hit when
> a designated area of memory is read and/or written.  That could narrow
> the culprits down even further.  This might look something like:
> 
>   probe kernel.watch.from(0xdeadbeef).to(0xdeadf00d).string("ata")
>     { log ("bug") }
> 
> 
> Anyway, this all depends on being able to characterize the corruption
> well enough that a routine could be written to safely check for it.

I think I now have this very important information, obtained by recompiling
the kernel with the options I mentioned above (Kernel Hacking).

> If you don't have even that much information, very drastic measures
> may be necessary (such as running the kernel under a simulator or
> debugger).

Yes. We've talked about that with colleagues (UML, ...).
They had to fix bugs in the tools before they were able to find their
problem ...

Regards,

Tony
