This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



a framework of generating whitelist for safe-mode probes


Hi,

Here is an implementation of whitelist generation for safe-mode probes,
based on the discussion in an earlier thread
(http://sourceware.org/ml/systemtap/2006-q3/msg00574.html).

The main idea is:
 1) Fetch a group of probe points from probes.pending, probe them, and
    run some workload (e.g. runltp -t 60s) at the same time.

 2) If the probe test ends without a crash, the probe points that were
    actually triggered are moved into probes.passed and the untriggered
    ones are moved into probes.untriggered.
    If the probe test crashes the system, the test resumes
    automatically after the system reboots. The probe points that were
    triggered are still moved into probes.passed, but the
    untriggered ones are moved into probes.failed.

 3) Repeat the above until probes.pending becomes empty. Then:
    Normally, probes.pending is reinitialized from probes.failed
    (or from probes.untriggered if probes.failed is empty) and the
    next iteration starts.
    But if the maximum iteration limit (e.g. 5) is reached, or
    probes.pending, probes.failed and probes.untriggered are all
    empty, the whole test stops.
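
The three steps above can be sketched roughly as follows. This is only
an in-memory illustration in Python, not the code in the attached
tarball: the lists stand in for the probes.* files, and run_group is a
hypothetical callback that probes one group under the workload and
reports whether the system crashed and which points were triggered. The
group size of 100 is an assumed value; the real test also has to
survive a reboot, which this sketch does not model.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical callback: probe one group while the workload runs and
# return (crashed, set_of_triggered_points).
RunGroup = Callable[[List[str]], Tuple[bool, Set[str]]]


def generate_whitelist(
    all_points: Iterable[str],
    run_group: RunGroup,
    group_size: int = 100,      # assumed value, not from the original test
    max_iterations: int = 5,
) -> Tuple[List[str], List[str], List[str]]:
    pending = list(all_points)  # stands in for probes.pending
    passed: List[str] = []      # probes.passed (the whitelist)
    failed: List[str] = []      # probes.failed
    untriggered: List[str] = [] # probes.untriggered

    iteration = 0
    while True:
        iteration += 1
        # Steps 1 and 2: test group by group until pending is empty.
        while pending:
            group, pending = pending[:group_size], pending[group_size:]
            crashed, triggered = run_group(group)
            for point in group:
                if point in triggered:
                    passed.append(point)
                elif crashed:
                    failed.append(point)
                else:
                    untriggered.append(point)

        # Step 3: stop at the iteration limit, otherwise refill pending.
        if iteration >= max_iterations:
            break
        if failed:
            pending, failed = failed, []
        elif untriggered:
            pending, untriggered = untriggered, []
        else:
            break

    return passed, failed, untriggered
```

Note that a point lands in the failed list only because its group
crashed before it was recorded, so retesting the failed list in later
iterations is what narrows it down.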

To be able to resume after a crash, the test registers itself as the
last system service when it starts and unregisters itself when it
finishes.

I also use a script on a remote server to restart the test machine
automatically if it crashes.

Usage:
  runtest whitelist.exp
  * Please remove "/stp_genwhitelist_running" first if you want to
    restart the test from scratch.

Rough results on my 2.6.18/ppc64 system:
root:/home/root/testsuite/systemtap.stress>wc -l probes.*
 13209 probes.all         * all input probe points
     2 probes.failed      * caused a crash and not recorded in probe.out
  1146 probes.passed      * whitelist, recorded at least once
 12061 probes.untriggered * probed without a crash but not recorded
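
As a quick consistency check on the counts above, the following Python
snippet (an illustration only, with the numbers copied from the wc
output) confirms that the three result files add up to probes.all and
computes the share of each category.

```python
def summarize(counts):
    """Return (shares, total) where shares maps each file to its fraction."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return shares, total


# Counts copied from the wc -l output above.
counts = {
    "probes.failed": 2,
    "probes.passed": 1146,
    "probes.untriggered": 12061,
}

shares, total = summarize(counts)
print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```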

Some open problems:
1) How to choose a workload that triggers as many probe points as
   possible? Currently only about 9% of them are triggered when using
   "runltp -t 60s".
2) How to set proper group sizes for the different iteration
   levels? A smaller group size helps pin down the probe points that
   really crash the system, but also makes the test run longer.
   After all, there are 13209 probe points for kernel.function("*")
   and 31132 for kernel.inline("*").
   By the way, do we need to handle module("*").function("*")?
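
To make the group-size trade-off in problem 2 concrete, here is a small
Python sketch of one possible policy: halve the group size at each
iteration level. The policy, the function names and the starting size
of 256 are all assumptions for illustration; the test itself does not
necessarily work this way.

```python
import math


def group_size_for(level, initial_size=256, minimum_size=1):
    """Hypothetical policy: halve the group size at each iteration level.

    level is the zero-based iteration level.
    """
    return max(minimum_size, initial_size >> level)


def runs_needed(point_count, group_size):
    """Number of probe runs needed to cover point_count points."""
    return math.ceil(point_count / group_size)


# Number of runs for the first iteration over kernel.function("*").
first_level_runs = runs_needed(13209, group_size_for(0))
print(first_level_runs)
```

With a 60-second workload per run, the number of runs times 60 seconds
gives a lower bound on the running time of a level, which is why the
group size matters so much for the first pass over all probe points.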

Any comments? Thanks.

-Guijian

Attachment: genwhitelist.1017.tgz
Description: Binary data

