Pages of generic tracing text may give you enough information for exploring a system. With systemtap, it is possible to analyze that data, to filter, aggregate, transform, and summarize it. Different probes can work together to share data. Probe handlers can use a rich set of control constructs to describe algorithms, with a syntax taken roughly from awk. With these tools, systemtap scripts can focus on a specific question and provide a compact response: no grep needed.
Most systemtap scripts include conditionals, to limit tracing or other logic to those processes or users or whatever of interest. The syntax is simple:
if (EXPR) STATEMENT [else STATEMENT] | if/else statement |
while (EXPR) STATEMENT | while loop |
for (A; B; C) STATEMENT | for loop |
Scripts may use break/continue as in C. Probe handlers can return early using next as in awk. Blocks of statements are enclosed in { and }. In systemtap, the semicolon (;) is accepted as a null statement rather than as a statement terminator, so is only rarely2 necessary. Shell-style (#), C-style (/* */), and C++-style (//) comments are all accepted.
Expressions look like C or awk, and support the usual operators, precedences, and numeric literals. Strings are treated as atomic values rather than arrays of characters. String concatenation is done with the dot ("a" . "b"). Some examples:
(uid() > 100) | probably an ordinary user |
(execname() == "sed") | current process is sed |
(cpu() == 0 && gettimeofday_s() > 1140498000) | after Feb. 21, 2006, on CPU 0 |
"hello" . " " . "world" | a string in three easy pieces |
Variables may be used as well. Just pick a name, assign to it, and use it in expressions. They are automatically initialized and declared. The type of each identifier – string vs. number – is automatically inferred by systemtap from the kinds of operators and literals used on it. Any inconsistencies will be reported as errors. Conversion between string and number types is done through explicit function calls.
foo = gettimeofday_s() | foo is a number |
bar = "/usr/bin/" . execname() | bar is a string |
c++ | c is a number |
s = sprint(2345) | s becomes the string ”2345” |
By default, variables are local to the probe they are used in. That is, they are initialized, used, and disposed of at each probe handler invocation. To share variables between probes, declare them global anywhere in the script. Because of possible concurrency (multiple probe handlers running on different CPUs), each global variable used by a probe is automatically read- or write-locked while the handler is running.
# cat timer-jiffies.stp global count_jiffies, count_ms probe timer.jiffies(100) { count_jiffies ++ } probe timer.ms(100) { count_ms ++ } probe timer.ms(12345) { hz=(1000*count_jiffies) / count_ms printf ("jiffies:ms ratio %d:%d => CONFIG_HZ=%d\n", count_jiffies, count_ms, hz) exit () } # stap timer-jiffies.stp jiffies:ms ratio 30:123 => CONFIG_HZ=243
A class of special “target variables” allow access to the probe point context. In a symbolic debugger, when you’re stopped at a breakpoint, you can print values from the program’s context. In systemtap scripts, for those probe points that match with specific executable point (rather than an asynchronous event like a timer), you can do the same.
In addition, you can take their address (the & operator), pretty-print structures (the $ and $$ suffix), pretty-print multiple variables in scope (the $$vars and related variables), or cast pointers to their types (the @cast operator), or test their existence / resolvability (the @defined operator). Read about these in the manual pages.
To know which variables are likely to be available, you will need to be familiar with the kernel source you are probing. In addition, you will need to check that the compiler has not optimized those values into unreachable nonexistence. You can use stap -L PROBEPOINT to enumerate the variables available there.
Let’s say that you are trying to trace filesystem reads/writes to a particular device/inode. From your knowledge of the kernel, you know that two functions of interest could be vfs_read and vfs_write. Each takes a struct file * argument, inside there is either a struct dentry * or struct path * which has a struct dentry *. The struct dentry * contains a struct inode *, and so on. Systemtap allows limited dereferencing of such pointer chains. Two functions, user_string and kernel_string, can copy char * target variables into systemtap strings. Figure 5 demonstrates one way to monitor a particular file (identified by device number and inode number). The script selects the appropriate variants of dev_nr andinode_nr based on the kernel version. This example also demonstrates passing numeric command-line arguments ($1 etc.) into scripts.
# cat inode-watch.stp probe kernel.function ("vfs_write"), kernel.function ("vfs_read") { if (@defined($file->f_path->dentry)) { dev_nr = $file->f_path->dentry->d_inode->i_sb->s_dev inode_nr = $file->f_path->dentry->d_inode->i_ino } else { dev_nr = $file->f_dentry->d_inode->i_sb->s_dev inode_nr = $file->f_dentry->d_inode->i_ino } if (dev_nr == ($1 << 20 | $2) # major/minor device && inode_nr == $3) printf ("%s(%d) %s 0x%x/%u\n", execname(), pid(), ppfunc(), dev_nr, inode_nr) } # stat -c "%D %i" /etc/crontab fd03 133099 # stap inode-watch.stp 0xfd 3 133099 more(30789) vfs_read 0xfd00003/133099 more(30789) vfs_read 0xfd00003/133099
Functions are conveniently packaged reusable software: it would be a shame to have to duplicate a complex condition expression or logging directive in every placed it’s used. So, systemtap lets you define functions of your own. Like global variables, systemtap functions may be defined anywhere in the script. They may take any number of string or numeric arguments (by value), and may return a single string or number. The parameter types are inferred as for ordinary variables, and must be consistent throughout the program. Local and global script variables are available, but target variables are not. That’s because there is no specific debugging-level context associated with a function.
A function is defined with the keyword function followed by a name. Then comes a comma-separated formal argument list (just a list of variable names). The { }-enclosed body consists of any list of statements, including expressions that call functions. Recursion is possible, up to a nesting depth limit. Figure 6 displays function syntax.
# Red Hat convention; see /etc/login.defs UID_MIN function system_uid_p (u) { return u < 500 } # kernel device number assembly macro function makedev (major,minor) { return major << 20 | minor } function trace_common () { printf("%d %s(%d)", gettimeofday_s(), execname(), pid()) # no return value necessary } function fibonacci (i) { if (i < 1) return 0 else if (i < 2) return 1 else return fibonacci(i-1) + fibonacci(i-2) }
Often, probes will want to share data that cannot be represented as a simple scalar value. Much data is naturally tabular in nature, indexed by some tuple of thread numbers, processor ids, names, time, and so on. Systemtap offers associative arrays for this purpose. These arrays are implemented as hash tables with a maximum size that is fixed at startup. Because they are too large to be created dynamically for individual probes handler runs, they must be declared as global.
global a | declare global scalar or array variable |
global b[400] | declare array, reserving space for up to 400 tuples |
The basic operations for arrays are setting and looking up elements. These are expressed in awk syntax: the array name followed by an opening [ bracket, a comma-separated list of index expressions, and a closing ] bracket. Each index expression may be string or numeric, as long as it is consistently typed throughout the script.
foo [4,"hello"] ++ | increment the named array slot |
processusage [uid(),execname()] ++ | update a statistic |
times [tid()] = get_cycles() | set a timestamp reference point |
delta = get_cycles() - times [tid()] | compute a timestamp delta |
Array elements that have not been set may be fetched, and return a dummy null value (zero or an empty string) as appropriate. However, assigning a null value does not delete the element: an explicit delete statement is required. Systemtap provides syntactic sugar for these operations, in the form of explicit membership testing and deletion.
if ([4,"hello"] in foo) { } | membership test |
delete times[tid()] | deletion of a single element |
delete times | deletion of all elements |
One final and important operation is iteration over arrays. This uses the keyword foreach. Like awk, this creates a loop that iterates over key tuples of an array, not just values. In addition, the iteration may be sorted by any single key or the value by adding an extra + or - code.
The break and continue statements work inside foreach loops, too. Since arrays can be large but probe handlers must not run for long, it is a good idea to exit iteration early if possible. The limit option in the foreach expression is one way. For simplicity, systemtap forbids any modification of an array while it is being iterated using a foreach.
foreach (x = [a,b] in foo) { fuss_with(x) } | simple loop in arbitrary sequence |
foreach ([a,b] in foo+ limit 5) { } | loop in increasing sequence of value, stop after 5 |
foreach ([a-,b] in foo) { } | loop in decreasing sequence of first key |
When we said above that values can only be strings or numbers, we lied a little. There is a third type: statistics aggregates, or aggregates for short. Instances of this type are used to collect statistics on numerical values, where it is important to accumulate new data quickly (without exclusive locks) and in large volume (storing only aggregated stream statistics). This type only makes sense for global variables, and may be stored individually or as elements of an array.
To add a value to a statistics aggregate, systemtap uses the special operator <<<. Think of it like C++’s << output streamer: the left hand side object accumulates the data sample given on the right hand side. This operation is efficient (taking a shared lock) because the aggregate values are kept separately on each processor, and are only aggregated across processors on request.
a <<< delta_timestamp writes[execname()] <<< count
To read the aggregate value, special functions are available to extract a selected statistical function. The aggregate value cannot be read by simply naming it as if it were an ordinary variable. These operations take an exclusive lock on the respective globals, and should therefore be relatively rare. The simple ones are: @min, @max, @count, @avg, and @sum, and evaluate to a single number. In addition, histograms of the data stream may be extracted using the @hist_log and @hist_linear. These evaluate to a special sort of array that may at present3 only be printed.
@avg(a) | the average of all the values accumulated into a |
print(@hist_linear(a,0,100,10)) | print an “ascii art” linear histogram of the same data stream, bounds 0…100, bucket width is 10 |
@count(writes["zsh"]) | the number of times “zsh” ran the probe handler |
print(@hist_log(writes["zsh"])) | print an “ascii art” logarithmic histogram of the same data stream |
The full expressivity of the scripting language raises good questions of safety. Here is a set of Q&A: