This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.

Skeleton detailed design document

From: Vara Prasad <prasadav at us dot ibm dot com>
To: systemtap at sources dot redhat dot com
Date: Thu, 31 Mar 2005 00:48:23 -0800
Subject: Skeleton detailed design document

Frank had started an initial design document, but we have not updated it since. I have taken Frank's document and added a few more sections that I think are relevant, based on the discussions we have been having. Once we agree on the basic sections, I will commit that skeleton to CVS. I am thinking that the people driving the individual sections can fill them in and commit to CVS, so that we end up with a more complete design document.

Please let me know your thoughts on the attached Linux text document.
If people prefer inline, let me know and I can make it inline.

Architecture of systemtap: a Linux trace/probe tool

This paper outlines a proposed architecture for systemtap, a new
tool for Linux tracing and probing, provides the necessary background,
and includes a project plan.

Frank Ch. Eigler <fche@redhat.com>
Anon Y Mous <foo@dodgeit.com>
Vara Prasad <prasadav@us.ibm.com>
Will Cohen <wcohen@redhat.com>
Hien Nguyen <hien@us.ibm.com>
Martin Hunt <hunt@redhat.com>
Jim Keniston <jkenisto@us.ibm.com>
Brad Chen <brad.chen@intel.com>



MOTIVATION

A tracing and probing tool gives knowledgeable users deep insight
into what is going on inside the operating system, going well beyond
isolated tools such as netstat, ps, top, and iostat.

REQUIREMENTS

Systemtap is designed to strike a useful balance between several
requirements.

Ease of use: The tool's probe language should be simple and
compact.  The output should be available in multiple formats.  Users
should be able to reuse general scripts written by others.

Extensibility: The tool should allow subsystem experts to
provide extensions that expose interesting data in those subsystems
safely.  The tool should deal with the constant drift of kernel
versions.

Performance: The probes should execute fast enough that users
are not discouraged from their liberal use in a live system.  It
should be efficient on multiprocessor systems.  

Transparency: It should be possible for an expert to see details
of the tool's operation, so they can convince themselves of its safety
and accuracy.  The tool should itself be free software, and its intermediate
outputs should be potentially visible.

Simplicity: The tool must not take too long to develop,
document, and deploy.

Flexibility: The tool should run on a spectrum of
processor architectures and kernel versions.  Both kernel and user
space programs should be instrumentable, even in the absence of
source code.

Safety: It should live within the many constraints of operation
within the kernel.  It should prevent unintentional interference.

SYSTEMTAP PROCESSING STEPS

Systemtap is structured as a straightforward pipeline, shown in the
figure.  The steps are detailed below.
<Figure will be filled later>

TERMINOLOGY

probe point: A program location in the kernel or user
code where control is intercepted by systemtap.  This location might
be specified logically (subsystem entry point names) or physically
(function/symbol names, addresses, source coordinates).

probe handler: A subroutine written in systemtap script,
which is executed when a given probe point is hit.

PROBE DEFINITION
<This section classifies the types of probes in systemtap and
describes the ways to specify them.>

PROBE LANGUAGE
<This section describes the language constructs that are allowed
in scripts. It also specifies the grammar for the entire supported
language, so script writers can refer to it.>

The systemtap input consists of a script, written in a simple
language.  It could be flavoured much like dtrace's ``D'', itself
inspired by the old UNIX tool awk.  These resemble a simplified
C, lacking types, declarations, and indirection, but adding
associative arrays and simple string processing.  These are a good
match for systemtap's requirements.

The figure below shows a sample script that tracks file write
traffic.  It illustrates several syntactic features, such as
associative arrays, loops, and reporting directives.

Systemtap script to track file write I/O traffic
report write_page_count, page_dirtying_fns;

probe kernel.function("sys_write") {
  process_names [user.current.pid] = user.current.comm;
  active_writers [user.current.pid] = 1;
}
probe kernel.function("sys_write").return {
  active_writers [user.current.pid] = 0;
}

/* explicit page dirtying corresponds to filesystem writes */ 
probe kernel.function("set_page_dirty") {
  for (pid in active_writers) 
    write_page_count [process_names [pid]] ++;
  symbol = kernel.ksymbol [kernel.backtrace [1]];
  page_dirtying_fns [symbol] ++;
}


This input language needs considerable refinement.
Specifically, to deal with low-level interoperation with a C target,
it may need to include complex expressions containing type casting and
indirection.  To instrument C++ or Java user-level applications, the
probe language must allow object traversal, but at the same time
retain a harmonious syntax.  We believe some C-like syntactic
abstraction of the general DWARF expression concepts would suffice.

ELABORATION

Elaboration is a processing phase that analyzes the input script, and
resolves any needed symbolic references to the kernel, user programs,
or other data ``providers'' or "tapsets". This is similar to linking 
an object file with its libraries, to turn it into a self-contained 
customized executable for the current host.

References to kernel data such as function parameters, local and
global variables, functions, source locations, all need to be resolved
to actual run-time addresses.  This is most rigorously done by
processing the DWARF debugging information emitted by the compiler, in
the same way as an ordinary debugger would.  However, such debug data
processing must be transformed into an executable form strictly
ahead-of-time, so that during actual probe execution, no explicit
decoding is necessary.
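
As a rough illustration of the ahead-of-time idea, the userspace C sketch below resolves a field name to a byte offset once, and afterwards reads the field using only that offset.  Everything in it is hypothetical: struct task, the task_fields table (standing in for DWARF data), resolve_field, and read_int_at are invented for this sketch and are not part of any systemtap design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a kernel structure the script wants to read. */
struct task {
    int pid;
    char comm[16];
};

/* Stand-in for debug information: field names mapped to byte offsets. */
struct field_info {
    const char *name;
    size_t offset;
};

static const struct field_info task_fields[] = {
    { "pid",  offsetof(struct task, pid)  },
    { "comm", offsetof(struct task, comm) },
};

/* Elaboration time: resolve a field name once.  Returns 0 on success,
 * -1 if the name is unknown. */
static int resolve_field(const char *name, size_t *offset_out)
{
    for (size_t i = 0; i < sizeof task_fields / sizeof task_fields[0]; i++) {
        if (strcmp(task_fields[i].name, name) == 0) {
            *offset_out = task_fields[i].offset;
            return 0;
        }
    }
    return -1;
}

/* Probe time: no name lookup, only arithmetic with the resolved offset. */
static int read_int_at(const void *base, size_t offset)
{
    int v;
    memcpy(&v, (const char *)base + offset, sizeof v);
    return v;
}
```

In this sketch an unknown name is rejected by resolve_field before any probe runs, which is the property the elaboration phase is meant to provide.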

Debugging data contains enough information to locate inlined copies of
functions (very common in the Linux kernel), local variables, types,
and declarations beyond what are ordinarily exported to kernel
modules.  It enables placement of probe points into the interior of
functions.  Systemtap should exploit this extra access, which is
simply not possible for a proprietary package that omits debug data.


TRANSLATION

Once an entire set of probe functions is processed through the
elaboration stage, they are translated to a quantity of C code.

Briefly, each probe function is mapped to a stylized kprobe or jprobe
function.  Each systemtap operator is expanded to a block of C that
includes whatever locking and safety checks are necessary.  Each
systemtap control-flow blocking construct is mapped to a block of C
that includes runaway-prevention logic.  Each variable shared amongst
probes is mapped to an appropriate static declaration, and accesses
are protected by locks.  Each group of local variables is placed into
a synthetic call frame structure that keeps them off the tiny real
kernel stacks.
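
To make the mapping more concrete, here is a minimal userspace sketch of what the translator's output for one script loop might look like.  It is an assumption for illustration only: struct probe_context, MAX_ACTIONS, and probe_sum_to are hypothetical names, and real output would be a kprobe handler inside a kernel module.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical budget: statements one probe hit is allowed to execute. */
#define MAX_ACTIONS 1000

/* Synthetic call frame: script locals live here, not on the real stack. */
struct probe_context {
    int64_t i;         /* script local: loop counter */
    int64_t total;     /* script local: accumulator */
    unsigned actions;  /* iterations charged so far in this probe hit */
    int error;         /* nonzero once the handler has been aborted */
};

/* A script loop "for (i = 0; i < n; i++) total += i" might expand to this. */
static int64_t probe_sum_to(struct probe_context *c, int64_t n)
{
    c->total = 0;
    c->actions = 0;
    c->error = 0;
    for (c->i = 0; c->i < n; c->i++) {
        /* Runaway prevention: charge each iteration against the budget. */
        if (++c->actions > MAX_ACTIONS) {
            c->error = 1;  /* abort the handler instead of spinning */
            break;
        }
        c->total += c->i;
    }
    return c->total;
}
```

The context structure would be preallocated, for example one per CPU, so that the handler itself needs almost no kernel stack.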

Supporting the instrumentation of user-level code may be a
straightforward extension of kernel-space support.  The probe points
would need to be inserted into specific processes' executable
segments, using a mechanism yet to be built.  (The existing dprobes
inode-specific probe points are not a perfect match for the sort of
per-user instrumentation we envision.)

The generated code includes a copy of a common runtime that provides
routines for generic lookup tables, constrained memory management,
startup, shutdown, and I/O.
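
One plausible reading of "constrained memory management" is a pool whose size is fixed when the module is built, so that probe handlers never call a general-purpose allocator.  The sketch below shows that idea in plain C; POOL_SLOTS, struct pool, and the pool_* functions are hypothetical and do not describe an actual runtime API.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4  /* hypothetical capacity, fixed at module build time */

struct slot {
    struct slot *next_free;  /* free-list link; meaningful only while free */
    long value;              /* payload; stands in for a script value */
};

struct pool {
    struct slot slots[POOL_SLOTS];
    struct slot *free_list;
};

/* Thread every slot onto the free list; nothing is allocated after this. */
static void pool_init(struct pool *p)
{
    p->free_list = NULL;
    for (int i = POOL_SLOTS - 1; i >= 0; i--) {
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Returns NULL when the pool is exhausted; never blocks and never grows. */
static struct slot *pool_alloc(struct pool *p)
{
    struct slot *s = p->free_list;
    if (s != NULL)
        p->free_list = s->next_free;
    return s;
}

/* Return a slot to the pool so a later allocation can reuse it. */
static void pool_free(struct pool *p, struct slot *s)
{
    s->next_free = p->free_list;
    p->free_list = s;
}
```

A handler that receives NULL from pool_alloc would record an overflow and carry on, rather than wait for memory.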

When complete, the generated C code is compiled into a stand-alone
kernel module.  For security reasons, the module might be
cryptographically signed, so that it may be archived and later reused
here, or on another computer without a compiler installed.

RUN TIME LIBRARY
<This section gives the details of the run time library functions
available in systemtap, what the APIs are, and how they are
implemented.>

EXECUTION

To run the probes, the systemtap driver program simply loads the
kernel module using insmod.  The module will initialize itself,
insert the probes, and start accumulating data.  

Individual probes should run holding as few locks as possible.  It may
be reasonable to hold only individual spinlocks while manipulating
shared systemtap variables.  On the other hand, it is necessary to
hold no locks while calling user-context kernel functions that may
sleep, like copy_from_user, and while accessing variables like current.

Some locking policies make it possible to have race conditions amongst
probes that may run physically concurrently in a multiprocessor
system.  (Imagine distinct probes manipulating shared arrays in a
different order.)
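
If each shared array had its own lock, one conventional way to avoid the deadlock form of this problem is to always take the locks in a fixed global order, for instance by address.  The sketch below shows that policy with a toy spinlock built on C11 atomic_flag.  It is an assumption about one possible policy, not something this design prescribes; struct toy_lock, lock_pair, and unlock_pair are hypothetical names, and a kernel module would use spinlock_t instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A toy spinlock standing in for a kernel spinlock. */
struct toy_lock {
    atomic_flag held;
};

static void toy_lock_init(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

static void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin until the current holder releases */
}

static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Nonblocking attempt; returns 1 if the lock was taken, 0 otherwise. */
static int toy_lock_try(struct toy_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Acquire two locks in address order, whatever the argument order. */
static void lock_pair(struct toy_lock *a, struct toy_lock *b)
{
    if (a == b) {
        toy_lock_acquire(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_lock *t = a;
        a = b;
        b = t;
    }
    toy_lock_acquire(a);
    toy_lock_acquire(b);
}

/* Release both locks; release order does not matter for correctness. */
static void unlock_pair(struct toy_lock *a, struct toy_lock *b)
{
    toy_lock_release(a);
    if (a != b)
        toy_lock_release(b);
}
```

Because lock_pair sorts its arguments, two probes that name the same arrays in opposite orders still acquire the locks in the same sequence.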

The probe run concludes when the user sends an interrupt to the
driver, or when the probe script runs an exit primitive.  (This
primitive might simply send a SIGINT to the running user-level driver
process.)

USER TO KERNEL TRANSPORT
<This section describes in detail the requirements for a transport
in systemtap, the possible transports we considered, and why we
chose one.>

OUTPUT

Depending on the primitives used in the systemtap script, output may
flow gradually via logging streams (printk, netlink,
etc.), or in large batches (/proc files).  In some cases,
systemtap would infer the relationship between arrays, indexes, and
automatically format related results in a natural combined way.  For
example, if systemtap notices that three separate arrays are always
indexed by the same variable, in the output it can combine the three
arrays into a four-column listing (sharing the index rows).
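
The combined listing could be produced along the following lines.  This is a minimal sketch assuming the three arrays have already been matched up row by row; format_combined and the tab-separated layout are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one row per index: the key, then the value from each of three
 * arrays.  Returns the number of bytes written (excluding the
 * terminator), or -1 if the buffer is too small. */
static int format_combined(char *buf, size_t len, const int *keys,
                           const long *a, const long *b, const long *c,
                           size_t rows)
{
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';
    for (size_t r = 0; r < rows; r++) {
        int n = snprintf(buf + used, len - used, "%d\t%ld\t%ld\t%ld\n",
                         keys[r], a[r], b[r], c[r]);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Each output row shares one index value across the three arrays, which gives the four columns described above.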

Other than a simple textual form, systemtap should also be able to
emit the overall data in a structured computer-parsable form such as
XML, or into other forms easily loaded by graphics generator programs.

SECURITY

<This section addresses who can run which scripts, and whether
running a script can expose information beyond what the user is
normally authorized to see. This section does not address safety,
which is covered in the safety section.>

Because systemtap deals with kernel modules, it must be run by a user
with administrative privileges.  Similarly, by providing such a
potentially unlimited view into a running kernel, preservation of
multi-user privacy requires systemtap to be an administrative tool.

If a systemtap probe script instruments only user-level state, such as
a specific program or library running under the invoking user, then it
is safer to let a non-administrative user run that systemtap tool.  It
is possible that systemtap will detect this case during elaboration,
and permit its use via setuid.

SAFETY
<Since this tool will be running in production environments it
is extremely important to make sure running a systemtap script
doesn't destabilize the system. This section describes what is
done in systemtap to prevent that.>

There may be some concern that by placing all the safety checking
logic into the systemtap elaborator, and all the translation logic
into the system C compiler, security vulnerabilities may exist.  On
the other hand, a curious expert may inspect systemtap's generated C
probe file for security problems.

Some security concerns may be cast as an impression that a bytecode
interpreter like that in dtrace and dprobes is somehow inherently more
``secure''.  These may be addressed by observing that whatever checks
an interpreter can perform in situ can be done as well by explicit
generated C code (which has far more context available).  If the
concern then turns to hypothetical unreliability of the C
compiler, we may explain that a mature compiler that is used to build
the entire Linux distribution should be just as trustworthy when
building probe kernel modules as building the rest of the system.
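
As a small illustration of that argument, the sketch below shows two checks of the kind an interpreter would perform per operation, written as the explicit C a translator could emit instead.  The enum probe_error and the functions checked_div and checked_index are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum probe_error {
    PROBE_OK = 0,
    PROBE_DIV_BY_ZERO,
    PROBE_OVERFLOW,
    PROBE_BAD_INDEX
};

/* Script expression "a / b", expanded with explicit checks so that the
 * handler can never trap on a bad divide. */
static int64_t checked_div(int64_t a, int64_t b, enum probe_error *err)
{
    if (b == 0) {
        *err = PROBE_DIV_BY_ZERO;
        return 0;
    }
    if (a == INT64_MIN && b == -1) {  /* quotient is not representable */
        *err = PROBE_OVERFLOW;
        return 0;
    }
    return a / b;
}

/* Script expression "arr[idx]" on a fixed-size array, bounds-checked. */
static int64_t checked_index(const int64_t *arr, int64_t len, int64_t idx,
                             enum probe_error *err)
{
    if (idx < 0 || idx >= len) {
        *err = PROBE_BAD_INDEX;
        return 0;
    }
    return arr[idx];
}
```

An expert reading the generated probe file can see each of these checks directly.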

OTHER PROJECTS

<This section compares Systemtap with other equivalent projects like
dtrace, dprobes, kerninst etc.>

TAPSETS

<This section defines what tapsets are (the equivalent of dtrace
providers), how they are defined, and how they are invoked in
systemtap scripts. It also describes the steps a developer has to
take, and the interfaces to support and use, when writing a tapset.>

Systemtap providers are snippets of script code in a library, which
provide variables for use by user scripts.  They accomplish this by
implicitly adding other probe functions that manage those variables.
For example, a provider snippet may maintain a process-id to
process-name mapping table by hooking to the exec/exit system calls.
An oprofile provider may allow timer- or counter-based
probe points.
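
The process-id to process-name example might reduce to generated code along these lines.  This is a sketch under stated assumptions: the hook functions on_process_exec and on_process_exit, the table size MAX_PROCS, and lookup_name are all hypothetical, and locking is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROCS 8  /* hypothetical fixed capacity */
#define NAME_LEN 16  /* same size as the kernel's TASK_COMM_LEN */

struct pid_name {
    int pid;
    int used;
    char name[NAME_LEN];
};

static struct pid_name table[MAX_PROCS];

static struct pid_name *find_entry(int pid)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (table[i].used && table[i].pid == pid)
            return &table[i];
    return NULL;
}

/* Implicit probe on exec: record or update the name for this pid.
 * Returns 0 on success, -1 if the table is full. */
static int on_process_exec(int pid, const char *name)
{
    struct pid_name *e = find_entry(pid);

    if (e == NULL) {
        for (int i = 0; i < MAX_PROCS && e == NULL; i++)
            if (!table[i].used)
                e = &table[i];
        if (e == NULL)
            return -1;
        e->pid = pid;
        e->used = 1;
    }
    strncpy(e->name, name, NAME_LEN - 1);
    e->name[NAME_LEN - 1] = '\0';
    return 0;
}

/* Implicit probe on exit: forget the pid so its slot can be reused. */
static void on_process_exit(int pid)
{
    struct pid_name *e = find_entry(pid);

    if (e != NULL)
        e->used = 0;
}

/* What a user script's reference to the mapping would read. */
static const char *lookup_name(int pid)
{
    struct pid_name *e = find_entry(pid);

    return e != NULL ? e->name : NULL;
}
```

A user script would see only the mapping; the two hook functions correspond to the probe functions that the provider adds implicitly.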

Other types of providers may identify promising probe points, perhaps
by supplying explicit C function signatures suitable for jprobes.

Yet other types of providers may include pieces of actual C code to be
macro-expanded into the generated probe source, similarly to yacc
%% markers.  This mechanism, for use by systemtap developers, could
deal with the sorts of correctness logic that detects errors such as
making user-context kernel API calls from an interrupt context.

The whole ``provider'' concept needs considerable refinement and
extension.

KPROBES AND ENHANCEMENTS FOR SYSTEMTAP
<This section briefly describes basic kprobes. The main purpose of
this section is to describe the enhancements made to kprobes to meet
the needs of systemtap, along with their design and uses.>

RETURN PROBES
<This section describes how return probes are implemented and 
alternatives considered.>

MULTIPLE PROBES
<This section describes how multiple probes at a given probe
address are implemented, and what choices were considered before
choosing the implemented design.>

PROJECT PLAN
<This section describes how we plan to release this project in
phases, and which features each phase will contain.>

Follow-Ups:
- Re: Skeleton detailed design document
  - From: Frank Ch. Eigler
