This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.
Re: architecture paper draft

From: "Frank Ch. Eigler" <fche at redhat dot com>
To: Vara Prasad <prasadav at us dot ibm dot com>
Cc: systemtap at sources dot redhat dot com
Date: Thu, 3 Feb 2005 08:41:57 -0500
Subject: Re: architecture paper draft
References: <20050127212504.GH22921@redhat.com> <4201EFCA.2080501@us.ibm.com>
Hi, Vara -


> I am not sure if my original comments reached through listserver or 
> not. [...]

Sourceware is configured to block certain types of attachments from
casual email.  Everything works nicely if we keep with simple ASCII
text.

> Here are some of my comments on the draft paper.

Thank you for such an in-depth reading!


> [...]
> General: Paper doesn't seem to refer to systemtap as SystemTAP, the original
> name we started with. 

While I am unfond of CaMeLcApS, if others insist, fine with me as a title.
It would be painful to read intermingled capitals throughout ordinary
body text though.

> Paper also seems to use breakpoint to refer to a probe location.  I
> think it would be less confusing if we use probe point rather than
> break point.

OK.

> In the motivation section you mentioned "Red Hat is forming a new 
> project named systemtap" don't you think you should include IBM here.

Whom?  Just kidding, OK.

> You mentioned "The output should be available in multiple formats".  
> Do you mean by output in text and graphics format or something else.

That's all I meant.

> In the requirements we should add [...]
> Users should be able to trace the system without needing to develop 
> their own probes.

OK.

> We should add a definitions section before we go to describe the 
> architecture. [...]

Added a "Terminology" section before "Probe language".

> Probe Group: A group of probes in a given functional area of the kernel.
>                 For example probe handlers for all the system calls.

I suggest leaving this one out until the provider concept solidifies more.
If "probe group" ends up being a simple "group of probe points", then a
specific definition is not needed.

> How do we uniquely identify a probe in the system.
> <probe group>:<function >:<where>
> examples for kernel probes would be syscall:write:entry, vm:pagin:: etc.

If people are fonder of colons than periods as a separator for this,
I don't mind switching over.  I think that dtrace's fixed three-tuple
is too constricting and that we should design a way of using 
potentially deep namespaces.  C's normal nested-structure/field
dotted syntax seems like a good match for this.
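
To make the "deep namespace" idea concrete, here is a minimal sketch of how a dotted probe point name of arbitrary depth could be split into its components. Everything in it is an assumption for illustration: the struct probe_point type, the parse_probe_point function, and the two limits are hypothetical, and "syscall.write.entry" is simply the colon example above rewritten with periods.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 8   /* hypothetical limit on namespace depth */
#define MAX_NAME_LEN   32  /* hypothetical limit on one component's length */

struct probe_point {
    size_t depth;                               /* number of components found */
    char   name[MAX_COMPONENTS][MAX_NAME_LEN];  /* components, outermost first */
};

/* Split a dotted name such as "syscall.write.entry" into its components.
   Returns 0 on success, -1 if the name is empty, contains an empty
   component, is too deep, or has a component that is too long. */
int parse_probe_point(const char *spec, struct probe_point *out)
{
    const char *start = spec;

    assert(spec != NULL && out != NULL);
    out->depth = 0;
    for (;;) {
        const char *dot = strchr(start, '.');
        size_t len = dot ? (size_t)(dot - start) : strlen(start);

        if (len == 0 || len >= MAX_NAME_LEN || out->depth == MAX_COMPONENTS)
            return -1;
        memcpy(out->name[out->depth], start, len);
        out->name[out->depth][len] = '\0';
        out->depth++;
        if (dot == NULL)
            return 0;          /* last component consumed */
        start = dot + 1;       /* continue after the separator */
    }
}
```

Unlike a fixed three-tuple, nothing here cares whether a name has two components or six; only the separator character would change if colons were preferred.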

> One issue with the above notation is how do we specify the probes for
> user space. From an implementation point of view inode and offset 
> uniquely identifies a place to put the trap instruction [...]

... and I suspect this may still be valuable to retain, as one
"coordinate system" for specifying probe points.

> from the user's point of view it should be more in terms of the process. [...]

That too, though when referring to points within a process,

>  [...]
> Another related area that we need to have some details in the spec 
> is user space probes and how we handle them [...]

We can have such a section as soon as a plausible design is offered.

> Probe language
> I like the idea of probe as the keyword rather than the break, 
> similarly for globals identify a section called globals [...]

OK.  With a real parser (coming soon), such details will be
straightforward to experiment with.

> The syntax illustrated in the probe language section is more "C" 
> like than script like.
> I like the syntax used in the dtr proof of concept more than the 
> one in the probe language section.

I must say I don't quite understand this.

The "execsnoop", "fork", and "shellsnoop" dtr samples are
totally C and not script-like at all, right down to C pointer
manipulation and declarations.  The remainder ("test1", "test2")
would look as simple or simpler in systemtap.

> I would prefer to use self or this instead of user to refer to the 
> caller.

The "user." prefix was meant to denote access to user-context data
(which is not proper to access from some kernel contexts), not
merely the caller of the function being probed.

> For frequently used datastructures like current and pid we should
> have a macro to refer to them like $CURRENT, $PID etc.

Rather than a dollar-syntax, I would prefer the provider concept to
provide a variable namespace similar to the probe point namespace,
where abstract names like "user.current.comm" would map to C code
that expands to a code fragment vaguely like

    string value;
    if (in_user_context ()) strncpy (value, current->comm, MAXSTRINGLEN);
    else { context_errors ++; value[0] = '\0'; }

This would be what might be termed a "data provider" rather than
a "probe point provider".
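
A compilable user-space version of that fragment might look as follows. It is only a sketch: the fake task structure, the user_context flag standing in for in_user_context (), the value of MAXSTRINGLEN, and the function name get_user_current_comm are all assumptions made so that it builds outside the kernel.

```c
#include <assert.h>
#include <string.h>

#define MAXSTRINGLEN 64   /* assumed size; no value has been settled */

/* User-space stand-ins for kernel state (all hypothetical). */
struct task { char comm[16]; };
static struct task fake_task = { "bash" };
static struct task *current_task = &fake_task;  /* plays the role of "current" */
static int user_context = 1;                    /* plays the role of in_user_context () */
static unsigned long context_errors = 0;

/* Possible expansion of the abstract name "user.current.comm": copy the
   command name if user-context data may be touched, otherwise count an
   error and yield the empty string. */
void get_user_current_comm(char value[MAXSTRINGLEN])
{
    assert(value != NULL);
    if (user_context) {
        strncpy(value, current_task->comm, MAXSTRINGLEN - 1);
        value[MAXSTRINGLEN - 1] = '\0';  /* strncpy does not always terminate */
    } else {
        context_errors++;
        value[0] = '\0';
    }
}
```

The point of the shape is that the script author writes only the abstract name, while the context check and the error accounting come from the provider.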

> We should also provide constructs like iterators to traverse lists 
> of datastructures in the kernel that probes might access. [...]

Good idea.  How might such iterators look in the probe language?
What would they expand to in C?  (Remember, there are several types
of list structures in the kernel.)

> I think we should also look at scripting language like perl or 
> python and adapt extensions from those, These languages have
> all the characteristics of awk from our needs point of view.

Perhaps, but if we are staying with the idea of translating to C,
those languages are a much bigger "target" to parse.  If you are
contemplating putting a python or perl interpreter into the kernel,
well ...

> Using awk would make people feel we are copying Sun, just a thought.

That would not bother me at all.

> The paper only seems to deal with function entry and exit probes 
> [...]
> they need to be able to add probes in the middle of the function. We 
> have to mention what [...]

I will emphasize it as one of the key features enabled by parsing of
debug information.


> Elaboration:
> The definition of SystemTAP provider in this section is not precise, 
> one example refers to access to global data other example refers to
> probe location. [...]

Indeed.  See my distinction of "data provider" above.


> Translation
> What is a runaway-prevention logic?  You mean infinite loops.

Yes, plus recursion, or simply excessive processing.  "Excessiveness"
would be a criterion that limits the execution duration of probe
functions to some quantity (time or abstract steps) that may be
set small enough to limit the impact of probes on the system.  The
theory is that even long iterations over, say, accumulated statistics
arrays are harmful, and that it might be better to abort the
computation and suggest rerunning the whole systemtap script
at a different session or with different predicates.

I envision such a time or step-limit quantity being specified at
systemtap translation or startup time, and being implemented by
explicit counters around control structures in the emitted code.
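
As a rough illustration of such counters, translated code for a loop over a statistics array might resemble the sketch below. The names (struct probe_context, probe_sum, MAX_STEPS) and the budget of 1000 steps are hypothetical; a real translator would presumably also instrument recursion and nested loops.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STEPS 1000   /* hypothetical budget, fixed at translation or startup time */

struct probe_context {
    unsigned long steps;     /* abstract steps used by the current invocation */
    unsigned long overruns;  /* invocations aborted for exceeding the budget */
};

/* What a translated "sum all entries of a statistics array" loop might
   look like.  Returns 0 on completion, -1 if the step budget ran out. */
int probe_sum(struct probe_context *c, const long *stats, size_t n, long *sum)
{
    size_t i;

    assert(c != NULL && sum != NULL);
    c->steps = 0;
    *sum = 0;
    for (i = 0; i < n; i++) {
        if (++c->steps > MAX_STEPS) {  /* counter emitted around the loop body */
            c->overruns++;
            return -1;                 /* abort rather than burden the system */
        }
        *sum += stats[i];
    }
    return 0;
}
```

On an abort, the driver could report the overrun count and suggest rerunning with a narrower predicate.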

> Paper mentions variables shared among probes, i am thinking you are 
> referring to globals of the probe module here, am i right.

Yes.

> Paper also mentions "Each group of local variables is placed into a 
> synthetic call frame structure that keeps them off the tiny real
> kernel stacks.". It is not clear to me which local variables
> you are referring to. 

Those variables and intermediate values that are used in systemtap
probe functions but are not global.

> What do you mean by "tiny real kernel stacks". 
> [...]

I was referring to the unfortunate fact that stack space in the kernel
is very scarce - 4-8K total on some architectures.  The translated code
cannot assume it can put even a couple of string buffers on the stack.
Rather, I envision the translator emitting explicit structs that simulate
those stack frames, which would be allocated statically or on the heap.
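
For instance, the locals of one probe function could be gathered into a struct like the one sketched here. The struct layout, the names probe_frame and probe_body, and the buffer size are assumptions, and the sketch ignores concurrency (a real module would presumably need one frame per CPU or some locking).

```c
#include <assert.h>
#include <string.h>

#define MAXSTRINGLEN 128   /* assumed buffer size */

/* Locals of one hypothetical probe function, gathered into a struct. */
struct probe_frame {
    char name[MAXSTRINGLEN];
    char message[MAXSTRINGLEN];
    long count;
};

/* Statically allocated, so the string buffers consume no kernel stack. */
static struct probe_frame frame;

/* The translated probe body reaches its "locals" only through the frame
   pointer; the only real stack use is the two pointer arguments. */
void probe_body(struct probe_frame *f, const char *name)
{
    assert(f != NULL && name != NULL);
    strncpy(f->name, name, MAXSTRINGLEN - 1);
    f->name[MAXSTRINGLEN - 1] = '\0';
    f->count++;
    strcpy(f->message, "hit: ");
    strncat(f->message, f->name, MAXSTRINGLEN - strlen(f->message) - 1);
}
```

Two buffers of this size on the real stack would already be a noticeable fraction of a 4K kernel stack, which is the motivation for moving them out.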

> [...]
> This section doesn't address how we are addressing the issue of 
> statically defined structures which are passed in as arguments
> to the probing function and we are generating a jprobes module.

You mean how type declarations would be found?  Indeed: that is one
of my sources of unease for the jprobes style.  With kprobes at least,
the situation is clear: dwarf data must be read and used to traverse
structures.  The script syntax needed to express that is still up in
the air.


> Output
> I think getting the output via printk in the message log makes it 
> difficult to parse the output, and display in a readable form.
> I would strongly suggest staying away from the printk approach.

I agree completely.

> Paper mentions that systemtap infers relationship between arrays, 
> where do you think we can do this in user space or kernel and i
> think we need to provide more details of how is implemented.

I envision the systemtap driver program making guesses about
relationships between the various global/reportable variables.
After the kernel module dumps its shutdown-time state, the
user-space driver program would read and format this data.
I don't know how precisely the formatting heuristics might work.

> Formatted output paper mentioned, like XML, i am assuming is done in 
> user space not in Kernel, community will not approve XML formatting
> in kernel.

Well, anything reasonably easy and unambiguous to generate in the
kernel & parse in user space would do.  I'm not sure XML is unsuitable
for the bulk data, especially given that compactness is not such a big
concern.  What internal format do you suggest?


> Security
> I am not sure how many users really like to analyze their programs 
> without admin privileges, hence i suggest systemtap should be
> runnable only by root users.

On the other hand, there are some customer requests for probing by
unprivileged users too.  If we can think of some way to enable that
safely, that would be awesome.

> [...]  We should provide more specifics with examples of how we address
> the safety issues for common problems like divide by zero, accessing
> address space outside the scope of the process, etc. 

The answer, for the first few of these, is that the translator would
emit checking code in line with any of these dangerous little operators.
An "a / b" operation in the probe script would be translated to something
like

    long a, b, result;
    /* ... */
    if (b == 0) { arithmetic_errors ++; result = 0; }
    else { result = a / b; }

Something similar for pointer dereferencing (though in what way pointers
have to be exposed at the probe script level is not clear).

> I am not sure there is any concern about C compilers reliability [...]

I agree.  That was put there based on concerns from Intel.

> The main issue I think here is an interpreter has more control on the
> code being interpreted when there are errors unlike an executable code.
> [...]

That's true, but we can mitigate that by anticipating errors in the
translator and emitting explicit checks for them.


- FChE