This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: what does 'probe process(PID_OR_NAME).clone' mean?


David Smith wrote:
> Your new formulation really doesn't wash with me.
> Rather than a coherent response to your message,
> I'll just dump a bunch of related thoughts.

Thanks for this dump (although you owe me at least 2 tylenol for the
headache).

I'll comment on various parts below.

...

> The clone event is an event that the parent thread experiences before
> the child is considered to have been born and experienced any events.
...
>    The key feature of the report_clone callback is that this is the
>    opportunity to do things to the new thread before it starts to run.
>    Before this (such as at syscall entry for clone et al), there is no
>    child to get hold of.  After this, the child starts running.  At any
>    later point (such as at syscall exit from the creating call), the
>    new thread will get scheduled and (if nothing else happened) will
>    begin running in user mode.  (In the extreme, it could have run,
>    died, and then the tid you were told already reused for a completely
>    unrelated new thread.)  During report_clone, you can safely call
>    utrace_attach on the new thread and then make it stop/quiesce,
>    preventing it from doing anything once it gets started.
> 
> Another note on report_clone: this callback is not the place to do much
> with the child except to attach it.  If you want to do something with
> the child, then attach it, quiesce it, and let the rest of clone finish
> in the parent--this releases the child to be scheduled and finish its
> initial kernel-side setup.

Ah, interesting.  Now that I didn't know.  The current code is certainly
doing more than it is supposed to in report_clone.
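
Just so I'm sure I have the minimal pattern straight: the sketch below is what
I take away -- attach and quiesce the child in report_clone, and nothing else.
(The prototypes and the UTRACE_ATTACH_CREATE/UTRACE_ACTION_QUIESCE names are
from memory and stap_utrace_ops is just a placeholder, so treat this as
approximate, not as real code.)

    #include <linux/utrace.h>
    #include <linux/err.h>

    static const struct utrace_engine_ops stap_utrace_ops;   /* placeholder */

    /* Sketch only: attach + quiesce the child, defer all real setup to the
       child's own quiesce callback. */
    static u32 stap_report_clone(struct utrace_attached_engine *engine,
                                 struct task_struct *parent,
                                 unsigned long clone_flags,
                                 struct task_struct *child)
    {
            struct utrace_attached_engine *child_engine;

            child_engine = utrace_attach(child, UTRACE_ATTACH_CREATE,
                                         &stap_utrace_ops, engine->data);
            if (!IS_ERR(child_engine))
                    utrace_set_flags(child, child_engine,
                                     UTRACE_EVENT(QUIESCE)
                                     | UTRACE_ACTION_QUIESCE);

            return 0;       /* no special action requested for the parent */
    }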

Is this also true for any other events?  Currently we're using
UTRACE_EVENT({CLONE, DEATH, EXEC, SYSCALL_ENTRY, SYSCALL_EXIT}), but
this list could expand in the future.
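
(Spelled out, that mask is just the OR of those UTRACE_EVENT() bits --
something like the define below, modulo exact spelling:)

    /* The event mask the current runtime asks for, written out longhand. */
    #define STAP_UTRACE_EVENTS (UTRACE_EVENT(CLONE)         | \
                                UTRACE_EVENT(DEATH)         | \
                                UTRACE_EVENT(EXEC)          | \
                                UTRACE_EVENT(SYSCALL_ENTRY) | \
                                UTRACE_EVENT(SYSCALL_EXIT))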

> All of that discussion was about the implementation perspective that is
> Linux-centric, low-level, and per-thread, considering one thread (task
> in Linuxspeak) doing a clone operation that creates another task.  In
> common terms, this encompasses two distinct kinds of things: creation
> of additional threads within a process (pthread_create et al), and
> process creation (fork/vfork).  At the utrace level, i.e. what's
> meaningful at low level in Linux, this is distinguished by the
> clone_flags parameter to the report_clone callback.  Important bits:
> 
> * CLONE_THREAD set
>   This is a new thread in the same process; child->tgid == parent->tgid.
> * CLONE_THREAD clear
>   This child has its own new thread group; child->tgid == child->pid (tid).
>   For modern use, this is the marker of "new process" vs "new thread".
> * CLONE_VM|CLONE_VFORK both set
>   This is a vfork process creation.  The parent won't return to user
>   (or syscall exit tracing) until the child dies or execs.
>   (Technically CLONE_VFORK can be set without CLONE_VM and it causes
>   the same synchronization.)
> * CLONE_VM set
>   The child shares the address space of the parent.  When set without
>   CLONE_THREAD or CLONE_VFORK, this is (ancient, unsupported)
>   linuxthreads, or apps doing their own private clone magic (happens).
> 
> For reference, old ptrace calls it "a vfork" if CLONE_VFORK is set,
> calls it "a fork" if (clone_flags & CSIGNAL) == SIGCHLD (CSIGNAL just
> masks the exit-signal bits), and otherwise calls it "a clone".  With
> syscalls or normal glibc functions, common values are:
> 
> fork	 	-- just SIGCHLD or SIGCHLD|CLONE_*TID
> vfork		-- CLONE_VFORK | CLONE_VM | SIGCHLD
> pthread_create	-- CLONE_THREAD | CLONE_SIGHAND | CLONE_VM
> 		     | CLONE_FS | CLONE_FILES
> 		     | CLONE_SETTLS | CLONE_PARENT_SETTID
> 		     | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
> 
> Any different combination is some uncommon funny business.  (There are
> more known examples I won't go into here.)  But probably just keying on
> CLONE_THREAD is more than half the battle.

Hmm, OK - I'm seeing the difference between fork/vfork/pthread_create
here.  (But I get confused later...)
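
For my own notes, here's roughly how I'd translate that classification into
code (CLONE_*/CSIGNAL/SIGCHLD are the standard bits from the kernel headers;
the buckets are just the ones you describe, so this is a sketch, not a claim
about what ptrace does internally):

    #include <linux/sched.h>     /* CLONE_THREAD, CLONE_VM, CLONE_VFORK, CSIGNAL */
    #include <linux/signal.h>    /* SIGCHLD */

    /* Rough bucketing of clone_flags per the description above. */
    static const char *classify_clone(unsigned long clone_flags)
    {
            if (clone_flags & CLONE_THREAD)
                    return "new thread in the same process";   /* pthread_create */
            if (clone_flags & CLONE_VFORK)
                    return "vfork-style process creation";     /* parent waits */
            if ((clone_flags & CSIGNAL) == SIGCHLD)
                    return "fork-style process creation";
            if (clone_flags & CLONE_VM)
                    return "shared-VM clone (linuxthreads or private magic)";
            return "uncommon clone combination";
    }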

> For building up to the user's natural perspective on things, I like an
> organization of a few building blocks.  First, let me describe the idea
> of a "tracing group".  (For now, I'll just talk about it as a semantic
> abstraction and not get into how something would implement it per se.)
> By this I just mean a set of tasks (i.e. threads, in one or more
> processes) that you want to treat uniformly, at least in utrace
> terms.  That is, "tracing group" is the coarsest determinant of how you
> treat a thread having an event of potential interest.  In utrace terms,
> all threads in the group have the same event mask, the same ops vector,
> and possibly the same engine->data pointer.  In systemtap terms, this
> might mean all the threads for which the same probes are active in a
> given systemtap session.  The key notion is that the tracing group is
> the granularity at which we attach policy (and means of interaction,
> i.e. channels to stapio or whatnot).
> 
> In that context, I think of task creation having these components:
> 
> 1. clone event in the parent
> 
>    This is the place for up to three kinds of things to do.
>    Choices can be driven by the clone_flags and/or by inspecting
>    the kernel state of the new thread (which is shared with the parent,
>    was copied from the parent, or is freshly initialized).
> 
>    a. Decide which groups the new task will belong to.
>       i.e., if it qualifies for the group containing the parent,
>       utrace_attach it now.  Or, maybe policy says for this clone
>       we should spawn a new tracing group with a different policy.
> 
>    b. Do some cheap/nonblocking kind of notification and/or data
>       structure setup.
> 
>    c. Decide if you want to do some heavier-weight tracing on the
>       parent, and tell it to quiesce.
> 
> 2. quiesce event in the parent
> 
>    This happens if 1(c) decided it should.  (For the ptrace model, this
>    is where it just stays stopped awaiting PTRACE_CONT.)  After the
>    revamp, this will not really be different from the syscall-exit
>    event, which you might have enabled just now in the clone event
>    callback.  If you are interested in the user-level program state of
>    the parent that just forked/cloned, the kosher thing is to start
>    inspecting it here.  (The child's tid will first be visible in the
>    parent's return value register here, for example.)
> 
> 3. join-group event for the child
> 
>    This "event" is an abstract idea, not a separate thing that occurs
>    at low level.  The notion is similar to a systemtap "begin" probe.
>    The main reason I distinguish this from the clone event and the
>    child's start event (below) is to unify this way of organizing
>    things with the idea of attaching to an interesting set of processes
>    and threads already alive.  i.e., a join-group event happens when
>    you start a session that probes a thread, as well as when a thread
>    you are already probing creates another thread you choose to start
>    probing from birth.
> 
>    You can think of this as the place that installs the utrace event
>    mask for the thread, though that's intended to be implicit in the
>    idea of what a tracing group is.  This is the place where you'd
>    install any per-thread kind of tracing setup, which might include hw
>    breakpoints/watchpoints.  For the attach case, where the thread was
>    not part of an address space already represented in the tracing
>    group, this could be the place to insert breakpoints (aka uprobes).
> 
> 4. "start" event in the child
> 
>    This is not a separate low-level event, but just the first event you
>    see reported by the child.  If you said you were interested (in the
>    clone/join-group event), then this will usually be the quiesce event.
>    But note that the child's first event might be death, if it was sent
>    a SIGKILL before it had a chance to run.
> 
>    This is close to the process.start event in process.stp, but in a
>    place just slightly later where it's thoroughly kosher in utrace
>    terms.  Here is the first time it's become possible to change the new
>    thread's user_regset state.  Everything in the kernel perspective and
>    the parent's perspective about the new thread start-up has happened
>    (including CLONE_CHILD_SETTID), but the thread has yet to run its
>    first user instruction.
> 
> Now, let's describe the things that make sense to a user in terms of
> these building blocks, in the systemtap context.  I'm still just using
> this as an abstraction to describe what we do with utrace.  But it's not
> just arbitrary.  I think the "grouping" model is a good middle ground
> between the fundamental facilities we have to work with and the natural
> programmer's view for the user that we want to get to.

At this point, I'm liking your "grouping" model (although I have a few
quibbles later on).  Note that currently the grouping model doesn't
really exist - each probe has its own utrace engine, even probes on the
same pid/exe.
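
(If we did build it, I'd picture something no fancier than the hypothetical
structure below -- one shared event mask and ops vector per group instead of
one engine per probe.  Every name in it is made up.)

    #include <linux/list.h>

    /* Hypothetical sketch of a tracing group -- not existing code. */
    struct stap_tracing_group {
            unsigned long event_mask;               /* shared UTRACE_EVENT() bits      */
            const struct utrace_engine_ops *ops;    /* one ops vector for the group    */
            void *data;                             /* e.g. per-session state          */
            struct list_head members;               /* engines attached to each thread */
            /* clone/exec membership rules would hang off here as well */
    };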

> Not that it would necessarily be implemented this way, but for purposes
> of discussion imagine that we have the tracing group concept above as a
> first-class idea in systemtap, and the primitive utrace events as probe
> types.  The probe types users see are done in tapsets.  A systemtap
> session starts by creating a series of tracing groups.  I think of each
> group having a set of event rules (which implies its utrace event mask).
> In systemtap, the rules are the probes active on that group.  I'll
> describe the rules that would be implicit (i.e. in tapset code, or
> whatever) and apply in addition to (before) any script probes on the
> same specific low-level events (clone/exec).

I'd think we'd want to hide the low-level stuff from users and not
expose it at the script level, but I could be talked out of that.

> When there are any global probes on utrace events, make a group we'll
> call {global}.  (Add all live threads.)  Its rules are:
> 	clone -> child joins the group
> (Down the road there may be special utrace support to optimize the
> global tracing case over universal attach.)

This basically happens now underneath and isn't available at the script
level, but there is a bug that asks for this functionality.

> For a process(PID) probe, make a group we'll call {process PID}.
> (Add all threads of live process PID.)  Its rules are:
> 	clone with CLONE_THREAD -> child joins the group
> 	clone without CLONE_THREAD -> child leaves the group
>
> Here I take a PID of 123 to refer to the one live process with tgid 123
> at the start of the systemtap session, and not any new process that
> might come to exist during the session and happen to be assigned tgid 123.

Yep, that's the way the current code works.  The current code "sort of"
treats process(PID) like a special case of process.execname.

> For a process.execname probe, make a group we'll call {execname "foo"}.
> Its rules are:
> 	clone -> child joins the group
> 	exec -> if execname !matches "foo", leave the group

Here's my quibble.

I like the process(PID) behavior you outline above, but I'm not sure I
like the difference in behavior between it and the process.execname
behavior.

Here's a concrete example to see if I'm reading you correctly.  Assume
pid 123 points to /bin/bash and I'm doing syscall tracing.  If I'm
tracing by pid, I'm not going to get syscall events between the fork and
the exec for the child.  If I'm tracing by execname, I am going to get
the syscall events between the fork and exec.

But I certainly like the idea of tracing by 'pid' - and by 'pid' we
mean a tgid, not a tid.  So tracing a multi-threaded 'pid' would work
as a user *meant*, but not exactly as he *said*.
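
(In code terms, the membership test I have in mind for {process PID} is on the
thread group -- something like the check below -- so every thread of the
process matches even though the user typed a single number.)

    #include <linux/sched.h>

    /* Does this task belong to the traced process (thread group)? */
    static inline bool in_target_process(struct task_struct *task, pid_t target)
    {
            return task->tgid == target;    /* tgid, not the per-thread task->pid */
    }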

> When there are any process.execname probes, then there is an implicit
> global probe on exec.  In effect, {global} also has the rule:
> 	exec -> if execname matches "foo", join group {execname "foo"}

Yes.

> The probes a user wants to think about might be:
> 
> probe process.fork.any       = probe utrace.clone if (!(clone_flags & CLONE_THREAD))
> probe process.fork.fork      = probe utrace.clone if (!(clone_flags & CLONE_VM))
> probe process.vfork          = probe utrace.clone if (clone_flags & CLONE_VFORK)
> probe process.create_thread  = probe utrace.clone if (clone_flags & CLONE_THREAD)
> probe process.thread_start
> probe process.child_start
> 
> The {thread,child}_start probes would be some sort of magic for running
> in the report_quiesce callback of the new task after the report_clone
> callback in the creator decided we wanted to attach and set up to see
> that probe.

I'm lost on the difference between 'process.fork.any' and
'process.fork.fork'.  Does 'process.fork.any' include 'process.vfork'?
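
Working it through mechanically against the common flag values you listed
earlier, I get the reading below, which is part of why I ask (please tell me
if I've misread the aliases):

    /* Common flag values from above (CLONE_* / SIGCHLD from the kernel headers). */
    unsigned long fork_flags   = SIGCHLD;
    unsigned long vfork_flags  = CLONE_VFORK | CLONE_VM | SIGCHLD;
    unsigned long thread_flags = CLONE_THREAD | CLONE_SIGHAND | CLONE_VM |
                                 CLONE_FS | CLONE_FILES | CLONE_SETTLS |
                                 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
                                 CLONE_SYSVSEM;

    /* Alias condition                      fork   vfork  pthread_create */
    /* fork.any:       !(f & CLONE_THREAD)  yes    yes    no             */
    /* fork.fork:      !(f & CLONE_VM)      yes    no     no             */
    /* vfork:           (f & CLONE_VFORK)   no     yes    no             */
    /* create_thread:   (f & CLONE_THREAD)  no     no     yes            */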


One more question.  Frank and I bounced a few ideas around on irc the
other day, and we wondered: is there a good way, on UTRACE_EVENT(DEATH),
to tell the difference between a "thread" death and a "process" death?

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

