This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: boosting with preemptable kernel


Hi Quentin,

Actually, I discussed a similar idea internally with Satoshi.
And we decided the current approach is better than that, because,

Trampoline approach
- depends on the architecture. not general solution.
- increases overhead by executing trampoline code.
- is useful ONLY for booster, but is NOT useful for djprobe.

Garbage Collector (and Safety check routine)
- does NOT depends on the architecture.
- does NOT increase overhead when probing.
- is useful for not only booster but also djprobe.

Thus, I posted GC patch, and it was merged to -mm tree.
http://sources.redhat.com/ml/systemtap/2006-q4/msg00453.html

And, I know there is a weak point in the GC. It will take
a time to freeze processes. But I tried to reduce the
frequency of executing the GC by using the "dirty" flag and
some conditions. I also think we can control when it
works by introducing commit_kprobes() interface.

(If you'd like, you can copy any or all of this post to the
systemtap mailing list.)

Thanks, I Cc this to the SystemTap ML.

I didn't see your reply until about three weeks after you wrote it. I was out on Thanksgiving break when it was sent and I missed it. I still waited to reply to your note until I got my ARM kprobes work further along. I'll go ahead and reply to the mailing list too. This is my first note to the list, so I hope it gets through.

You're right, a trampoline approach is architecture-specific, but so
is much of the kprobes support code already, especially when having
to deal with jprobes and kretprobes often requiring arch support code
to be written in assembly.  But I also understand the less we have
that's architecture dependent or in assembly the better.

I disagree that an such approach increases overhead by any
significant amount and is only useful for booster code, but more on
this later down in the mail at the end.

I really disagree with the garbage collector approach.  On some
of our projects we use the Linux kernel in a soft real-time
environment.  A GC introduces sporadic indefinite postponement
problems.  It doesn't matter at all how often it runs -- once is too
much.  The basic reason for having a preemptive kernel is to reduce
indefinite postponement.  The GC is most disruptive on preemptive
kernels working against the very reason for using the preemptive
feature to begin with.  (It's so disruptive on preemptive kernels
because it calls freeze_processes() affecting the entire system.  On
non-preemptive kernels it uses the passive synchronize_sched().)

Aside from RT related impact issues the GC causes, it also degrades
poorly on SMP systems.  It causes the entire system across all CPUs
to be forced to stop executing and all held into an idle state
simultaneously.  The reason for having multiple CPUs is so that a
thread tying up one CPU doesn't impact the performance (much) of
the rest of the system.  Now we have a single thread being able to
impact the performance across all CPUs system-wide simultaneously.
(If my understanding of the GC implementation is wrong, please
correct me.  It's only from my limited understanding of reading and
following the code, not from using it.)

What I would recommend is that the GC and its hooks be able to be
conditionally enabled or disabled with an ifdef, probably in the
arch kprobes.h file like insn slot and kretprobes are now.  That way
for architectures that don't need it by having their own alternate
approach can choose not to use it.  Would that be a reasonable
suggestion?

About your last bullet mentioning djprobes, the ARM implementation
of kprobes I'm working on does have support for kretprobes and
jprobes, but not djprobes.  I have read your post from Oct '05 on
djprobes and understand it in a general way, but haven't yet wrapped
my mind fully around it.  Could you explain why such an approach
doesn't work for djprobes on x86?  Do you imagine that assertion
would hold for other non-x86 architectures as well?

Quentin

=-=-=
Below is some background on my ARM kprobes working design.  Reading
it might be helpful when discussing the above issues.  I pulled the
text out and put it down here so people wouldn't be forced to wade
through it in following the above discussion.

The basic implementation of kprobes on ARM has been complex,
regardless of approach chosen, due to the inherent nature of the
processor's design.  For some background information on the ARM
architecture, the current processors have no single-step mode and
have no "next PC" register, so there's no way to regain control of
the processor after executing instructions that modify the PC.  All
branches, calls, and other writes to the PC must be decoded and
detected at arch_prepare_kprobes() time and simulated by software at
kprobe_handler() time.  Also any instruction run from the execute
slot that reads the PC's value would give a "wrong" result.  They
must also be detected at preparation time and have its results
adjusted at execution time.

Most popular ARM instructions are capable of legally having R15
(the PC) as either a source (read) or destination (write) register,
so regardless of approach, the logic to decode many of the ARM
instructions already has to be written.  Extending it to decode and
handle the additional instructions that aren't capable of reading or
writing the PC wasn't that much additional work.

At kprobe_handler() time to execute the kprobed instruction and
return, my ARM design does not have a single generic trampoline.
I originally tried this, but found it too wasteful for my tastes
in the general case to do correctly.  (Still, it wasn't that bad.
The whole thing was about 35 machine instructions with 14 written
in assembly, but it did do a fair amount of often unnecessary
data slinging though.)  What I do instead is to have individual
trampolines tuned for specific classes of instructions.  The ARM is
a RISCish architecture.  Its instructions fall into easily decoded
classes of instructions and with its registers in specific fields
within the instruction.

The individually tuned trampolines avoid unnecessary loading up and
saving back of the pre-exception register state.  The rewritten
instructions saved in the insn slots at preparation time use
preassigned fixed registers (R0, R1, etc.).  Each tuned trampoline
assigned at preparation time loads up the necessary register state
into the specific fixed registers for its class, executes the
rewritten instruction, and writes back the changed register state.

At kprobe_handler() time, the kprobe handler calls the insn
slot handler by dereferencing its pointer saved in the
"arch_specific_insn" structure at preparation time.  The handler
loads up its zero to four registers (depending on the class of
the instruction it's written for) from the pt_regs save area and
then writes back zero to four registers.  The entire execution
handler (all pre- and post-instruction work, loading and saving
register state, and the rewritten instruction) generally takes
between 10 and 20 ARM instructions in total to run.  That number of
instructions is just in the noise for that of the kprobe handler.
Although technically it does increase the execution time of the
kprobe_handler(), the increase is so small that it cannot in any way
be considered significant.

Also with my approach, no classic boosting is ever necessary.
All instructions are effectively already "boosted" (i.e.: no
secondary breakpoint exception is ever needed or taken.)  All
kprobed instructions always complete before returning from the
primary kprobe exception.  The kprrobe exception always returns to
the instruction to be executed next after the kprobed instruction.
This design also works exactly the same way on SMP and single
processors, as well as on preemptive and non-preemptive kernels, so
no ifdef'ing is necessary.

This work may sound like a lot, but it's really not.  All decoding
and instruction rewriting software for all the ARM instructions
plus all the individually tuned trampolines when compiled fits
in less than 4KB of text space with no data space at all.  That
doesn't even account for having simpler kprobes arch support code
in arch/arm/kernel/kprobes.c.  That's smaller since it doesn't
have to account for ever handle re-entering itself via a secondary
breakpoint nor for having to support classic boost mode.

I don't consider myself a Linux kernel internals expert though.
In my approach, I may have made an invalid assumption about the
kernel's behavior somewhere, possibly in boundary cases.  This is my
biggest concern.  As my testing moves forward, I'll see if anything
turns up.

I could go into more detail of the design, but I'm sure I've bored
and confused enough people already.

Best regards,

--
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]