This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.



Re: x86_64 kprobes wart removal


> If we allocate a whole L1 cache line for each single-step scratch area,
> as you suggest below, is this still a performance concern?

I expect that would address any SMP issue, and it is obviously the right
thing to do.  But that is not what Will and I are really concerned about.
I'm talking about the hit to the icache, or whatever other internal
processor hooey is involved, from repeatedly rewriting the same spot and
then executing a spot that was only just written (and so by definition was
never partially decoded into some part of the CPU), that sort of thing.
The x86 doesn't require explicit icache flushes, with big red warning
labels on them, every time you poke a location and then want to execute it,
as many other processors do.  But that doesn't mean it isn't costly to do
so.

I seem to have lost the message in this thread where you (I think it was
you) asked about test scenarios for comparing performance.  That is what we
should get right to, instead of pontificating about how it might be (I sure
don't actually know anything about the chips' performance issues at this
level).  There are some obvious torture tests that seem to me like they
would expose a bottleneck in executing just-modified code, if there is one.
For example, write a tight loop and run it for enough iterations to time it
usefully.  Insert several probes at instructions inside the loop, doing all
the insertions just once at the beginning.  The probes needn't do anything
but return; they just need to be there so that the kprobes single-step
machinery runs (and there should be several of them, so that the copy slot
is constantly reused).  Then run the loop many times, sampling the cycle
counter before and after.  Do this with the current code and with the new
code that uses a single buffer (repeat each run many times and average,
etc).  You might or might not want to correct for other differences such as
cache alignment of the instruction copies (in your new plan the one spot
will be aligned, whereas in the current code most of the slots will wind up
misaligned).
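
The timing side of such a test could be sketched roughly as follows.  This
is only a user-space stand-in showing the shape of the measurement, not the
real thing: kprobes can only probe kernel instructions, so in an actual run
the loop would have to live in kernel code (a test module, say) with the
probes registered on addresses inside it, and the kernel's own cycle-counter
accessor used instead.  The names target_loop, read_counter, time_one_run
and mean_ticks are made up for illustration, and the clock_gettime fallback
for non-x86 builds is an assumption of this sketch.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime in strict modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Sample the x86 time-stamp counter. */
static inline uint64_t read_counter(void) { return __rdtsc(); }
#else
/* Assumed fallback for other machines: monotonic clock in nanoseconds. */
static inline uint64_t read_counter(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/*
 * Hypothetical stand-in for the tight loop that would carry the probes.
 * The volatile accumulator keeps the compiler from folding the loop away,
 * so the body stays a real load/add/store sequence on every iteration.
 * Returns the sum 0 + 1 + ... + (iterations - 1).
 */
__attribute__((noinline)) uint64_t target_loop(uint64_t iterations)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < iterations; i++)
        acc += i;
    return acc;
}

/* Time one pass over the loop, sampling the counter before and after. */
uint64_t time_one_run(uint64_t iterations)
{
    uint64_t start = read_counter();
    (void)target_loop(iterations);
    uint64_t end = read_counter();
    return end - start;
}

/* Repeat the run and average, to smooth out run-to-run noise. */
double mean_ticks(unsigned runs, uint64_t iterations)
{
    assert(runs > 0);
    double total = 0.0;
    for (unsigned r = 0; r < runs; r++)
        total += (double)time_one_run(iterations);
    return total / runs;
}
```

One would call something like mean_ticks(100, 10000000) once under each
scheme, with the same probes in place both times, and compare the two
averages.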

If overwriting a single copy location performs better, then great.  If it
performs just as well, then it's still preferable for its smaller kernel
memory footprint.  But if it turns out to perform worse, I think we should
stick with the current scheme.  (I honestly don't see any real problem with
having allocation take place at probe insertion/removal time.)
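
Spelled out as code, that decision rule might look like the sketch below.
The noise_fraction parameter is an assumption added here, since "just as
well" only means something relative to the run-to-run noise in the
measurements; the enum and function names are likewise hypothetical.

```c
#include <assert.h>

enum scheme { KEEP_PER_PROBE_SLOTS, USE_SINGLE_BUFFER };

/*
 * mean_current: average cost measured with the current per-probe slots.
 * mean_single:  average cost measured with the single shared buffer.
 * noise_fraction: assumed tolerance, e.g. 0.02 treats results within 2%
 *                 of each other as "just as well".
 */
enum scheme choose_scheme(double mean_current, double mean_single,
                          double noise_fraction)
{
    assert(mean_current > 0.0 && noise_fraction >= 0.0);

    /* Slower by more than the noise: stick with the current scheme. */
    if (mean_single > mean_current * (1.0 + noise_fraction))
        return KEEP_PER_PROBE_SLOTS;

    /* Faster, or equal within noise: the smaller footprint wins. */
    return USE_SINGLE_BUFFER;
}
```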


Thanks,
Roland

