This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Unified tracing buffer


Hi Linus,

Linus Torvalds wrote:

On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
Sure, atomic counter might be more expensive but accurate for ordering.

Don't be silly.


An atomic counter is no more accurate for ordering than anything else.

Why?

Because all it tells you is the ordering of the atomic increment, not of the caller. The atomic increment is not related to all the other ops that the code that you trace actually does in any shape or form, and so the ordering of the trace doesn't actually imply anything for the ordering of the operations you are tracing!

Except for a single CPU, of course, but for that case you don't need a sequence number either, since the ordering is entirely determined by the ring buffer itself.

So the counter will be more expensive (cross-cpu cache bouncing for EVERY SINGLE EVENT), less useful (no real meaning for people who DO want to have a timestamp), and it's really no more "ordered" than anything that bases itself on a TSC.

The fact is, you cannot order operations based on log messages unless you have a lock around the whole caller - absolutely _no_ amount of locking or atomic accesses in the log itself will guarantee ordering of the upper layers.

Indeed. If TSC(or similar time counter) can provide synchronized-time, I don't have any comment on that(AFAIK, latest x86 and ia64 can provide it). # I might be a bit nervous about Broken TSC...

And sure, if you have locking at a higher layer, then a sequence number is sufficient, but on the other hand, so is a well-synchronized TSC.

So personally, I think that the optimal solution is:

- let each ring buffer be associated with a "gettimestamp()" function, so that everybody _can_ set it to something of their own. But default to something sane, namely a raw TSC thing.

I agree, default to TSC is enough.


- Add synchronization events to the ring buffer often enough that you can make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply noticing when the upper 32 bits change, although you could possibly do it with a heartbeat too.

- Similarly, add a synchronization event when the TSC frequency changes.

- Make the synchronization packet contain the full 64-bit TSC base, in addition to TSC frequency info _and_ the timebase.

- From those synchronization events, you should be able to get a very accurate timestamp *after* the fact from the raw TSC numbers (ie do all the scaling not when you gather the info, but when you present it), even if you only spent 32 bits of TSC info on 99% of all events (an just had a overflow log occasionally to get the rest of the info)

- Most people will be _way_ happier with a timestamp that has enough precision to also show ordering (assuming that the caller holds a lock over the operation _including_ the tracing) than they would ever be with a sequence number.

- people who really want to can consider the incrementing counter a TSC, but it will suck in so many ways that I bet it will not be very popular at all. But having the option to set a special timestamp function will
give people the option (on a per-buffer level) to make the "TSC" be a simple incrementing 32-bit counter using xaddl and the upper bits incrementing from a timer, but keep that as a "ok, the TSC is really broken, or this architecture doesn't support any fast cycle counters at all, or I really don't care about time, just sequence, and I guarantee I have a single lock in all callers that makes things unambiguous"

Thank you very much for giving me a good idea! I agree with you.

Note the "single lock" part. It's not enough that you make any trace thing under a lock. They must be under the _same_ lock for all relevant events for you to be able to say anything about ordering. And that's actually pretty rare for any complex behavior.

The timestamping, btw, is likely the most important part of the whole logging thing. So we need to get it right. But by "right" I mean really really low-latency so that it's acceptable to everybody, real-time enough that you can tell how far apart events were, and precise enough that you really _can_ see ordering.

The "raw TSC value with correction information" should be able to give you all of that. At least on x86. On some platforms, the TSC may not give you enough resolution to get reasonable guesses on event ordering.

Linus

-- Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]