This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
Re: Proposed systemtap access to perfmon hardware
Frank Ch. Eigler wrote:
wcohen wrote:
To get a feel for how the performance monitoring hardware
support would work in SystemTap, I wrote some simple examples.
Nice work. To flesh out the operational model (and please correct me
if I'm wrong): the way this stuff would all work is:
- The systemtap translator would be linked with libpfm from perfmon2.
(libpfm license is friendly.)
The libpfm library license is an MIT license, so it should be
compatible with the systemtap licensing.
- This library would be used at translation time to map perfmon.* probe
point specifications to PMC register descriptions (pfmlib_output_param_t).
(This will require telling the system the exact target cpu type for
cross-instrumentation.)
Yes, this complicates the cross-kernel case (building instrumentation
on one system and running it on another). The two systems could even
have different processor architectures. Some performance monitoring
tools, such as PAPI, have mappings for some generic event names, which
might help in some cases. However, some architectural differences just
do not translate to the generic models.
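As a rough illustration of the translation-time mapping discussed above, here is a minimal C sketch. The struct and functions (`pmc_event_desc`, `lookup_event`, `emit_event_init`) are hypothetical stand-ins for what libpfm would supply. The table is per CPU model, which is why the exact target CPU type must be known for cross-instrumentation. The two encodings are Intel architectural events and are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-CPU-model event description. In the proposal this data
 * would come from libpfm (pfmlib_output_param_t), not a hand-written table. */
struct pmc_event_desc {
    const char *name;       /* event name used in the perfmon.* probe point */
    unsigned int event_sel; /* event-select code for the target CPU */
    unsigned int umask;     /* unit mask qualifying the event */
};

/* Illustrative table for one target CPU type, using the Intel
 * architectural-event encodings for cycles and retired instructions. */
static const struct pmc_event_desc event_table[] = {
    { "UNHALTED_CORE_CYCLES", 0x3c, 0x00 },
    { "INSTRUCTIONS_RETIRED", 0xc0, 0x00 },
};

/* Translation-time lookup; NULL means the target CPU has no such event,
 * so the translator can reject the probe point instead of emitting code. */
static const struct pmc_event_desc *lookup_event(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof event_table / sizeof event_table[0]; i++)
        if (strcmp(event_table[i].name, name) == 0)
            return &event_table[i];
    return NULL;
}

/* Write a C initializer for the generated module into buf.
 * Returns snprintf's result, or -1 if the event is unknown. */
static int emit_event_init(char *buf, size_t len, const char *name)
{
    const struct pmc_event_desc *d = lookup_event(name);
    if (d == NULL)
        return -1;
    return snprintf(buf, len, "{ .event_sel = 0x%02x, .umask = 0x%02x }",
                    d->event_sel, d->umask);
}
```

An unknown event name fails at translation time rather than at module load, which is one reason to do the mapping in the translator.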
- These descriptions would be emitted into the C code, for actual
installation during module initialization. For our first cut, since
there appears to exist no kernel-side management API at the moment,
the C code would directly manipulate the PMC registers. (This means
no coexistence for oprofile or other concurrent perfctr probing.
C'est la vie.)
I would prefer to reuse other software to access the performance
monitoring hardware rather than write yet another piece of software
that drives it directly. We want 64-bit values, but many of the
counters are much narrower than that (32 bits). On the Pentium 4,
access to the performance counters is complicated, and I would prefer
not to reinvent that code. This mechanism would also only work for a
global (system-wide) setup; per-thread sampling, for example, would be
unsupported. We also need to translate between event names and event
numbers; the tables in OProfile and perfmon that hold all that
information, and catch events that cannot be mapped to a register, are
getting pretty large.
One advantage of generating the C code is that it would work with
existing RHEL4 kernels.
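Getting 64-bit values from narrower hardware counters usually means keeping a 64-bit software total and folding in the difference between successive raw reads, taken modulo the counter width. A minimal sketch, assuming the counter is read at least once per wrap period; `wide_counter` and its functions are hypothetical names, and the raw values would come from the real PMC access code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software state for one hardware counter that is only
 * `width` bits wide (for example 32 or 40 bits). */
struct wide_counter {
    unsigned int width; /* hardware counter width in bits, 1..63 */
    uint64_t last_raw;  /* raw value seen at the previous read */
    uint64_t total;     /* accumulated 64-bit count */
};

static void wide_counter_init(struct wide_counter *c, unsigned int width,
                              uint64_t raw)
{
    assert(width > 0 && width < 64); /* a 64-bit counter needs no widening */
    c->width = width;
    c->last_raw = raw & ((UINT64_C(1) << width) - 1);
    c->total = 0;
}

/* Fold a new raw reading into the 64-bit total. The subtraction is done
 * modulo 2^width, so one wraparound between reads is handled correctly;
 * two or more wraps between reads would be silently lost. */
static uint64_t wide_counter_update(struct wide_counter *c, uint64_t raw)
{
    uint64_t mask = (UINT64_C(1) << c->width) - 1;
    raw &= mask;
    c->total += (raw - c->last_raw) & mask;
    c->last_raw = raw;
    return c->total;
}
```

For example, a 32-bit counter that goes from 0xfffffff0 to 0x10 between reads contributes 0x20 events to the total, not a huge negative jump.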
- The "sample" type perfmon probes would map to the same kind of
dispatch/callback as the current "timer.profile": the probe handler
should have valid pt_regs available.
Yes, the pt_regs will be available to the sample type probe.
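For the "sample" type probes, the handler runs when a counter overflows. A common way to get one interrupt every N events is to preload the counter with -N truncated to the counter width, so it overflows after exactly N events. The C sketch below only computes that preload value and checks the arithmetic; `sample_preload` and `would_overflow` are hypothetical helpers, and the actual register programming and interrupt wiring are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Value to load into a `width`-bit counter so that it overflows (and
 * raises the sampling interrupt) after exactly `period` more events:
 * 2^width - period, i.e. -period truncated to the counter width. */
static uint64_t sample_preload(unsigned int width, uint64_t period)
{
    assert(width > 0 && width < 64);
    uint64_t mask = (UINT64_C(1) << width) - 1;
    assert(period > 0 && period <= mask);
    return (UINT64_C(0) - period) & mask;
}

/* Simulated check: does counting `events` events from `start` carry the
 * counter past its maximum value (i.e. overflow)? */
static int would_overflow(unsigned int width, uint64_t start, uint64_t events)
{
    uint64_t mask = (UINT64_C(1) << width) - 1;
    return start + events > mask;
}
```

With a 32-bit counter and a period of 100000 the preload is 0xfffe7960; the counter overflows on the 100000th event and the handler would re-arm it with the same value.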
- The free-running type perfmon probes, probably named
"perfctr.SPEC.setup" or ".start" or ".begin" would map to a one-time
initialization that passes a token (PMC counter number?) to the
handler. Other probe handlers can then query/manipulate the
free-running counter using that number via the start/stop/query
functions.
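The token-based model for free-running counters can be sketched in C as follows. Everything here is hypothetical: the slot table simulates the hardware, and `perfctr_setup`, `perfctr_start`, `perfctr_stop`, and `perfctr_query` are illustrative names rather than an existing API. The point is only the shape: setup runs once and returns a counter number, and other handlers pass that number back.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COUNTERS 4 /* hypothetical number of PMC slots */

/* Hypothetical bookkeeping for one free-running counter. A real module
 * would program and read the PMC registers instead of these fields. */
struct perfctr_slot {
    int in_use;
    int running;
    uint64_t count;
};

static struct perfctr_slot slots[MAX_COUNTERS];

static void check_token(int token)
{
    assert(token >= 0 && token < MAX_COUNTERS && slots[token].in_use);
}

/* One-time ".setup" work: claim a free slot and hand its number back as
 * the token, or return -1 if every slot is taken. */
static int perfctr_setup(void)
{
    for (int i = 0; i < MAX_COUNTERS; i++) {
        if (!slots[i].in_use) {
            slots[i] = (struct perfctr_slot){ .in_use = 1 };
            return i;
        }
    }
    return -1;
}

/* Functions other probe handlers call with the token. */
static void perfctr_start(int token) { check_token(token); slots[token].running = 1; }
static void perfctr_stop(int token) { check_token(token); slots[token].running = 0; }
static uint64_t perfctr_query(int token) { check_token(token); return slots[token].count; }

/* Stand-in for the hardware: events only count while the counter runs. */
static void simulate_events(int token, uint64_t n)
{
    check_token(token);
    if (slots[token].running)
        slots[token].count += n;
}
```

Events that arrive after `perfctr_stop` are not counted, so a query then returns only what accumulated while the counter was running.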
Is that sufficiently detailed to begin an implementation?
Pretty close. The one thing that isn't answered is the division of
labor for the sampling probes: one-time setup versus the sample handler.
We want some handle stored in a global variable for the probe, but we do
not want to execute that setup every time a sample is collected. For
the free-running probes, it is pretty clear how to handle the samples.
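One way to split the work for sampling probes, as a minimal sketch under the assumption that setup runs once at module initialization and the handler runs on every sample: publish the handle in a global during setup, and have the sample handler only read it. `sample_handle`, `probe_setup`, and `probe_sample` are hypothetical names. In a real module the handler would run in interrupt context, so the handle must be fully written before sampling is enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handle for a sampling probe: which counter it owns and how
 * many events elapse between samples. */
struct sample_handle {
    int counter;
    uint64_t period;
};

/* Filled in once at module initialization; the sample handler only reads it. */
static struct sample_handle g_handle;
static int g_handle_ready;
static unsigned long g_setup_calls; /* how often setup ran (should be 1) */
static unsigned long g_samples;     /* how often the handler ran */

/* One-time setup: configure the counter and publish the handle. */
static void probe_setup(int counter, uint64_t period)
{
    assert(!g_handle_ready && period > 0);
    g_handle.counter = counter;
    g_handle.period = period;
    g_handle_ready = 1;
    g_setup_calls++;
}

/* Per-sample path: no setup work, just use the cached handle. Returns an
 * estimate of the events counted so far (samples * period). */
static uint64_t probe_sample(void)
{
    assert(g_handle_ready);
    g_samples++;
    /* ... the user's probe body would run here, with g_handle in scope ... */
    return (uint64_t)g_samples * g_handle.period;
}
```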
[...] printf ("ipc is %d.%d\n", ipc/factor, ipc % factor);
(An aside: we should have a more compact notation for this. We won't
support floating-point numbers, but integers are commonly scaled
like this. Maybe printf("%.Nf", value), where N implies a
power-of-ten scaling factor, and printf("%*f", value, scale) for
general factors.)
Yes, some scaling mechanism would be nice in some cases. IPC was
pretty likely to be around one, so I put in the scaling to give a
better picture of what is going on.
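One pitfall with the `%d.%d` idiom: the fractional part loses its leading zeros, so with `factor` = 1000 a value of 1005 prints as `1.5` rather than `1.005`. Here is a minimal C sketch of the proposed `%.Nf`-style behavior that zero-pads the fraction to N digits; `format_scaled` is a hypothetical helper, not a SystemTap function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print value / 10^digits into buf using integer math only, roughly what the
 * proposed printf("%.Nf", value) would do for an integer scaled by 10^N.
 * The fractional part is zero-padded to `digits` places. */
static int format_scaled(char *buf, size_t len, int64_t value, unsigned int digits)
{
    assert(digits <= 18); /* keep 10^digits within uint64_t */
    uint64_t factor = 1;
    for (unsigned int i = 0; i < digits; i++)
        factor *= 10;

    /* Work on the magnitude so a negative value gets a single leading '-'
     * instead of a negative remainder. */
    const char *sign = value < 0 ? "-" : "";
    uint64_t mag = value < 0 ? UINT64_C(0) - (uint64_t)value : (uint64_t)value;

    if (digits == 0)
        return snprintf(buf, len, "%s%llu", sign, (unsigned long long)mag);
    return snprintf(buf, len, "%s%llu.%0*llu", sign,
                    (unsigned long long)(mag / factor), (int)digits,
                    (unsigned long long)(mag % factor));
}
```

With `digits` = 3, the value 1005 formats as `1.005`; the `%0*llu` conversion supplies the padding that plain `%d` drops.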
-Will