This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
Re: stptracer-20060828 has released.
- From: Martin Hunt <hunt at redhat dot com>
- To: Li Guanglei <guanglei at cn dot ibm dot com>
- Cc: Masami Hiramatsu <masami dot hiramatsu dot pt at hitachi dot com>, SystemTAP <systemtap at sources dot redhat dot com>, "Jose R. Santos" <jrs at us dot ibm dot com>, Satoshi Oshima <soshima at redhat dot com>, Hideo Aoki <haoki at redhat dot com>, Yumiko Sugita <yumiko dot sugita dot yf at hitachi dot com>
- Date: Tue, 05 Sep 2006 11:12:46 -0400
- Subject: Re: stptracer-20060828 has released.
- Organization: Red Hat Inc.
- References: <44F58A51.6060806@hitachi.com> <44F69E3C.10308@cn.ibm.com>
On Thu, 2006-08-31 at 16:30 +0800, Li Guanglei wrote:
> The first thing I want to figure out about STPTracer is how it
> performs compared with LKET. STPTracer uses an interface named gBTI,
> which can only print a fixed number of integers into a pre-reserved
> relayfs buffer, while LKET uses _stp_printf() with binary print support.
> probe with an empty handler, i.e.,
> probe syscall.getsid {}
> cpu 0: loops=5000000, average=1523 ns
>
> <3> probe using lket_getsid.stp:
> cpu 0: loops=5000000, average=3079 ns
>
> <4> probe using lkst_getsid.stp:
> cpu 0: loops=5000000, average=2341 ns
>
LKET overhead = 3079 - 1523 = 1556 ns
LKST overhead = 2341 - 1523 = 818 ns
difference is 1556 - 818 = 738 ns
> Here you can see that gBTI is around 23% faster than _stp_printf().
I'd say LKST is almost twice as fast on your machine, factoring out
probe overhead.
Looking at the source code, I can't see anything that would explain the
gap between _stp_printf() and gBTI. Certainly _stp_printf() cannot be
taking 738 ns simply to print 4 integers.
> In
> fact a long time ago I used oprofile to find out where _stp_printf()
> cost cpu time, and it showed that a lot time is spent on _stp_vsnprintf().
Yeah, _stp_printf() is just a wrapper that calls _stp_vsnprintf().
When I proposed and implemented _stp_printf(), I ran a lot of benchmarks.
So I reran them, comparing the current code against a very simple, fast
gBTI-style interface without the headers and timestamping:
function stp_binary_write(n:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
	_stp_binary_write(THIS->n, THIS->arg1, THIS->arg2, THIS->arg3, THIS->arg4);
%}
static void _stp_binary_write(int num, ...)
{
	va_list vargs;
	int i;
	int64_t *args;

	if (unlikely(num > MAXBINARYARGS))
		num = MAXBINARYARGS;

	args = _stp_reserve_bytes(num * sizeof(int64_t));
	if (args != NULL) {
		va_start(vargs, num);
		for (i = 0; i < num; i++)
			args[i] = va_arg(vargs, int64_t);
		va_end(vargs);
	}
}
So here are the results on my 2.6 GHz P4.
SystemTap BENCH2 Tue Sep 5 10:52:37 EDT 2006
kernel: 2.6.17-1.2174_FC5smp i686
Fedora Core release 5 (Bordeaux)
dragon: 10:52:37 up 12 min, 3 users, load average: 0.71, 0.54, 0.38
processors: 2 (1 physical) Intel(R) Pentium(R) 4 CPU 2.60GHz
MemTotal: 2074472 kB MemFree: 1303368 kB
----------------------------------------------------------------
For comparison, function call overhead is 471 nsecs.
Times below are nanoseconds per probe and include kprobe overhead.
----------------------------------------------------------------
+--- S = Script, R = Runtime
|+-- * = Relayfs Threads
||  NAME                                   1     2
S : empty probe                          601   506
S : binary printf 4 integers (%8b)      1073   899
S : binary printf 4 integers (%4b)       997   767
S : binary printf 4 integers (%b)        953   745
S : _stp_binary_write 4 integers         988   779
So you can see a slight advantage for _stp_binary_write when you are writing
64-bit values. When the sizes are smaller, _stp_printf() has a slight advantage.
Martin