This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
[Bug runtime/10234] clean up aggregate hard-coded logic
- From: "mcermak at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: systemtap at sourceware dot org
- Date: Fri, 23 Sep 2016 14:35:26 +0000
- Subject: [Bug runtime/10234] clean up aggregate hard-coded logic
- Auto-submitted: auto-generated
- References: <bug-10234-6586@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I can verify the above patch functionality e.g. the following way: When using
the above patch, I can see that e.g. when printing only @count(), but not the
@sum(), the @sum() computations get skipped during the module preprocessing as
intended and they do not show in the module disassembly:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
$
=======
Once the @sum() computations are needed, they expectedly also appear in the
module disassembly:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 78 6f js 691 <__stp_stat_add_4+0x121>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 73 74 jae 696 <.LC17+0x496>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
61f: 00 24 00 add %ah,(%rax,%rax,1)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
61f: 00 00 add %al,(%rax)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 65 6c gs insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 65 5f gs pop %rdi
$
=======
Indeed, these computations live all in the stat-add.c, which is what the patch
from Comment #6 introduces. That's all good.
Now trying another approach based on function inlining and compiler
optimizations. First, bringing back the runtime conditions similar to Comment
#5, and inlining __stp_stat_add():
=======
$ git diff runtime/stat-common.c
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..3a0dbdf 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -288,7 +288,7 @@ static void _stp_stat_print_histogram(Hist st, stat_data
*sd)
_stp_print_flush();
}
-static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
+static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
{
int n;
int delta = 0;
@@ -305,11 +305,13 @@ static void __stp_stat_add(Hist st, stat_data *sd,
int64_t val)
sd->avg_s = val << sd->shift;
sd->_M2 = 0;
} else {
- sd->count++;
- sd->sum += val;
- if (val > sd->max)
+ if (sd->stat_ops &
(STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->count++;
+ if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->sum += val;
+ if ((sd->stat_ops & STAT_OP_MAX) && (val > sd->max))
sd->max = val;
- if (val < sd->min)
+ if ((sd->stat_ops & STAT_OP_MIN) && (val < sd->min))
sd->min = val;
/*
* Following is an optimization that improves performance
$
=======
Now, testing the above patch, optimizing the module with -O3 (added -O3 to
buildrun.cxx:477). The compiler optimization is there (the second
specification overrides the first):
=======
$ eu-readelf --debug-dump=info blah.ko | grep produce | grep '\-O' | sort -u
producer (strp) "GNU C89 6.2.1 20160901 (Red Hat
6.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387
-mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic
-mno-red-zone -mcmodel=kernel -maccumulate-outgoing-args -mfentry -march=x86-64
-g -O2 -O3 -std=gnu90 -p -fno-strict-aliasing -fno-common -falign-jumps=1
-falign-loops=1 -funit-at-a-time -fno-delete-null-pointer-checks
-fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls
-fvar-tracking-assignments -fno-strict-overflow -fconserve-stack
-freorder-blocks -fasynchronous-unwind-tables -fno-ipa-icf --param
allow-store-data-races=0"
$
=======
That looks fine, so now the real check. When @sum() is needed (println'd),
related computations should be there. They are:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4caf: b8 01 00 00 00 mov $0x1,%eax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5249: 4c 8b 56 28 mov 0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f4: 4c 8b 56 28 mov 0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cac: 66 5a pop %dx
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5246: d7 xlat %ds:(%rbx)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f1: fa cli
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cac: 50 push %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5248: 5f pop %rdi
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f1: 6c insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cad: 50 push %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5247: 30 9f 00 00 00 00 xor %bl,0x0(%rdi)
$
=======
But now, when @sum() becomes unneeded (not println'd), the related computations
do *not* disappear (where I was hoping for them to disappear):
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4caf: b8 01 00 00 00 mov $0x1,%eax
--
[ ... stuff deleted, the objdump output is very close to the one right above
one ... ]
=======
Here, the highest -O3 has been used with relatively modern
gcc-6.2.1-1.fc26.x86_64. So this doesn't work. It also doesn't improve the
performance: A brief performance comparison of the approaches:
=======
$ # The above, inlined __stp_stat_add() plus run-time conditions:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 1194232min/1194232avg/1194232max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1432136min/1432136avg/1432136max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 644615min/644615avg/644615max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 801981min/801981avg/801981max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 784444min/784444avg/784444max, from:
begin, index: 0
$
$
$ # GCC pre-processing based patch from Comment #6 (this looks like a real
improvement):
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 606954min/606954avg/606954max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1360988min/1360988avg/1360988max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 584298min/584298avg/584298max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 758953min/758953avg/758953max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 589420min/589420avg/589420max, from:
begin, index: 0
$
$
$ # Clean fresh git master stap without any optimization attempt:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 726612min/726612avg/726612max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 771579min/771579avg/771579max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 819065min/819065avg/819065max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 640921min/640921avg/640921max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 725201min/725201avg/725201max, from:
begin, index: 0
$
=======
What can I do to make the "inlining" approach work better?
--
You are receiving this mail because:
You are the assignee for the bug.