This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Bug runtime/10234] clean up aggregate hard-coded logic


https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I can verify the above patch functionality e.g. the following way:  When using
the above patch, I can see that e.g. when printing only @count(), but not the
@sum(), the @sum() computations get skipped during the module preprocessing as
intended and they do not show in the module disassembly:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
$ 
=======

Once the @sum() computations are needed, they expectedly also appear in the
module disassembly:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       78 6f                   js     691 <__stp_stat_add_4+0x121>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
 620:   73 74                   jae    696 <.LC17+0x496>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
 61f:   00 24 00                add    %ah,(%rax,%rax,1)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     61f:       00 00                   add    %al,(%rax)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       65 6c                   gs insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       65 5f                   gs pop %rdi
$
=======

Indeed, these computations live all in the stat-add.c, which is what the patch
from Comment #6 introduces.  That's all good.






Now trying another approach based on function inlining and compiler
optimizations. First, bringing back the runtime conditions similar to Comment
#5, and inlining __stp_stat_add():

=======
$ git diff runtime/stat-common.c
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..3a0dbdf 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -288,7 +288,7 @@ static void _stp_stat_print_histogram(Hist st, stat_data
*sd)
        _stp_print_flush();
 }

-static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
+static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
 {
        int n;
        int delta = 0;
@@ -305,11 +305,13 @@ static void __stp_stat_add(Hist st, stat_data *sd,
int64_t val)
                sd->avg_s = val << sd->shift;
                sd->_M2 = 0;
        } else {
-               sd->count++;
-               sd->sum += val;
-               if (val > sd->max)
+               if (sd->stat_ops &
(STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+                       sd->count++;
+               if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+                       sd->sum += val;
+               if ((sd->stat_ops & STAT_OP_MAX) && (val > sd->max))
                        sd->max = val;
-               if (val < sd->min)
+               if ((sd->stat_ops & STAT_OP_MIN) && (val < sd->min))
                        sd->min = val;
                /*
                 * Following is an optimization that improves performance
$
=======

Now, testing the above patch, optimizing the module with -O3 (added -O3 to
buildrun.cxx:477).  The compiler optimization is there (the second
specification overrides the first):

=======
$ eu-readelf --debug-dump=info blah.ko | grep produce | grep '\-O' | sort -u
           producer             (strp) "GNU C89 6.2.1 20160901 (Red Hat
6.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387
-mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic
-mno-red-zone -mcmodel=kernel -maccumulate-outgoing-args -mfentry -march=x86-64
-g -O2 -O3 -std=gnu90 -p -fno-strict-aliasing -fno-common -falign-jumps=1
-falign-loops=1 -funit-at-a-time -fno-delete-null-pointer-checks
-fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls
-fvar-tracking-assignments -fno-strict-overflow -fconserve-stack
-freorder-blocks -fasynchronous-unwind-tables -fno-ipa-icf --param
allow-store-data-races=0"
$
=======

That looks fine, so now the real check. When @sum() is needed (println'd),
related computations should be there. They are:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4caf:       b8 01 00 00 00          mov    $0x1,%eax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5249:       4c 8b 56 28             mov    0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f4:       4c 8b 56 28             mov    0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cac:       66 5a                   pop    %dx
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5246:       d7                      xlat   %ds:(%rbx)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f1:       fa                      cli    
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cac:       50                      push   %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5248:       5f                      pop    %rdi
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f1:       6c                      insb   (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cad:       50                      push   %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5247:       30 9f 00 00 00 00       xor    %bl,0x0(%rdi)
$
=======

But now, when @sum() becomes unneeded (not println'd), the related computations
do *not* disappear (where I was hoping for them to disappear):

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4caf:       b8 01 00 00 00          mov    $0x1,%eax
--
[ ... stuff deleted, the objdump output is very close to the one right above
one ... ]
=======

Here, the highest -O3 has been used with relatively modern
gcc-6.2.1-1.fc26.x86_64.  So this doesn't work.  It also doesn't improve the
performance: A brief performance comparison of the approaches:

=======
$ # The above, inlined __stp_stat_add() plus run-time conditions:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 1194232min/1194232avg/1194232max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1432136min/1432136avg/1432136max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 644615min/644615avg/644615max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 801981min/801981avg/801981max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 784444min/784444avg/784444max, from:
begin, index: 0
$ 
$ 
$ # GCC pre-processing based patch from Comment #6 (this looks like a real
improvement):
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 606954min/606954avg/606954max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1360988min/1360988avg/1360988max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 584298min/584298avg/584298max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 758953min/758953avg/758953max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 589420min/589420avg/589420max, from:
begin, index: 0
$ 
$ 
$ #  Clean fresh git master stap without any optimization attempt:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 726612min/726612avg/726612max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 771579min/771579avg/771579max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 819065min/819065avg/819065max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 640921min/640921avg/640921max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 725201min/725201avg/725201max, from:
begin, index: 0
$
=======

What can I do to make the "inlining" approach work better?

-- 
You are receiving this mail because:
You are the assignee for the bug.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]