This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
[Bug runtime/10234] clean up aggregate hard-coded logic
- From: "mcermak at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: systemtap at sourceware dot org
- Date: Fri, 16 Sep 2016 07:59:33 +0000
- Subject: [Bug runtime/10234] clean up aggregate hard-coded logic
- Auto-submitted: auto-generated
- References: <bug-10234-6586@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
--- Comment #5 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9513
--> https://sourceware.org/bugzilla/attachment.cgi?id=9513&action=edit
test.stp
After attempting a more complex scenario with multiple CPUs (and getting
obscure results), I fell back to the following simple one:
=======
$ git diff
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..5a9be70 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -305,11 +305,36 @@ static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
sd->avg_s = val << sd->shift;
sd->_M2 = 0;
} else {
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->count++;
+ #else
sd->count++;
+ #endif
+
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->sum += val;
+ #else
sd->sum += val;
+ #endif
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MAX && val > sd->max)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MAX && val > sd->max))
+ #else
if (val > sd->max)
+ #endif
sd->max = val;
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MIN && val < sd->min)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MIN && val < sd->min))
+ #else
if (val < sd->min)
+ #endif
sd->min = val;
/*
* Following is an optimization that improves performance
$
=======
I've repeatedly used the following loop for the testing:
=======
for j in ' ' \-DOPT2 \-DOPT3; do
  for i in `seq 1 20`; do
    STAPOUT=$(mktemp)
    stap -vtg --poison-cache --suppress-time-limits $j ./test.stp >& $STAPOUT
    cat $STAPOUT | awk -F/ '/^begin/ {print $3}' | sed 's/avg//'
  done | awk '{ total += $1; count++ } END { printf total/(count * 1000) }'
  echo " $k $j"
done
=======
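(The inner awk reducer just prints the mean of the per-run timings divided by
1000. Fed three made-up timings, it behaves like this:)

```shell
# Feed three fake per-run timings through the same awk reducer
# used in the loop above; it prints the mean divided by 1000.
printf '%s\n' 1000 2000 3000 \
  | awk '{ total += $1; count++ } END { printf total/(count * 1000) }'
# prints 2
```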
where test.stp (attached) uses many '<<<' aggregations over a fixed set of
once-generated random values, and all of @count, @sum, @min, @max, @avg, and
@variance are explicitly printed at the end of the script. The results are
relative (in 'time units'). Each run of the above loop produces one set of
results in the table below. Each test ran on a single CPU.
The unlikely() macro is a branch-prediction hint. I tested on various
architectures. Single-user mode was used to reduce interference from
unrelated userspace processes. The kernel itself seemed to perform some
adjustments during the first few runs (see below), so I discarded the
corresponding measurements.
=======
[ 294.297412] perf: interrupt took too long (2509 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[ 329.709509] perf: interrupt took too long (3139 > 3136), lowering
kernel.perf_event_max_sample_rate to 63000
[ 652.383749] perf: interrupt took too long (3927 > 3923), lowering
kernel.perf_event_max_sample_rate to 50000
=======
Results:
------------------------------------------------------------------|
ppc64 s390x x86_64
------------------| ------------------| ------------------|
518.857 412.962 2513.75
529.423 -DOPT2 423.081 -DOPT2 2555.29 -DOPT3
759.425 -DOPT3 442.206 -DOPT3 2578.75 -DOPT2
------------------| ------------------| ------------------|
580.696 389.635 2530.95
592.576 -DOPT2 415.82 -DOPT2 2545.6 -DOPT3
642.577 -DOPT3 421.315 -DOPT3 2571.62 -DOPT2
------------------| ------------------| ------------------|
552.094 388.975 2517.83
610.116 -DOPT2 416.48 -DOPT2 2550.84 -DOPT3
759.274 -DOPT3 420.661 -DOPT3 2581.71 -DOPT2
------------------| ------------------| ------------------|
595.439 385.798 2515.89
610.511 -DOPT2 421.387 -DOPT3 2562.03 -DOPT3
679.916 -DOPT3 424.299 -DOPT2 2579.27 -DOPT2
------------------| ------------------| ------------------|
532.466 389.769
576.825 -DOPT2 415.853 -DOPT2
739.139 -DOPT3 422.055 -DOPT3
------------------| ------------------|
562.744 -DOPT2 387.642
594.826 416.685 -DOPT2
978.162 -DOPT3 422.846 -DOPT3
------------------| ------------------|
638.113 -DOPT3 387.403
763.521 421.357 -DOPT2
780.33 -DOPT2 421.584 -DOPT3
------------------| ------------------|
596.114 1383.85
608.686 -DOPT2 419.845 -DOPT3
677.906 -DOPT3 449.496 -DOPT2
------------------| ------------------|
517.915 388.702
606.76 -DOPT2 421.449 -DOPT3
758.362 -DOPT3 499.101 -DOPT2
------------------| ------------------|
561.234 -DOPT2 418.358 -DOPT2
592.586 435.673 -DOPT3
640.741 -DOPT3 481.296
------------------| ------------------|
590.273 -DOPT2
595.477
760.43 -DOPT3
------------------|
533.675
546.956 -DOPT2
740.934 -DOPT3
------------------|
592.894 -DOPT2
595.04
739.344 -DOPT3
------------------|
529.315 -DOPT2
533.694
760.528 -DOPT3
------------------|
595.766
608.028 -DOPT2
699.56 -DOPT3
------------------|
546.428 -DOPT2
548.962
760.877 -DOPT3
------------------|
579.024
605.618 -DOPT2
680.475 -DOPT3
------------------|
518.279
591.874 -DOPT2
759.458 -DOPT3
------------------|
480.887 -DOPT2
578.895
740.364 -DOPT3
------------------|
459.348
469.135 -DOPT2
761.488 -DOPT3
------------------|
457.565
607.04 -DOPT2
759.243 -DOPT3
------------------|
514.944 -DOPT2
563.888
739.681 -DOPT3
------------------|
581.474
593.549 -DOPT2
659.873 -DOPT3
------------------|
532.542
589.381 -DOPT2
720.753 -DOPT3
------------------|
My conclusion is similar to the one in Comment #3: I think the benefit of
optimizing the trivial @count, @sum, @min, and @max online computations by
wrapping them in run-time conditions is moot. Hmm, actually, maybe we could
interpret the results as "both -DOPT2 and -DOPT3 slightly worsened the
tested performance in most cases". I might be wrong, though :-)
--
You are receiving this mail because:
You are the assignee for the bug.