This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
[Bug runtime/10234] clean up aggregate hard-coded logic
- From: "mcermak at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: systemtap at sourceware dot org
- Date: Fri, 16 Sep 2016 07:59:33 +0000
- Subject: [Bug runtime/10234] clean up aggregate hard-coded logic
- Auto-submitted: auto-generated
- References: <bug-10234-6586@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
--- Comment #5 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9513
--> https://sourceware.org/bugzilla/attachment.cgi?id=9513&action=edit
test.stp
After attempting a more complex scenario with multiple CPUs (and getting
obscure results), I fell back to the following simple one:
=======
$ git diff
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..5a9be70 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -305,11 +305,36 @@ static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
sd->avg_s = val << sd->shift;
sd->_M2 = 0;
} else {
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->count++;
+ #else
sd->count++;
+ #endif
+
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->sum += val;
+ #else
sd->sum += val;
+ #endif
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MAX && val > sd->max)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MAX && val > sd->max))
+ #else
if (val > sd->max)
+ #endif
sd->max = val;
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MIN && val < sd->min)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MIN && val < sd->min))
+ #else
if (val < sd->min)
+ #endif
sd->min = val;
/*
* Following is an optimization that improves performance
$
=======
I've repeatedly used the following loop for the testing:
=======
for j in ' ' \-DOPT2 \-DOPT3; do
  for i in `seq 1 20`; do
    STAPOUT=$(mktemp)
    stap -vtg --poison-cache --suppress-time-limits $j ./test.stp >& $STAPOUT
    cat $STAPOUT | awk -F/ '/^begin/ {print $3}' | sed 's/avg//'
  done | awk '{ total += $1; count++ } END { printf total/(count * 1000) }'
  echo " $k $j"
done
=======
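(The inner awk reducer just prints the mean of the per-run timings divided by
1000. Fed three made-up timings, it behaves like this:)

```shell
# Feed three fake per-run timings through the same awk reducer
# used in the loop above; it prints the mean divided by 1000.
printf '%s\n' 1000 2000 3000 \
  | awk '{ total += $1; count++ } END { printf total/(count * 1000) }'
# prints 2
```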
where test.stp (attached) uses many '<<<' aggregations over a fixed set of
once-generated random values, and all of @count, @sum, @min, @max, @avg, and
@variance are explicitly printed at the end of the script. The results are
relative (in 'time units'). Each run of the above loop produces one set of
results in the table below. Each test ran on a single CPU.
The unlikely() macro is a branch-prediction hint. I tested on various
architectures. Single-user mode was used to reduce interference from
unrelated userspace processes. The kernel itself seemed to perform some
adjustments during the first few runs (see below), so I discarded the
corresponding measurements.
=======
[ 294.297412] perf: interrupt took too long (2509 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[ 329.709509] perf: interrupt took too long (3139 > 3136), lowering
kernel.perf_event_max_sample_rate to 63000
[ 652.383749] perf: interrupt took too long (3927 > 3923), lowering
kernel.perf_event_max_sample_rate to 50000
=======
Results:
------------------------------------------------------------------|
ppc64 s390x x86_64
------------------| ------------------| ------------------|
518.857 412.962 2513.75
529.423 -DOPT2 423.081 -DOPT2 2555.29 -DOPT3
759.425 -DOPT3 442.206 -DOPT3 2578.75 -DOPT2
------------------| ------------------| ------------------|
580.696 389.635 2530.95
592.576 -DOPT2 415.82 -DOPT2 2545.6 -DOPT3
642.577 -DOPT3 421.315 -DOPT3 2571.62 -DOPT2
------------------| ------------------| ------------------|
552.094 388.975 2517.83
610.116 -DOPT2 416.48 -DOPT2 2550.84 -DOPT3
759.274 -DOPT3 420.661 -DOPT3 2581.71 -DOPT2
------------------| ------------------| ------------------|
595.439 385.798 2515.89
610.511 -DOPT2 421.387 -DOPT3 2562.03 -DOPT3
679.916 -DOPT3 424.299 -DOPT2 2579.27 -DOPT2
------------------| ------------------| ------------------|
532.466 389.769
576.825 -DOPT2 415.853 -DOPT2
739.139 -DOPT3 422.055 -DOPT3
------------------| ------------------|
562.744 -DOPT2 387.642
594.826 416.685 -DOPT2
978.162 -DOPT3 422.846 -DOPT3
------------------| ------------------|
638.113 -DOPT3 387.403
763.521 421.357 -DOPT2
780.33 -DOPT2 421.584 -DOPT3
------------------| ------------------|
596.114 1383.85
608.686 -DOPT2 419.845 -DOPT3
677.906 -DOPT3 449.496 -DOPT2
------------------| ------------------|
517.915 388.702
606.76 -DOPT2 421.449 -DOPT3
758.362 -DOPT3 499.101 -DOPT2
------------------| ------------------|
561.234 -DOPT2 418.358 -DOPT2
592.586 435.673 -DOPT3
640.741 -DOPT3 481.296
------------------| ------------------|
590.273 -DOPT2
595.477
760.43 -DOPT3
------------------|
533.675
546.956 -DOPT2
740.934 -DOPT3
------------------|
592.894 -DOPT2
595.04
739.344 -DOPT3
------------------|
529.315 -DOPT2
533.694
760.528 -DOPT3
------------------|
595.766
608.028 -DOPT2
699.56 -DOPT3
------------------|
546.428 -DOPT2
548.962
760.877 -DOPT3
------------------|
579.024
605.618 -DOPT2
680.475 -DOPT3
------------------|
518.279
591.874 -DOPT2
759.458 -DOPT3
------------------|
480.887 -DOPT2
578.895
740.364 -DOPT3
------------------|
459.348
469.135 -DOPT2
761.488 -DOPT3
------------------|
457.565
607.04 -DOPT2
759.243 -DOPT3
------------------|
514.944 -DOPT2
563.888
739.681 -DOPT3
------------------|
581.474
593.549 -DOPT2
659.873 -DOPT3
------------------|
532.542
589.381 -DOPT2
720.753 -DOPT3
------------------|
My conclusion is similar to the one in Comment #3: I think the benefit of
optimizing the trivial @count, @sum, @min, and @max online computations by
wrapping them in run-time conditions is moot. Hmm, actually, maybe we could
interpret the results as "both -DOPT2 and -DOPT3 slightly worsened the
tested performance in most cases". I might be wrong, though :-)
--
You are receiving this mail because:
You are the assignee for the bug.