This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


[Bug runtime/11308] aggregate operations for @variance, @skew, @kurtosis


https://sourceware.org/bugzilla/show_bug.cgi?id=11308

--- Comment #1 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9311
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9311&action=edit
proposed patch

The variance of N data points is V = S / (N - 1), where S is the sum of the
squared deviations from the mean.  Here is an attempt to implement the
@variance() operator using Knuth's algorithm [1]:

=======
def online_variance(data):
    n = 0
    mean = 0.0
    M2 = 0.0

    for x in data:
        n += 1
        delta = x - mean
        mean += delta/n
        M2 += delta*(x - mean)

    if n < 2:
        return float('nan')
    else:
        return M2 / (n - 1)
=======
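
As a quick cross-check of the listing above, the following sketch compares the
online update with the two-pass definition V = S / (N - 1).  The helper name
two_pass_variance() and the sample data are illustrative assumptions and are
not part of the patch.

```python
import math


def online_variance(data):
    """Single-pass (online) sample variance, as in the listing above."""
    n = 0
    mean = 0.0
    M2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    return M2 / (n - 1) if n >= 2 else math.nan


def two_pass_variance(data):
    """Hypothetical reference helper: V = S / (N - 1) in two passes."""
    n = len(data)
    if n < 2:
        return math.nan
    mean = sum(data) / n                      # first pass: the mean
    S = sum((x - mean) ** 2 for x in data)    # second pass: squared deviations
    return S / (n - 1)


if __name__ == "__main__":
    samples = list(range(1000))  # illustrative data only
    print(online_variance(samples), two_pass_variance(samples))
```

Both functions should agree up to floating-point rounding; the online form
needs only one pass and constant state (n, mean, M2), which is what makes it
suitable for accumulating samples as they arrive.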

This patch builds on the current systemtap implementation of the aggregation
operators, which first pre-aggregates the data on each CPU (__stp_stat_add()).
When the aggregates are read, e.g. via @sum (or @variance), they are
aggregated again, this time across all CPUs (_stp_stat_get()), and output.
This approach avoids contention on shared resources at collection time.
Accordingly, this patch collects per-CPU variances first and then aggregates
them across all CPUs to give the resulting @variance.  N is assumed to satisfy
N >> 1, so the resulting @variance() is computed as a simple mean of the
per-CPU variances.  Integer arithmetic is used.  With this patch, we get
something relatively small for data points clustered closely around the mean,
and something relatively large for data points spread widely around the mean.
So it passes a rough sanity test:

=======
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } }  probe
end { printdln(", ", @count(a), @max(a), @variance(a)) }'
1000, 42, 1
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } for(i=0;
i<20; i++) { a<<<99 } }  probe end { printdln(", ", @count(a), @max(a),
@variance(a)) }'
1020, 99, 65
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<i } }  probe
end { printdln(" ", @count(a), @max(a), @variance(a)) }'
1000 999 332833
# 
=======
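
The per-CPU scheme described above can be sketched as follows.  Each simulated
CPU keeps its own (n, mean, M2) triple, and the read side combines them in two
ways: as the simple mean of per-CPU variances mentioned in the patch
description, and, for comparison only, with the exact pairwise merge formula
from the same Wikipedia article [1] (a swapped-in technique; the patch is not
stated to use it).  The names CpuStat, mean_of_variances() and
merged_variance(), and the round-robin distribution of samples, are
assumptions made for illustration.  Floating point is used here, whereas the
patch uses integer arithmetic.

```python
import statistics


class CpuStat:
    """Hypothetical per-CPU accumulator holding count, mean and M2."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def add(self, x):
        # Same online update as in the listing earlier in this message.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    def variance(self):
        return self.M2 / (self.n - 1) if self.n >= 2 else float('nan')


def mean_of_variances(stats):
    """Simple mean of per-CPU variances (an approximation for N >> 1)."""
    usable = [s.variance() for s in stats if s.n >= 2]
    return sum(usable) / len(usable) if usable else float('nan')


def merged_variance(stats):
    """Exact combination of per-CPU partial results (pairwise merge)."""
    n, mean, M2 = 0, 0.0, 0.0
    for s in stats:
        if s.n == 0:
            continue
        total = n + s.n
        delta = s.mean - mean
        # Add the within-CPU spread plus the spread between the two means.
        M2 += s.M2 + delta * delta * n * s.n / total
        mean += delta * s.n / total
        n = total
    return M2 / (n - 1) if n >= 2 else float('nan')


if __name__ == "__main__":
    data = list(range(1000))
    cpus = [CpuStat() for _ in range(4)]
    for i, x in enumerate(data):
        cpus[i % 4].add(x)  # round-robin: every CPU sees a similar spread
    print(mean_of_variances(cpus), merged_variance(cpus),
          statistics.variance(data))
```

When the samples are spread evenly over the CPUs, the two results are close.
When the per-CPU means differ (for example, one CPU sees only 42 and another
only 99), the simple mean leaves out the spread between CPUs, which the merge
formula accounts for.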


-------------------------------------------
[1]
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
