This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] randomize benchtests


On Fri, May 17, 2013 at 04:58:32PM +0200, Torvald Riegel wrote:
> On Fri, 2013-05-17 at 16:05 +0200, Ondřej Bílka wrote: 
> > On Fri, May 17, 2013 at 02:44:24PM +0200, Torvald Riegel wrote:
> > > On Fri, 2013-05-17 at 13:47 +0200, Ondřej Bílka wrote:
> > > > On Fri, May 17, 2013 at 01:16:01PM +0200, Torvald Riegel wrote:
> > > > > On Fri, 2013-05-17 at 12:44 +0200, Ondřej Bílka wrote:
> > > > > > On Fri, May 17, 2013 at 12:24:30PM +0200, Torvald Riegel wrote:
> > > > > > > On Mon, 2013-04-22 at 14:56 +0200, Ondřej Bílka wrote:
> > > > > > > > On Mon, Apr 22, 2013 at 05:44:14PM +0530, Siddhesh Poyarekar wrote:
> > > > > > > > > On 22 April 2013 17:30, Ondřej Bílka <neleai@seznam.cz> wrote:
snip
> > > > This only adds noise, which can be controlled by a sufficient
> > > > number of samples.
> > > > 
> > > > Reproducibility? These tests are not reproducible nor designed to be
> > > > reproducible.
> > > 
> > > They should be, though not necessarily at the lowest level.  If they
> > > weren't reproducible, in the sense of being completely random, you
> > > couldn't derive any qualitative statement -- which we want to do,
> > > ultimately.
> > 
> > You must distinguish between http://en.wikipedia.org/wiki/Noise
> > and http://en.wikipedia.org/wiki/Bias. The expected value of the former does
> > not depend on the selected implementation, whereas for the latter it does.
> 
> This is unrelated to what I said.
> 
It is related (see below).

> > What should be reproducible are the ratios between implementations within a
> > single test (see below). That is the thing that matters.
> 
> That's *one thing* that we can try to make reproducible.  What matters
> in the end is that we find out whether there was a performance
> regression, meaning that our current implementation no longer has the
> performance properties that it once had (e.g., it's now slower
> than an alternative).  Our performance tests need to give us
> reproducible results in the sense that we can rely on them showing
> performance regressions.
> 
To test for regressions you need to compare against alternatives. Testing all
alternatives together is better, as it avoids a lot of possible errors.
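
For illustration, here is a minimal sketch (assuming a plain POSIX timer and
two made-up stand-in routines, impl_a and impl_b -- not the real benchtests
harness) of what comparing all alternatives within a single run could look
like: both implementations are timed in randomized order in the same process,
and only their in-run ratio is reported.

/* Sketch only: randomized in-run comparison of two hypothetical
   implementations; absolute times are not meant to be compared
   across runs or machines, only the ratio within this run.  */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Hypothetical stand-ins for the implementations under test.  */
static void
impl_a (void)
{
  volatile int x = 0;
  for (int i = 0; i < 1000; i++)
    x += i;
}

static void
impl_b (void)
{
  volatile int x = 0;
  for (int i = 0; i < 2000; i++)
    x += i;
}

int
main (void)
{
  double total[2] = { 0.0, 0.0 };
  long count[2] = { 0, 0 };
  srand (time (NULL));

  for (int i = 0; i < 100000; i++)
    {
      /* Randomize which implementation is measured next, so that
         environmental drift does not systematically favor one.  */
      int which = rand () % 2;
      double start = now_ns ();
      if (which == 0)
        impl_a ();
      else
        impl_b ();
      total[which] += now_ns () - start;
      count[which]++;
    }

  printf ("ratio a/b = %f\n",
          (total[0] / count[0]) / (total[1] / count[1]));
  return 0;
}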

> > When you compare different
> > runs without care, you introduce bias.
> 
> When comparing results from different machines, we *may* be comparing
> apples and oranges, but we're not necessarily doing so; this really just
> depends on whether the difference in the setups actually makes a
> difference for the question we want to answer.
>
I said that they are worthless for drawing conclusions. You cannot decide
that something is faster just because you had two different measurements. To
see what happens you need more granular benchmarking.

> Where did I suggest comparing results from different machines *as-is*
> without considering the differences between the machines?  But at the
> same time, we can't expect to get accurate measurements all the time, so
> we need to deal with imprecise data.
> 
> > and as you said:
> > > Even if there is noise that we can't control, this doesn't mean it's
> > > fine to add further noise without care (and not calibrating/measuring
> > 
> > 
> > > 
> > > > Runs on the same machine are affected by many environmental
> > > > effects, and anything other than a randomized comparison of implementations
> > > > in the same run has bias that makes the data worthless.
> > > 
> > > Even if there is noise that we can't control, this doesn't mean it's
> > > fine to add further noise without care (and not calibrating/measuring
> > > against a loop with just the rand_r would be just that).
> > > 
> > Adding noise is perfectly fine. You estimate the variance, and based on this
> > information you choose a number of iterations such that the measurement error
> > it causes is, in 99% of cases, within 1% of the mean.
> > 
> > You probably meant bias
> 
> Whether something is noise or bias depends on the question you're asking.
>
Noise and bias are technical terms with fixed meanings, and you must
distinguish between them.

When you write a benchmark

time = exact measurement;
if (implementation == mine)
  time /= 2;

then it is biased but not noisy.

When you write

time = exact measurement + random();

then it is noisy but not biased.
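
To make the earlier point about controlling noise concrete, here is a rough
sketch (my own illustration, not part of the benchtests) of choosing the
iteration count from an estimated standard deviation so that, under a normal
approximation, the noise-induced error of the reported mean stays within 1%
of the mean in roughly 99% of cases.  The pilot numbers in main() are made up.

/* Sketch: n >= (z * sigma / (0.01 * mu))^2 with z ~ 2.576 for a
   two-sided 99% confidence level.  Link with -lm.  */
#include <math.h>
#include <stdio.h>

static long
iterations_needed (double mean, double stddev)
{
  double z = 2.576;                 /* 99% two-sided normal quantile.  */
  double tolerance = 0.01 * mean;   /* Allowed error: 1% of the mean.  */
  return (long) ceil (pow (z * stddev / tolerance, 2.0));
}

int
main (void)
{
  /* Made-up pilot measurement: mean 100 ns, standard deviation 40 ns.  */
  printf ("%ld iterations needed\n", iterations_needed (100.0, 40.0));
  return 0;
}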

From Wikipedia:

NOISE:

In signal processing or computing noise can be considered random
unwanted data without meaning; that is, data that is not being used to
transmit a signal, but is simply produced as an unwanted by-product of
other activities. "Signal-to-noise ratio" is sometimes used to refer to
the ratio of useful to irrelevant information in an exchange.

BIAS:
 
In statistics, there are several types of bias: Selection bias, where
there is an error in choosing the individuals or groups to take part in
a scientific study. It includes sampling bias, in which some members of
the population are more likely to be included than others. Systematic bias 
or systemic bias are external influences that may affect the accuracy of 
statistical measurements.

> > and this is the reason why we do not try to
> > compare different runs.
> 
> But in practice, you'll have to, to some extent, if we want to take
> user-provided measurements into account.
> 
For user-provided measurements, the first question is to verify that they
measure the correct metric. There are several pitfalls you must avoid.

> > > Even if we have different runs on different machines, we can look for
> > > regressions among similar machines.  Noise doesn't make the data per se
> > > worthless.  And randomizing certain parameters doesn't necessarily
> > Did you try running the test twice?
> 
> ??? I can't see any relevant link between that sentence of yours and
> what we're actually discussing here.
>
You need to know how big these random factors are. Run the test twice and
you will see.
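
As a trivial illustration (again just a sketch, not the benchtests), timing
the same loop twice in one process is usually enough to see how large these
run-to-run factors are:

#include <stdio.h>
#include <time.h>

/* Time one fixed chunk of work, in nanoseconds.  */
static double
time_loop (void)
{
  struct timespec a, b;
  volatile long sink = 0;
  clock_gettime (CLOCK_MONOTONIC, &a);
  for (long i = 0; i < 10 * 1000 * 1000; i++)
    sink += i;
  clock_gettime (CLOCK_MONOTONIC, &b);
  return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int
main (void)
{
  /* The two numbers typically differ even with nothing else changed.  */
  printf ("first run:  %.0f ns\n", time_loop ());
  printf ("second run: %.0f ns\n", time_loop ());
  return 0;
}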

> > > remove any bias, because to do that you need to control all the
> > > parameters with your randomization.  And even if you do that, if you
> > That you cannot eliminate something completely does not mean that you
> > should give up. The goal is to manage http://en.wikipedia.org/wiki/Systematic_error
> > and keep it within reasonable bounds.
> 
> How does that conflict with what I said?
> 
> > Branch mispredictions and cache issues are major factors that can cause
> > bias, and randomization is crucial.
> 
> The point I made is that randomization doesn't necessarily avoid the
> issue, so it's not a silver bullet.
> 
> > Other factors are just random and they do not favor one implementation
> > over another in a significant way.
> 
> I believe we need to be careful here to not be dogmatic.  First because
> it makes the discussions harder.  Second because we're talking about
> making estimations about black boxes; there's no point in saying "X is
> the only way to measure this" or "Only Y matters here", because we can't
> expect to 100% know what's going on -- it will always involve a set of
> assumptions, tests for those, and so on.  Thus being open-minded rather
> than dogmatic is helpful.
>
There is a difference between being scientific and being dogmatic. There are
good reasons why you should do things in a certain way; unless you know
that reason, doing something else is usually a mistake.

The best way to measure something depends on what you want to measure.
So what do you think the important properties are?

What we estimate are not entirely black boxes. We have relatively
accurate models of how processors work.

 
> > > use, say, a uniform distribution for a certain input param, but most of our
> > > users are actually interested in just a subset of the input range most
> > I do exactly that, see my dryrun/profiler. I do not know why I am
> > wasting time trying to improve this.
> > > of the time, then even your randomization isn't helpful.
> > > 
> > > To give an example: It's fine if we get measurements for machines that
> > > don't control their CPU frequencies tightly, but it's not fine to throw
> > > away this information (as you indicated by dismissing the idea of a
> > > warning that someone else brought up).
> > Where did I write that I dismissed it? I only said that there are
> > more important factors to consider. If you want to write a patch that warns
> > and sets the CPU frequency, fine.
> 
> The answer to this suggestion by Petr Baudis started with:
> "Warning has problem that it will get lost in wall of text. 
> 
> Only factor that matters is performance ratio between implementations."
> 
> Which sounds pretty dismissive to me.  If it wasn't meant like that,
> blame translation...

