This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] randomize benchtests


On Fri, 2013-05-17 at 16:05 +0200, Ondřej Bílka wrote:
> On Fri, May 17, 2013 at 02:44:24PM +0200, Torvald Riegel wrote:
> > On Fri, 2013-05-17 at 13:47 +0200, Ondřej Bílka wrote:
> > > On Fri, May 17, 2013 at 01:16:01PM +0200, Torvald Riegel wrote:
> > > > On Fri, 2013-05-17 at 12:44 +0200, Ondřej Bílka wrote:
> > > > > On Fri, May 17, 2013 at 12:24:30PM +0200, Torvald Riegel wrote:
> > > > > > On Mon, 2013-04-22 at 14:56 +0200, Ondřej Bílka wrote:
> > > > > > > On Mon, Apr 22, 2013 at 05:44:14PM +0530, Siddhesh Poyarekar wrote:
> > > > > > > > On 22 April 2013 17:30, Ondřej Bílka <neleai@seznam.cz> wrote:
> > > > > > > > > +      clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &start);
> > > > > > > > > +      for (k = 0; k < iters; k++)
> > > > > > > > > +        {
> > > > > > > > > +         i = rand_r (&seed) % NUM_SAMPLES;
> > > > > > > > > +         BENCH_FUNC(i);
> > > > > > > > > +        }
> > > > > > > > > +      clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &end);
> > > > > > > > 
> > > > > > > > This is wrong.  The interval also has the time taken to call rand_r.
> > > > > > > > 
> > > > > > > This is not wrong.  You are interested only in the differences between
> > > > > > > implementations, and adding the same time from the rand_r calls does not
> > > > > > > change that.
> > > > > > 
> > > > > > But if we should be changing the rand_r implementation in the future
> > > > > > (e.g., if we might be getting HW support for it on a certain
> > > > > > architecture), then this will lead to a difference in all our
> > > > > 
> > > > > From stdlib/rand_r.c:
> > > > > 
> > > > > /* This algorithm is mentioned in the ISO C standard, here extended
> > > > >    for 32 bits.  */
> > > > > 
> > > > > Given that changing it would break applications that depend on rand_r
> > > > > reproducibility, it will not change.
> > > > > If you want a fully specified generator, use drand48.
> > > > 
> > > > My comment was about the performance of one call to rand_r.  This isn't
> > > > about where or how it's specified, it's just about the performance, and
> > > > trying to keep it reproducible.
> > > Performance does not matter, as we measure the overhead with an empty body
> > > and subtract it.
> > 
> > So you later on added the second option that I mentioned in my previous
> > email, calibrating against a loop with just a rand_r call?  If so, then
> > that's fine.  But why didn't you mention this right away?
> >
> For simplicity I deferred this to a later commit.  As this does not change
> the performance difference between implementations, it had low priority.
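As a concrete illustration of the calibration being discussed, a minimal
sketch could time a loop whose body is only the rand_r call and subtract
that from the benchmark measurement.  The helper name and its interface
are hypothetical and not taken from the patch:

#include <stdlib.h>
#include <time.h>

/* Hypothetical calibration helper, not part of the quoted patch: time a
   loop containing only the rand_r call, so its cost can be subtracted
   from the loop that also calls BENCH_FUNC.  */
static double
rand_r_loop_seconds (size_t iters, size_t num_samples)
{
  struct timespec start, end;
  unsigned int seed = 42;
  volatile size_t sink;   /* keeps the loop body from being optimized away */

  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &start);
  for (size_t k = 0; k < iters; k++)
    sink = rand_r (&seed) % num_samples;
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &end);

  (void) sink;
  return (end.tv_sec - start.tv_sec)
         + (end.tv_nsec - start.tv_nsec) * 1e-9;
}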
> 
> > > This only adds noise, which can be controlled by a sufficient
> > > number of samples.
> > > 
> > > Reproducibility?  These tests are not reproducible, nor designed to be
> > > reproducible.
> > 
> > They should be, though not necessarily at the lowest level.  If they
> > weren't reproducible, in the sense of being completely random, you
> > couldn't derive any qualitative statement -- which is what we
> > ultimately want to do.
> 
> You must distinguish between http://en.wikipedia.org/wiki/Noise
> and http://en.wikipedia.org/wiki/Bias.  The expected value of the former
> does not depend on the selected implementation, which is not the case
> for the latter.

This is unrelated to what I said.

> What should be reproducible are the ratios between implementations in a
> single test (see below).  That is the thing that matters.

That's *one thing* that we can try to make reproducible.  What matters
in the end is that we find out whether there was a performance
regression, meaning that our current implementation doesn't have the
performance properties anymore that it once had (e.g., it's now slower
than an alternative).  Our performance tests need to give us
reproducible results, in the sense that we can rely on them to show
performance regressions.
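To make the "ratio within a single test" idea concrete, a minimal sketch
could time two implementations interleaved in random order within the
same run and report the ratio of their totals.  All names below are
hypothetical illustrations, not benchtests symbols:

#include <stdlib.h>
#include <time.h>

typedef void (*impl_fn) (size_t);

/* Hypothetical helper: CPU time of a single call, in seconds.  */
static double
time_call (impl_fn fn, size_t arg)
{
  struct timespec start, end;
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &start);
  fn (arg);
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &end);
  return (end.tv_sec - start.tv_sec)
         + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

/* Time two implementations in the same run, in randomized order and with
   randomized inputs, and return total_a / total_b.  */
static double
ratio_a_over_b (impl_fn impl_a, impl_fn impl_b, size_t iters,
                unsigned int seed)
{
  double total_a = 0.0, total_b = 0.0;

  for (size_t k = 0; k < iters; k++)
    {
      size_t arg = rand_r (&seed) % 1024;  /* 1024 stands in for NUM_SAMPLES */
      if (rand_r (&seed) & 1)              /* randomize call order */
        {
          total_a += time_call (impl_a, arg);
          total_b += time_call (impl_b, arg);
        }
      else
        {
          total_b += time_call (impl_b, arg);
          total_a += time_call (impl_a, arg);
        }
    }
  return total_a / total_b;
}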

> When you compare different
> runs without care, you introduce bias.

When comparing results from different machines, we *may* be comparing
apples and oranges, but we're not necessarily doing so; this really just
depends on whether the differences in the setups actually make a
difference for the question we want to answer.

Where did I suggest comparing results from different machines *as-is*
without considering the differences between the machines?  But at the
same time, we can't expect to get accurate measurements all the time, so
we need to deal with imprecise data.

> and as you said:
> > Even if there is noise that we can't control, this doesn't mean it's
> > fine to add further noise without care (and not calibrating/measuring
> 
> 
> > 
> > > Runs on the same machine are affected by many environmental
> > > effects, and anything other than a randomized comparison of implementations
> > > in the same run has a bias that makes the data worthless.
> > 
> > Even if there is noise that we can't control, this doesn't mean it's
> > fine to add further noise without care (and not calibrating/measuring
> > against a loop with just the rand_r would be just that).
> > 
> Adding noise is perfectly fine.  You estimate the variance and, based on
> this information, choose the number of iterations such that the
> measurement error it causes is, in 99% of cases, within 1% of the mean.
> 
> You probably meant bias

Whether something is noise or bias depends on the question you're asking.
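For concreteness, the 99%/1% criterion quoted above could be implemented
roughly as follows.  This is a sketch under a normal approximation, and
the function and its pilot-sample interface are assumptions, not
existing benchtests code:

#include <math.h>
#include <stddef.h>

/* Hypothetical: given pilot timing samples, estimate how many iterations
   are needed so that the mean is within 1% of the true mean with 99%
   confidence, assuming roughly normally distributed noise.  */
static size_t
iterations_needed (const double *samples, size_t n_pilot)
{
  if (n_pilot < 2)
    return 0;

  double sum = 0.0, sumsq = 0.0;
  for (size_t i = 0; i < n_pilot; i++)
    {
      sum += samples[i];
      sumsq += samples[i] * samples[i];
    }
  double mean = sum / n_pilot;
  double var = fmax (sumsq / n_pilot - mean * mean, 0.0);
  double sigma = sqrt (var);

  const double z = 2.576;       /* two-sided 99% normal quantile */
  double tol = 0.01 * mean;     /* 1% of the mean */

  /* Require z * sigma / sqrt (n) <= tol, i.e. n >= (z * sigma / tol)^2.  */
  double n = (z * sigma / tol) * (z * sigma / tol);
  return (size_t) ceil (n);
}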

> and this is the reason why we do not try to
> compare different runs.

But in practice, you'll have to, to some extent, if we want to take
user-provided measurements into account.

> > Even if we have different runs on different machines, we can look for
> > regressions among similar machines.  Noise doesn't make the data per se
> > worthless.  And randomizing certain parameters doesn't necessarily
> Did you try running the test twice?

??? I can't see any relevant link between that sentence of yours and
what we're actually discussing here.

> > remove any bias, because to do that you need to control all the
> > parameters with your randomization.  And even if you do that, if you
> That you cannot eliminate something completely does not mean that you
> should give up.  The goal is to manage http://en.wikipedia.org/wiki/Systematic_error
> and keep it within reasonable bounds.

How does that conflict with what I said?

> Branch mispredictions and cache issues are major factors that can cause
> bias, and randomization is crucial.

The point I made is that randomization doesn't necessarily avoid the
issue, so it's not a silver bullet.

> Other factors are just random and do not favor one implementation
> over another in a significant way.

I believe we need to be careful here to not be dogmatic.  First because
it makes the discussions harder.  Second because we're talking about
making estimations about black boxes; there's no point in saying "X is
the only way to measure this" or "Only Y matters here", because we can't
expect to 100% know what's going on -- it will always involve a set of
assumptions, tests for those, and so on.  Thus being open-minded rather
than dogmatic is helpful.

> > use, say, a uniform distribution for a certain input param, but most of our
> > users are actually interested in just a subset of the input range most
> I do exactly that, see my dryrun/profiler. I do not know why I am
> wasting time trying to improve this.
> > of the time, then even your randomization isn't helpful.
> > 
> > To give an example: It's fine if we get measurements for machines that
> > don't control their CPU frequencies tightly, but it's not fine to throw
> > away this information (as you indicated by dismissing the idea of a
> > warning that someone else brought up).
> Where did I write that I dismissed it?  I only said that there are
> more important factors to consider.  If you want to write a patch that
> warns and sets the cpu frequency, fine.

The answer to this suggestion by Petr Baudis started with:
"Warning has problem that it will get lost in wall of text. 

Only factor that matters is performance ratio between implementations."

Which sounds pretty dismissive to me.  If it wasn't meant like that,
blame translation...
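For what it's worth, a warning along those lines could be as simple as
checking the Linux cpufreq governor through sysfs.  The sketch below is
an assumption about how it might look; the sysfs path is the standard
Linux location, but nothing here is existing benchtests code:

#include <stdio.h>
#include <string.h>

/* Hypothetical: warn when cpu0's cpufreq governor is not "performance",
   since frequency scaling adds noise to timing measurements.  */
static void
warn_if_freq_scaling (void)
{
  const char *path
    = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
  char gov[64] = "";
  FILE *f = fopen (path, "r");

  if (f == NULL)
    return;  /* No cpufreq sysfs entry; nothing to check.  */
  if (fgets (gov, sizeof gov, f) != NULL)
    {
      gov[strcspn (gov, "\n")] = '\0';
      if (strcmp (gov, "performance") != 0)
        fprintf (stderr,
                 "warning: cpufreq governor is \"%s\"; timings may be noisy\n",
                 gov);
    }
  fclose (f);
}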

