This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v3] Add math-inline benchmark


On Fri, Jul 24, 2015 at 01:04:40PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Thu, Jul 23, 2015 at 01:54:27PM +0100, Wilco Dijkstra wrote:
> > First is what we need to make explicit. You argue both that you use the
> > benchmark just to show that inlining provides a speedup and to justify
> > claims. You cannot have it both ways.
> 
> Of course I can. This is not at all an either/or issue nor a contentious
> statement. The existing inlines are pretty inefficient so it shouldn't
> be a surprise to anyone that they can be easily beaten. And all the
> evidence shows that my patch improves performance of the existing inlines.
>
As you selected the more difficult route, your benchmark is just wrong.

It claims that the finite builtin is faster than the inline. But when I look at
the actual benchtests I reach the opposite conclusion. I ran the benchtest three
times on Haswell and consistently get that replacing finite in the pow
function with isfinite causes a regression.

I also checked on Core 2 and Ivy Bridge: Core 2 shows no difference, and on
Ivy Bridge the inlines are also faster than the builtin.
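
To be concrete about what gets compared, a minimal sketch of the two styles
of check looks like this (my paraphrase, not the actual math_private.h code
or the builtin expansion; memcpy stands in for EXTRACT_WORDS64):

#include <stdint.h>
#include <string.h>

/* Bit-twiddling check in the style of the current inlines: finite means
   the exponent field is not all ones.  */
static inline int
finite_bits (double x)
{
  uint64_t ix;
  memcpy (&ix, &x, sizeof ix);          /* stand-in for EXTRACT_WORDS64 */
  return ((ix >> 52) & 0x7ff) != 0x7ff;
}

/* Builtin check: GCC expands this inline, typically to a mask of the
   sign bit plus a compare against infinity.  */
static inline int
finite_builtin (double x)
{
  return __builtin_isfinite (x);
}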
 
current timing:

{"timing_type": "hp_timing",
 "functions": {
  "pow": {
   "": {
    "duration": 3.50115e+10,
    "iterations": 2.44713e+08,
    "max": 882.983,
    "min": 47.615,
    "mean": 143.072
   },
   "240bits": {
    "duration": 3.5122e+10,
    "iterations": 1.3e+06,
    "max": 33208,
    "min": 21679.5,
    "mean": 27016.9
   },
   "768bits": {
    "duration": 4.24365e+10,
    "iterations": 101000,
    "max": 442505,
    "min": 205110,
    "mean": 420163
   }
  },

with builtin:

{"timing_type": "hp_timing",
 "functions": {
  "pow": {
   "": {
    "duration": 3.49986e+10,
    "iterations": 2.4381e+08,
    "max": 868.965,
    "min": 47.612,
    "mean": 143.549
   },
   "240bits": {
    "duration": 3.5529e+10,
    "iterations": 1.4e+06,
    "max": 30331.3,
    "min": 20436,
    "mean": 25377.9
   },
   "768bits": {
    "duration": 3.98189e+10,
    "iterations": 101000,
    "max": 417778,
    "min": 194699,
    "mean": 394246
   }
  },



> My goal is to use the fastest possible implementation in all cases. 
> What would the point be to add fast inlines to math.h but keeping the 
> inefficient internal inlines? That makes no sense whatsoever unless 
> your goal is to keep GLIBC slow...
> 
Then you are forced to take the more difficult route. It isn't only about the
inefficient inlines but also about any other inlines introduced. As
performance varies wildly depending on how you benchmark it, you will select
the implementation that is best on the benchmark. That doesn't have anything
in common with practical performance.

> > If you also want to make claims about builtins/current inlines then you
> > need to make a benchmark that can accurately measure inlines and
> > builtins to see what is correct and what is wrong. That's quite hard, as I
> > said, since it depends on how gcc will optimize the implementation.
> 
> I am confident that my benchmark does a great job. I looked at the
> assembly code of all functions and concluded that GCC6 compiled the
> code exactly as I expected and so it is measuring precisely what I
> intended to measure.
> 
And what do you measure?
> > Here there are still unresolved issues from previous patches.
> > 
> > First, you still test on x64 without getting EXTRACT_WORDS64
> > from math_private. Without that you don't measure the
> > current inlines.
> 
> I ran my benchmark with the movq instruction and it doesn't change the
> conclusion in any way (store-load forwarding is pretty quick).
> 
> So this is a non-issue - it is simply a bug in the x64 GCC that should
> be fixed and yet another reason why it is better to use a builtin that
> generates more optimal code without needing inline assembler.
> 
The problem is that this could be CPU specific, and it could matter on some
CPUs. It is also simply a matter of measuring the correct function; showing
that it doesn't matter in one case does not imply that it wouldn't
matter for a different workload.
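
For reference, the difference under discussion is roughly this (a paraphrased
sketch, not the verbatim math_private.h macros):

#include <stdint.h>
#include <string.h>

/* Generic-style extraction: the double goes through memory, so the
   integer load depends on store-load forwarding.  */
#define EXTRACT_WORDS64_GENERIC(i, d)       \
  do {                                      \
    double d_ = (d);                        \
    uint64_t w_;                            \
    memcpy (&w_, &d_, sizeof w_);           \
    (i) = w_;                               \
  } while (0)

/* x64-style extraction: move straight from an SSE register to a
   general-purpose register, avoiding the memory round-trip.  */
#define EXTRACT_WORDS64_MOVQ(i, d)                          \
  do {                                                      \
    uint64_t w_;                                            \
    asm ("movq %1, %0" : "=r" (w_) : "x" ((double) (d)));   \
    (i) = w_;                                               \
  } while (0)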

> > First, for isinf you omitted the current isinf inline, which is faster than
> > the isinf_ns and builtin variants that you check.
> 
> There is no isinf inline. I just kept the 3 math_private.h inlines.
> 
Thanks for the clarification.
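
For completeness, the non-signaling check in the style of __isinf_ns looks
roughly like this (paraphrased, again with memcpy standing in for
EXTRACT_WORDS64):

#include <stdint.h>
#include <string.h>

/* Non-signaling infinity check: clear the sign bit and compare against
   the bit pattern of +inf.  */
static inline int
isinf_ns_sketch (double x)
{
  uint64_t ix;
  memcpy (&ix, &x, sizeof ix);
  return (ix & 0x7fffffffffffffffULL) == 0x7ff0000000000000ULL;
}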

> > Then the remainder test is wrong. It is inlined but kernel_standard isn't.
> > As you wanted to use it to measure the performance of a non-inline function,
> > it obviously doesn't measure that.
> 
> Kernel_standard should not be inlined as the real implementation isn't.
> 
Now when I looked, you test a different function than remainder, as the real
implementation calls __ieee754_remainder and handles overflow
differently. This could also make a difference.
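
The shape I mean is roughly the following simplified sketch, where
handle_domain_error and do_ieee754_remainder are hypothetical stand-ins for
__kernel_standard and __ieee754_remainder:

#include <math.h>

/* Hypothetical stand-in for __kernel_standard: the rarely taken error path.  */
static double
handle_domain_error (double x, double y)
{
  (void) x; (void) y;
  return NAN;                       /* the real code also sets errno/exceptions */
}

/* Hypothetical stand-in for the non-inlined __ieee754_remainder.  */
static double
do_ieee754_remainder (double x, double y)
{
  return remainder (x, y);
}

/* Simplified wrapper in the shape being discussed: the classification
   checks guard the error path, and the common case calls the non-inlined
   core routine.  */
double
remainder_sketch (double x, double y)
{
  if (__builtin_expect ((y == 0.0 && !isnan (x))
                        || (isinf (x) && !isnan (y)), 0))
    return handle_domain_error (x, y);
  return do_ieee754_remainder (x, y);
}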

> > When you fix that it clearly shows that the current inlines are better on
> > x64.
> 
> No, that's not what it shows. In all cases the new inlines are faster.
> 
> >    "remainder_test1": {
> >     "normal": {
> >      "duration": 4.23772e+06,
> >      "iterations": 500,
> >      "mean": 8475
> >     }
> >    },
> >    "remainder_test2": {
> >     "normal": {
> >      "duration": 4.45968e+06,
> >      "iterations": 500,
> >      "mean": 8919
> >     }
> >    }
> 
> Those results are way too fast, so something must be wrong with your changes.
>
I got a timeout for that so I decreased the size. That doesn't change the
outcome that with the fixed test the current inlines are faster.

Also it does not change the fact that you have a bug there and should fix it,
and this isn't the real remainder: you need to call __ieee754_remainder, which
is a different workload, as you need to do a second check there and
out-of-order execution could help you.
 
> To conclude, none of your issues are actual issues with my benchmark.
> 
No, there are still plenty of issues. You need to be more careful about what
you actually benchmark.

