This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: [PATCH v2] Add math-inline benchmark


> Ondřej Bílka wrote:
> On Tue, Jul 21, 2015 at 05:14:51PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > > > Ondřej Bílka wrote:
> > > > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > > > But you claimed the following in your original mail, which is wrong:
> > > > >
> > > > > "
> > > > > Results shows that using the GCC built-ins in math.h gives huge speedups due to
> avoiding
> > > > > explict
> > > > > calls, PLT indirection to execute a function with 3-4 instructions - around 7x on
> AArch64
> > > and
> > > > > 2.8x
> > > > > on x64. The GCC builtins have better performance than the existing math_private
> inlines
> > > for
> > > > > __isnan,
> > > > > __finite and __isinf_ns, so these should be removed.
> > > > > "
> > > >
> > > > No, that statement is 100% correct.
> > > >
> > > As for isinf_ns, on some architectures the current isinf inline is better,
> > > so it should be replaced by that instead.
> >
> > The current inline (__isinf_ns) is not better than the builtin.
> >
> That's correct, as I also said. But it doesn't change my point:
> isinf_ns is slower than the builtin, and on x64 the builtin is slower
> than the current inline.
>
> So we should replace the internal isinf_ns with the current inline to
> get better performance.

Sure, it is certainly possible to improve upon the built-ins. But that
was never the goal of my patch, so it is not relevant to this discussion.

> > > Your claim about the __finite builtin is definitely false on x64; it's
> > > slower:
> >
> > No, that's not what I am getting on x64. With the movq instruction
> > and inlining, as in my original patch:
> >
> That's only your patch. It isn't clear at all whether that happens in
> reality. My benchmark shows that it doesn't.
> 
> So, Carlos, we have two benchmarks, each saying that one alternative is
> better than the other. How do you resolve that?

You'd solve it by showing that a particular inline is faster than the builtin on
some GLIBC math functions and proposing a patch.

> > > > It's obvious the huge speedup applies to all other architectures as well -
> > > > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > > > additional optimization of 3-4 instructions could ever cause a slowdown...
> > > >
> > > But I didn't ask about that. My point was that you made the bug
> > > x64-specific, but it's not clear at all whether other architectures are
> > > affected or not. So you must test them.
> >
> > All architectures are affected as they will get the speedup from the inlining.
> >
> Again, this is irrelevant. The current inlines could be faster than the
> builtins on most architectures, so you should test that and open GCC bugs.

Since the patch has been out for review, people have been able to test it for 
their architecture. Anyone can suggest further improvements if they want even more 
performance. There is already a GCC bug to improve the built-ins.

> > > > > So I ask you again to run my benchmark with EXTRACT_WORDS64 changed, to
> > > > > see if this is a problem on arm as well.
> > > >
> > > > Here are the results for x64 with inlining disabled (__always_inline changed
> > > > into noinline) and the movq instruction, as you suggested:
> > > >
> > > I asked you to run my benchmark on arm, not your benchmark on x64. With
> > > the modifications you described, it measures just the performance of a
> > > non-inlined function, which we don't want. There is a difference in how
> > > gcc optimizes it, so you must surround the entire expression in a
> > > noinline function. For example, gcc optimizes
> > >
> > > # define isinf(x) (noninf(x) ? 0 : (x == 1.0 / 0.0 ? 1 : -1))
> > > if (isinf(x))
> > >   foo()
> > >
> > > into
> > >
> > > if (!noninf(x))
> > >   foo()
> >
> > Which is exactly what we want to happen (and why the non-inlined isinf results
> > are not interesting, as 99.9% of the time we do the optimization).
> >
> That isn't completely correct. While this happens, most callers do the
> is* check only once, at the start, so loading the constants costs you. So
> you need to be careful how you write the benchmark to get a correct
> result, as I did in mine.

Actually almost all uses within GLIBC use the macros multiple times within the 
same function, so the immediates can be shared (sometimes even between 2 different
macros, and any fabs or FP->int moves can be shared too).

> > > >    "isnan_t": {
> > > >     "normal": {
> > > >      "duration": 1.50514e+06,
> > > >      "iterations": 500,
> > > >      "mean": 3010
> > > >     }
> > > >    },
> > >
> > > Why is isnan faster than the builtin?
> >
> > You asked for results with inlining turned off, so basically this shows the difference
> > between inlining the isnan builtin into a loop and calling a function which has the
> > isnan builtin inlined.
> >
> I never asked you to do that. I said that you should measure an
> expression using isinf surrounded in a noinline function, to avoid
> reusing constants in the loop.
>
> That will give you different results than what you got.

No, that wouldn't make any difference - all that matters for that experiment was
encapsulating the immediate in a separate function; you'll get one call every
iteration either way.

Wilco


