This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH v2] Add math-inline benchmark
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: 'Ondřej Bílka' <neleai at seznam dot cz>
- Cc: "'GNU C Library'" <libc-alpha at sourceware dot org>
- Date: Tue, 21 Jul 2015 20:09:35 +0100
- Subject: RE: [PATCH v2] Add math-inline benchmark
- Authentication-results: sourceware.org; auth=none
- References: <002001d0bfb8$b36fa330$1a4ee990$ at com> <20150716225056 dot GA24479 at domone> <002501d0c094$3ea04cd0$bbe0e670$ at com> <20150718113423 dot GC30356 at domone> <002a01d0c2db$7ad80c30$70882490$ at com> <20150720192216 dot GA2019 at domone> <003001d0c3d0$5fb0a390$1f11eab0$ at com> <20150721174732 dot GB22893 at domone>
> Ondřej Bílka wrote:
> On Tue, Jul 21, 2015 at 05:14:51PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > > > Ondřej Bílka wrote:
> > > > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > > > But you claimed the following in your original mail, which is wrong:
> > > > >
> > > > > "
> > > > > Results show that using the GCC built-ins in math.h gives huge
> > > > > speedups due to avoiding explicit calls and PLT indirection to
> > > > > execute a function with 3-4 instructions - around 7x on AArch64
> > > > > and 2.8x on x64. The GCC builtins have better performance than
> > > > > the existing math_private inlines for __isnan, __finite and
> > > > > __isinf_ns, so these should be removed.
> > > > > "
> > > >
> > > > No that statement is 100% correct.
> > > >
> > > As for isinf_ns: on some architectures the current isinf inline is better, so
> > > it should be replaced by that instead.
> >
> > The current inline (__isinf_ns) is not better than the builtin.
> >
> That's correct, as I also said. But it doesn't change my point:
> isinf_ns is slower than the builtin, and on x64 the builtin is slower
> than the current inline.
>
> So we should replace internal isinf_ns with current inline to
> get better performance.
Sure, it is certainly possible to improve upon the built-ins. But that
was never the goal of my patch, so it is not relevant to this discussion.
> > > Your claim about the __finite builtin is definitely false on x64; it's
> > > slower:
> >
> > No that's not what I am getting on x64. With the movq instruction
> > and inlining as my original patch:
> >
> That's only your patch. It isn't clear at all whether that happens in reality.
> My benchmark shows that it doesn't.
>
> So Carlos, we have two benchmarks, each saying that one alternative is
> better than the other; how do you resolve that?
You'd solve it by showing that a particular inline is faster than the builtin on
some GLIBC math functions and proposing a patch.
> > > > It's obvious the huge speedup applies to all other architectures as well -
> > > > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > > > additional optimization of 3-4 instructions could ever cause a slowdown...
> > > >
> > > But I didn't ask about that. I asked why you made the bug x64-specific
> > > when it's not clear at all whether other architectures are affected. So
> > > you must test those.
> >
> > All architectures are affected as they will get the speedup from the inlining.
> >
> Again, this is irrelevant. The current inlines could be faster than the builtins
> on most architectures, so you should test that and open GCC bugs.
Since the patch has been out for review, people have been able to test it for
their architecture. Anyone can suggest further improvements if they want even more
performance. There is already a GCC bug to improve the built-ins.
> > > > > So I ask you again to run my benchmark with a changed EXTRACT_WORDS64 to
> > > > > see if this is a problem on ARM as well.
> > > >
> > > > Here are the results for x64 with inlining disabled (__always_inline changed
> > > > into noinline) and the movq instruction like you suggested:
> > > >
> > > I asked you to run my benchmark on ARM, not your benchmark on x64. With the
> > > modifications you described, it measures just the performance of a non-inline
> > > function, which we don't want. How GCC optimizes the surrounding code makes a
> > > difference in performance, so you must surround the entire expression in a
> > > noinline function. For example, gcc optimizes
> > >
> > > # define isinf(x) (noninf(x) ? (x == 1.0 / 0.0 ? 1 : -1) : 0)
> > > if (isinf(x))
> > > foo()
> > >
> > > into
> > >
> > > if (noninf(x))
> > > foo()
> >
> > Which is exactly what we want to happen (and why the non-inlined isinf results
> > are not interesting, as 99.9% of the time we do this optimization).
> >
> That isn't completely correct. While this happens, most callers do the is*
> check only once, at the start, so loading the constants costs you. So you need
> to be careful how you write the benchmark to get a correct result, like I did
> in mine.
Actually almost all uses within GLIBC use the macros multiple times within the
same function, so the immediates can be shared (sometimes even between 2 different
macros, and any fabs or FP->int moves can be shared too).
> > > > "isnan_t": {
> > > > "normal": {
> > > > "duration": 1.50514e+06,
> > > > "iterations": 500,
> > > > "mean": 3010
> > > > }
> > > > },
> > >
> > > why is isnan faster than the builtin?
> >
> > You asked for results with inlining turned off, so basically this shows the difference
> > between inlining the isnan builtin into a loop and calling a function which has the
> > isnan builtin inlined.
> >
> I never asked you to do that. I said that you should measure an expression
> using isinf surrounded by a noinline function, to avoid reusing the constants
> in the loop.
>
> That will give you different results than the ones you got.
No, that wouldn't make any difference: all that matters for that experiment was
encapsulating the immediate in a separate function; you'll get one call every
iteration either way.
Wilco