This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: [PATCH v2] Add math-inline benchmark
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: "Ondřej Bílka" <neleai at seznam dot cz>
- Cc: "'GNU C Library'" <libc-alpha at sourceware dot org>
- Date: Tue, 21 Jul 2015 17:14:51 +0100
- Subject: RE: [PATCH v2] Add math-inline benchmark
- Authentication-results: sourceware.org; auth=none
- References: <002001d0bfb8$b36fa330$1a4ee990$ at com> <20150716225056 dot GA24479 at domone> <002501d0c094$3ea04cd0$bbe0e670$ at com> <20150718113423 dot GC30356 at domone> <002a01d0c2db$7ad80c30$70882490$ at com> <20150720192216 dot GA2019 at domone>
> Ondřej Bílka wrote:
> On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > But you claimed the following in the original mail, which is wrong:
> > >
> > > "
> > > Results show that using the GCC built-ins in math.h gives huge speedups due to avoiding
> > > explicit calls and PLT indirection to execute a function with 3-4 instructions - around
> > > 7x on AArch64 and 2.8x on x64. The GCC builtins have better performance than the existing
> > > math_private inlines for __isnan, __finite and __isinf_ns, so these should be removed.
> > > "
> >
> > No, that statement is 100% correct.
> >
> As for __isinf_ns, on some architectures the current isinf inline is better, so
> it should be replaced by that instead.
The current inline (__isinf_ns) is not better than the builtin.
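To make the comparison concrete, here is a minimal sketch of the two approaches being benchmarked: a bit-pattern test in the style of the old __isinf_ns inline, next to the GCC builtin. The function names and the memcpy-based bit extraction are illustrative, not the actual glibc code:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bit-level infinity test in the style of the __isinf_ns inline:
   clear the sign bit and compare against the IEEE-754 binary64
   infinity pattern (all-ones exponent, zero mantissa).  */
static inline int
isinf_ns_style (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);   /* stands in for EXTRACT_WORDS64 */
  return (u & 0x7fffffffffffffffULL) == 0x7ff0000000000000ULL;
}

/* The builtin lets the compiler pick the best sequence for the
   target and fold the test into a surrounding branch.  */
static inline int
isinf_builtin_style (double x)
{
  return __builtin_isinf (x) != 0;
}
```

Both compute the same predicate; the difference under discussion is purely which one the compiler can turn into better code at the use site.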
> Your claim about the __finite builtin is definitely false on x64; it's
> slower:
No, that's not what I am getting on x64. With the movq instruction
and inlining as in my original patch:
"__finite_inl_t": {
"normal": {
"duration": 1.59802e+06,
"iterations": 500,
"mean": 3196
}
},
"__isfinite_builtin_t": {
"normal": {
"duration": 1.48391e+06,
"iterations": 500,
"mean": 2967
}
},
With movq but inlining disabled:
"__finite_inl_t": {
"normal": {
"duration": 3.09839e+06,
"iterations": 500,
"mean": 6196
}
},
"__isfinite_builtin_t": {
"normal": {
"duration": 2.6429e+06,
"iterations": 500,
"mean": 5285
}
},
So the benefit of the builtin increases when we can't lift the immediate.
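For reference, a sketch of the kind of __finite inline under discussion, assuming a memcpy-based stand-in for EXTRACT_WORDS64 (the real glibc macro differs). The 64-bit mask is the large immediate referred to above: once the test is inlined, the compiler can hoist that constant out of a loop, which is what "lift the immediate" means here:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* A double is finite iff its exponent field is not all-ones.
   The mask load is the large immediate that can be hoisted out
   of a loop when this is inlined.  */
static inline int
finite_inline_style (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);   /* compiles to a single movq on x64 */
  return (u & 0x7ff0000000000000ULL) != 0x7ff0000000000000ULL;
}
```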
> > > Also, when inlines give a speedup, you should add math inlines for the
> > > signaling NaN case as well. That gives a similar speedup. And it would be natural
> > > to ask if you should use these inlines every time if they are already
> > > faster than the builtins.
> >
> > I'm not sure what you mean here - I enable the new inlines in exactly the
> > right case. Improvements to support signalling NaNs or to speedup the
> > built-ins further will be done in GCC.
> >
> Why can't we just use
> #ifdef __SUPPORT_SNAN__
> math inline
> #else
> builtin
> #endif
That's a lot of work for little gain given that signalling NaNs are not enabled often.
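For completeness, a sketch of the conditional Ondřej proposes. The macro name my_isnan and the integer-based inline are hypothetical; the point is that a bit-pattern test never evaluates the operand as a float, so it cannot raise an invalid-operation exception on a signalling NaN:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Integer-based isnan: a NaN has an all-ones exponent and a
   non-zero mantissa, so its masked bits exceed the infinity
   pattern.  No FP comparison is performed, so no exception can
   be raised even for a signalling NaN.  */
static inline int
isnan_bits (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return (u & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}

#ifdef __SUPPORT_SNAN__
# define my_isnan(x) isnan_bits (x)       /* avoid trapping comparisons */
#else
# define my_isnan(x) __builtin_isnan (x)  /* let the compiler optimize */
#endif
```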
> > It's obvious the huge speedup applies to all other architectures as well -
> > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > additional optimization of 3-4 instructions could ever cause a slowdown...
> >
> But I didn't ask about that. I asked because you made the bug x64-specific,
> but it's not clear at all whether other architectures are affected or not. So
> you must test them.
All architectures are affected as they will get the speedup from the inlining.
> > > So I ask you again to run my benchmark with a changed EXTRACT_WORDS64 to
> > > see if this is a problem on arm as well.
> >
> > Here are the results for x64 with inlining disabled (__always_inline changed
> > into noinline) and the movq instruction like you suggested:
> >
> I asked you to run my benchmark on arm, not your benchmark on x64. With the
> modifications you described, it measures just the performance of the noninline
> function, which we don't want. There is a difference in how gcc
> optimizes that, so you must surround the entire expression in a noinline
> function. For example gcc optimizes
>
> # define isinf(x) (noninf(x) ? 0 : (x == 1.0 / 0.0 ? 1 : -1))
> if (isinf(x))
> foo()
>
> into
>
> if (!noninf(x))
> foo()
Which is exactly what we want to happen (and why the non-inlined isinf results
are not interesting, as 99.9% of the time we do the optimization).
> > "__isnan_t": {
> > "normal": {
> > "duration": 3.52048e+06,
> > "iterations": 500,
> > "mean": 7040
> > }
> > },
> > "__isnan_inl_t": {
> > "normal": {
> > "duration": 3.09247e+06,
> > "iterations": 500,
> > "mean": 6184
> > }
> > },
> > "__isnan_builtin_t": {
> > "normal": {
> > "duration": 2.20378e+06,
> > "iterations": 500,
> > "mean": 4407
> > }
> > },
> > "isnan_t": {
> > "normal": {
> > "duration": 1.50514e+06,
> > "iterations": 500,
> > "mean": 3010
> > }
> > },
>
> Why is isnan faster than the builtin?
You asked for results with inlining turned off, so basically this shows the difference
between inlining the isnan builtin into a loop and calling a function which has the
isnan builtin inlined.
> > "__isnormal_inl2_t": {
> > "normal": {
> > "duration": 2.18113e+06,
> > "iterations": 500,
> > "mean": 4362
> > }
> > },
> > "__isnormal_builtin_t": {
> > "normal": {
> > "duration": 3.08183e+06,
> > "iterations": 500,
> > "mean": 6163
> > }
> > },
>
> Also here, why did isnormal2 do so well?
isnormal_inl2_t is still inlined, of course, as it is a define.
Wilco