This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RE: [PATCH v2] Add math-inline benchmark


> Ondřej Bílka wrote:
> On Mon, Jul 20, 2015 at 12:01:50PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > > But you claimed the following in the original mail, which is wrong:
> > >
> > > "
> > > Results show that using the GCC built-ins in math.h gives huge speedups due to avoiding
> > > explicit calls, PLT indirection to execute a function with 3-4 instructions - around 7x
> > > on AArch64 and 2.8x on x64. The GCC builtins have better performance than the existing
> > > math_private inlines for __isnan, __finite and __isinf_ns, so these should be removed.
> > > "
> >
> > No, that statement is 100% correct.
> >
> As for __isinf_ns: on some architectures the current isinf inline is better, so
> it should be replaced by that instead.

The current inline (__isinf_ns) is not better than the builtin.

> Your claim about the __finite builtin is definitely false on x64; it's
> slower:

No, that's not what I am getting on x64. With the movq instruction
and inlining as in my original patch:

   "__finite_inl_t": {
    "normal": {
     "duration": 1.59802e+06,
     "iterations": 500,
     "mean": 3196
    }
   },
   "__isfinite_builtin_t": {
    "normal": {
     "duration": 1.48391e+06,
     "iterations": 500,
     "mean": 2967
    }
   },

With movq but inlining disabled:

   "__finite_inl_t": {
    "normal": {
     "duration": 3.09839e+06,
     "iterations": 500,
     "mean": 6196
    }
   },
   "__isfinite_builtin_t": {
    "normal": {
     "duration": 2.6429e+06,
     "iterations": 500,
     "mean": 5285
    }
   },

So the benefit of the builtin increases when we can't lift the immediate.

> > > Also, when inlines give a speedup you should add math inlines for the
> > > signalling NaN case as well. That gives a similar speedup. And it would be natural
> > > to ask whether you should use these inlines every time if they are already
> > > faster than the builtins.
> >
> > I'm not sure what you mean here - I enable the new inlines in exactly the
> > right case. Improvements to support signalling NaNs or to speedup the
> > built-ins further will be done in GCC.
> >
> Why can't we just use
> #ifdef __SUPPORT_SNAN__
> math inline
> #else
> builtin
> #endif

That's a lot of work for little gain, given that signalling NaNs are not often enabled.

> > It's obvious the huge speedup applies to all other architectures as well -
> > it's hard to imagine that avoiding a call, a return, a PLT indirection and
> > additional optimization of 3-4 instructions could ever cause a slowdown...
> >
> But I didn't ask about that. I asked why you made the bug x64-specific
> when it's not clear at all whether other architectures are affected or not. So
> you must test these.

All architectures are affected as they will get the speedup from the inlining.

> > > So I ask you again to run my benchmark with a changed EXTRACT_WORDS64 to
> > > see if this is a problem on arm as well.
> >
> > Here are the results for x64 with inlining disabled (__always_inline changed
> > into noinline) and the movq instruction like you suggested:
> >
> I asked you to run my benchmark on arm, not your benchmark on x64. As you described
> the modifications, it measures just the performance of the noninlined function,
> which we don't want. There is a difference in how gcc optimizes
> that, so you must surround the entire expression in noinline. For
> example gcc optimizes
> 
> # define isinf(x) (noninf(x) ? 0 : (x == 1.0 / 0.0 ? 1 : -1))
> if (isinf(x))
>   foo()
> 
> into
> 
> if (!noninf(x))
>   foo()

Which is exactly what we want to happen (and why the non-inlined isinf results
are not interesting, as 99.9% of the time we do this optimization).

> >    "__isnan_t": {
> >     "normal": {
> >      "duration": 3.52048e+06,
> >      "iterations": 500,
> >      "mean": 7040
> >     }
> >    },
> >    "__isnan_inl_t": {
> >     "normal": {
> >      "duration": 3.09247e+06,
> >      "iterations": 500,
> >      "mean": 6184
> >     }
> >    },
> >    "__isnan_builtin_t": {
> >     "normal": {
> >      "duration": 2.20378e+06,
> >      "iterations": 500,
> >      "mean": 4407
> >     }
> >    },
> >    "isnan_t": {
> >     "normal": {
> >      "duration": 1.50514e+06,
> >      "iterations": 500,
> >      "mean": 3010
> >     }
> >    },
> 
> why is isnan faster than the builtin?

You asked for results with inlining turned off, so basically this shows the difference
between inlining the isnan builtin into a loop and calling a function which has the
isnan builtin inlined.

> >    "__isnormal_inl2_t": {
> >     "normal": {
> >      "duration": 2.18113e+06,
> >      "iterations": 500,
> >      "mean": 4362
> >     }
> >    },
> >    "__isnormal_builtin_t": {
> >     "normal": {
> >      "duration": 3.08183e+06,
> >      "iterations": 500,
> >      "mean": 6163
> >     }
> >    },
> 
> also here, why did isnormal2 do so well?

isnormal_inl2_t is of course still inlined, as it is a define.

Wilco


