This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v2] Add math-inline benchmark


On Fri, Jul 17, 2015 at 02:26:53PM +0100, Wilco Dijkstra wrote:
> > OndÅej BÃlka wrote:
> > On Thu, Jul 16, 2015 at 12:15:19PM +0100, Wilco Dijkstra wrote:
> > > Add a benchmark for isinf/isnan/isnormal/isfinite/fpclassify. This new version adds explicit
> > tests
> > > for the GCC built-ins and uses json format as suggested and no longer includes any string
> > headers.
> > > The test uses 2 arrays with 1024 doubles, one with 99% finite FP numbers (10% zeroes, 10%
> > negative)
> > > and 1% inf/NaN, the other with 50% inf, and 50% Nan.
> > >
> > > Results shows that using the GCC built-ins in math.h gives huge speedups due to avoiding
> > explict
> > > calls, PLT indirection to execute a function with 3-4 instructions - around 7x on AArch64
> > and 2.8x
> > > on x64. The GCC builtins have better performance than the existing math_private inlines for
> > __isnan,
> > > __finite and __isinf_ns, so these should be removed.
> > >
> > No, this benchmark is invalid for following two reasons.
> > 
> > 1) It doesn't measure real workload at all. Constructing large constant
> > could be costy and by inlining this benchmark ignores cost.
> 
> It was never meant to measure a real workload. If you'd like to add your own 
> workload, that would be great, but my micro benchmark is more than sufficient
> in proving that the new inlines give a huge performance gain.
> 
But you claimed following in original mail which is wrong:

"
Results shows that using the GCC built-ins in math.h gives huge speedups due to avoiding explict
calls, PLT indirection to execute a function with 3-4 instructions - around 7x on AArch64 and 2.8x
on x64. The GCC builtins have better performance than the existing math_private inlines for __isnan,
__finite and __isinf_ns, so these should be removed.
"

Also when inlines give speedup you should also add math inlines for
signaling nan case. That gives similar speedup. And it would be natural
to ask if you should use these inlines everytime if they are already
faster than builtins.

> > 2) Results on x64 don't measure inlines but inferior version as they use
> > assembly to change double into integer.
> > 
> > As I and Joseph told you multiple times to measure these and I send
> > benchmarks to demonstrate its effects. So I fixed your benchmark and now
> > it clearly shows that in all cases math_private inlines are better than
> > builtins on x64 (Now its x64 only due EXTRACT_WORDS64, you need to sync that with math-
> > private).
> > Even remainder is slower.
> 
> It shouldn't be necessary to use assembly code, GCC is perfectly capable of
> generating efficient code for EXTRACT_WORDS. If GCC doesn't for x64 then you
> should report it and ensure it gets fixed. I don't understand why people keep
> using target-specific hacks and assembly rather than fixing the underlying
> issues and use generic code that works well on all targets...
>
Thats problem its bit of special casing, if everybody wanted to add
special case assembly hack then gcc would be huge mess of these. As for
this I will open pr to fix that/add something like builtin_convert_fp_i.
 
> > So at least on x64 we should publish math_private inlines instead using
> > slow builtins.
> 
> Well it was agreed we are going to use the GCC built-ins and then improve
> those. If you want to propose additional patches with special inlines for
> x64 then please go ahead, but my plan is to improve the builtins.
>
And how are you sure that its just isolated x64 case. It may also happen
on powerpc, arm, sparc and other architectures and you need to test
that.

So I ask you again to run my benchmark with changed EXTRACT_WORDS64 to
see if this is problem also and arm.

> > I also added better finite,isnormal inlines, these should have same
> > speed as isnan. Main improvement is multiplication by 2 instead anding
> > with constant to mask sign bit.
> 
> Your inlines may well be very fast but they are not correct, eg. this doesn't
> check the exponent at all:
>
Most are correct. For isinf I was testing if this heuristic helps as
most floats have nonzero mantissa. Following is correct, performance
should be same but it depends on benchmark, how often it sets nonzero
mantissa.

 if (__builtin_expect ((di << 12) != 0, 1))
    return 0;
 if ((di << 1) != 0xffe0000000000000ull)
    return 0;

 return (int) (di >> 32);

 
> > +extern __always_inline int
> > +isinf_new (double dx)
> > +{
> > +  uint64_t di;
> > +  EXTRACT_WORDS64 (di, dx);
> > +
> > +  if (__builtin_expect ((di << 12) != 0, 1))
> > +    return 0;
> > +
> > +  return (int) (di >> 32);
> > +}


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]