This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RE: [PATCH] Inline C99 math functions


> Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 06:53:39PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > > > I added a new math-inlines benchmark based on the string benchmark infrastructure.
> > > > I used 2x1024 inputs, one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
> > > > and the second with 50% inf and 50% NaN. Here are the relative timings for Cortex-A57:
> > > >
> > > Where is the benchmark? There are several things that could go wrong with it.
> >
> > I'll send it when I can (it has to go through review etc).
> >
> > > > __fpclassify_t:	8.76	7.04
> > > > fpclassify_t:	4.91	5.17
> > >
> > > > __isnormal_inl_t:	8.77	7.16
> > > > isnormal_t:		3.16	3.17
> > >
> > > Where did you get the inline? I couldn't find it anywhere. Also, such a big
> > > number for an inline implementation is suspect.
> >
> > It does (__fpclassify (x) == FP_NORMAL) like math.h, which is obviously a bad
> > idea and the reason for the low performance. Although the GCC isnormal builtin
> > is not particularly fast, it still beats that by more than a factor of 2.
> >
> No, the bad idea was not inlining fpclassify; that accounts for most of the performance
> difference. There is also the problem that glibcdev/glibc/sysdeps/ieee754/dbl-64/s_fpclassify.c
> is a bit slow, as it tests unlikely cases first, but that is secondary.

Even with the inlined fpclassify (inl2 below), isnormal is slower:

__isnormal_inl2_t:	1.25	3.67
__isnormal_inl_t:	4.59	2.89
isnormal_t:	1	1

So using a dedicated builtin for isnormal is important.

> > It's certainly correct, but obviously different microarchitectures will show
> > different results. Note the GLIBC private inlines are not particularly good.
> >
> No, the problem is that different benchmarks show different results on the same
> architecture. To speed things up, run the following to test all cases of the
> environment. Run the attached tf script to get results on ARM.

I tried, but I don't think this is a good benchmark: you're not measuring
the FP->int move for the branched version, and you're comparing the signed
version of isinf against the builtin, which does isinf_ns.

> Which doesn't matter. As GCC optimizes unneeded checks away, you won't do
> unneeded checks. And since using:
> 
> int result = __builtin_fpclassify (FP_NAN, FP_INFINITE,
>                                    FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x);
> return result == FP_INFINITE || result == FP_NAN;
> 
> is slower than:
> 
>  return __builtin_isinf (x) ||  __builtin_isnan (x);
> 
> your claim is false. Run the attached tf2 script to test.

That's not what I am seeing: using two explicit isinf/isnan calls (test2) is faster
than the inlined fpclassify (test1):

__fpclassify_test2_t:	1	4.41
__fpclassify_test1_t:	1.23	4.66

Wilco


