This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Inline C99 math functions


On Mon, Jun 15, 2015 at 05:40:11PM +0100, Wilco Dijkstra wrote:
> Joseph Myers wrote:
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> > 
> > > Add inlining of the C99 math functions
> > > isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> > > when available. Since going through the PLT is expensive for these small
> > > functions, inlining results in major speedups (about 7x on Cortex-A57
> > > for isinf). The GCC built-ins are not correct if signalling NaN support
> > 
> > Where are the benchmarks for this?  Please put them in benchtests so
> > actual reproducible figures can be given.  That's the standard practice
> > for any change being justified on the basis of performance.
> 
> I'll add a benchmark in another patch - it's not trivial as benchtest is not
> suitable to accurately time very simple functions, especially when inlined...
>
As I wrote in the other thread, the gcc builtins have poor performance, so
writing a benchmark is tricky. The main problem is that these tests usually
sit inside a branch, and gcc will simplify a branched inline using the
surrounding condition. Since the builtins are branchless, gcc obviously
cannot simplify them the same way.

You could reuse my example below to show the problem with using the builtins
as a benchtest; you could make worse mistakes and inlining would still save
the day.

You will get different characteristics in the following three cases. In case
1, isinf is simplified by the surrounding if, so we measure the time of a
branch that is never taken.

Case 2 is identical, except that in case 1 a conversion to the branchless form
ret += 42 * __builtin_isinf(d[i])
might make sense, where gcc could make isinf return 1 when its argument is
an infinity.

In case 3 you get a raw sum; there the difference between the builtin and the
optimized inline is bigger, mainly because of the dependency chain.

The numbers you get are also architecture-specific, so it is hard to predict
exact figures, beyond that inlining will help, as it gives a huge boost.

For the other functions, performance should be identical with the optimized
inline, as it just checks whether the exponent field is the maximum (0x7ff)
and fails; what happens after that check is irrelevant.
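To illustrate that early exponent-check idea (this is only a sketch with a made-up name, not the actual glibc inline):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a branchless NaN test that only looks at the binary64 bit
   pattern: after clearing the sign bit, the NaNs are exactly the values
   whose bits compare above +infinity (0x7ff0000000000000).  */
static inline int
my_isnan (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);  /* type-pun without aliasing UB */
  return (bits & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}
```

Everything reduces to one masked compare of the exponent and mantissa bits, so whichever classification you build on top of it, the hot path stays the same.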

On my Ivy Bridge I get the following, where bt<i> is the optimized inline for
case i and nt<i> is the gcc builtin for the same case.

Performance becomes different again when you decide to isolate the builtin
and the inline expansion into separately compiled functions, as the loops
gain a lot by not having to initialize constants.

So you first need to say what your benchmark characteristics are before
writing the benchmark. A sample follows.


time ./bt1

real	0m0.788s
user	0m0.772s
sys	0m0.000s
19:44:18:~$ time ./nt1

real	0m1.162s
user	0m1.136s
sys	0m0.004s
19:44:23:~$ time ./bt2

real	0m0.777s
user	0m0.760s
sys	0m0.000s
19:44:28:~$ time ./nt2

real	0m1.167s
user	0m1.144s
sys	0m0.000s
19:44:32:~$ time ./bt3

real	0m1.161s
user	0m1.140s
sys	0m0.000s
19:44:37:~$ time ./nt3

real	0m1.774s
user	0m1.740s
sys	0m0.000s






#include <stdlib.h>
#ifndef BRANCHED
/* Without BRANCHED, pick up isinf from math.h (the gcc builtin).  */
# include <math.h>
#else
static inline int
isinf (double dx)
{
  union u
  {
    double d;
    unsigned long l;
  };
  union u u;
  u.d = dx;
  unsigned long x = u.l;
  /* Shifting out the sign bit leaves 0xffe0000000000000 exactly for
     +-inf; the second compare recovers the sign.  */
  return 2 * x == 0xffe0000000000000UL
         ? (x == 0x7ff0000000000000UL ? 1 : -1) : 0;
}
#endif

int
main (void)
{
  double ret = 0.0;
  int i, j;
  /* calloc rather than malloc so the inputs are initialized (zeros).  */
  double *d = calloc (1000, sizeof (double));
  for (j = 0; j < 1000000; j++)
#ifdef T1
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        ret += 42;
#endif
#ifdef T2
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        abort ();
#endif
#ifdef T3
    for (i = 0; i < 1000; i++)
      ret += __builtin_expect (isinf (d[i]), 0);
#endif

  free (d);
  return ret;
}


 
> > What are the effects on code size (especially for fpclassify)?  If code
> > becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> > considered.
> 
> Codesize of what? Few applications use these functions... GLIBC mathlib is 
> the main example, but it's not typical as it has extremely high usage of these
> functions and you'll unlikely want to avoid inlining as the performance gain
> and avoidance of explicit calls are significant benefits in the math functions 
> (no need to use callee-saves).
> 
> I suppose adding a size check for fpclassify would be reasonable as the 
> expansion ends up quite large (but that should really be fixed). 
> Note fpclassify is used rather inappropriately in most cases in GLIBC, so
> the best approach would be rewrite those to use faster isnan/isinf checks.
> 
Remember the cost of cache: you are only really optimizing for performance
when you assume instruction cache misses are a problem. It can be worth
optimizing even at a size increase; just be aware of the instruction cost.
A rule of thumb is that an extra byte should save a cycle.

fpclassify should have about the same cache impact as isinf. Only the
exponent check would ever be in the cache; gcc should move the rest of the
function to a separate cache line of cold code that is almost never read and
so obviously has no impact on performance.

Also, my guess is that it would improve size in most cases, but that depends
on the caller. And you shouldn't rewrite those uses to isnan/isinf checks, as
that would probably harm performance.

Again, the reason is that gcc deletes the unused branches, so the function
size depends on what you check and is smaller than separate checks. Using
your builtin on x86-64 you get the following assembly for

   int result = __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL,
                                      FP_SUBNORMAL, FP_ZERO, d[i]);

followed by

if (result == FP_NORMAL)

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 10                	jp     66 <main+0x66>
  56:	66 0f 2e c2          	ucomisd %xmm2,%xmm0
  5a:	77 0a                	ja     66 <main+0x66>
  5c:	66 0f 2e c4          	ucomisd %xmm4,%xmm0
  60:	72 04                	jb     66 <main+0x66>
  62:	f2 0f 58 dd          	addsd  %xmm5,%xmm3


or 

if (result == FP_NAN)

  41:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  45:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  49:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  4d:	7b e9                	jnp    38 <main+0x38>

or

if (result == FP_INFINITE || result == FP_ZERO)

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c2          	andpd  %xmm2,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 06                	jp     5c <main+0x5c>
  56:	66 0f 2e c3          	ucomisd %xmm3,%xmm0
  5a:	76 04                	jbe    60 <main+0x60>


Note that the FP_NORMAL case has poor performance and you should replace it
with an inline: none of the three branches there is taken, while what I
submitted in the review thread has only one never-taken branch.
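As an illustration of the exponent-only style of check meant above (my_isnormal is a hypothetical name for this sketch, not what glibc or my patch defines):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: a double is normal iff its biased exponent field is neither
   0 (zero or subnormal) nor 0x7ff (infinity or NaN).  */
static inline int
my_isnormal (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);
  unsigned int exp = (bits >> 52) & 0x7ff;
  return exp != 0 && exp != 0x7ff;
}
```

With a test like this the compiler sees a single short comparison chain on the exponent field, so the never-taken branch stays out of the hot path.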

> > > As a result of this many target overrides and the various
> > > __isnan/__finite inlines in math_private.h are no longer required. If
> > > agreed we could remove all this code and only keep the generic
> > > definition of isinf/etc which will use the builtin.
> > 
> > How do those inlines compare with the code GCC generates?  As I understand
> > it, before your patch libm-internal calls to these macros would have
> > called the inline versions of functions such as __finite (no function
> > call, PLT or otherwise, and using integer arithmetic), and after the patch
> > libm-internal calls would have used the GCC inlines (with floating-point
> > arithmetic) - is the new state for libm-internal calls better than the old
> > state or not?
> 
> __isinf(f/l) were never inlined, but are now (__isinf_ns was used in some cases,
> but not nearly all). Fpclassify*, isnormal* were never inlined, but now are. These
> new inlines provide the gains.
> 
> __isnan(f) and __finite(f) used GLIBC internal inlines, and now use GCC built-ins 
> instead. We could measure whether this specific change causes any difference, but 
> I bet it will be very uarch specific and minor compared to the total time math
> functions take.
> 
> Also I don't think having special inlines that are only used inside GLIBC is a 
> good approach - if the GCC built-ins are not fast enough then we should fix them.
> 

