This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] Inline C99 math functions

From: Ondřej Bílka <neleai at seznam dot cz>
To: Wilco Dijkstra <wdijkstr at arm dot com>
Cc: 'Joseph Myers' <joseph at codesourcery dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Wed, 17 Jun 2015 20:51:22 +0200
Subject: Re: [PATCH] Inline C99 math functions
Authentication-results: sourceware.org; auth=none
References: <alpine dot DEB dot 2 dot 10 dot 1506151431490 dot 26683 at digraph dot polyomino dot org dot uk> <001701d0a789$f2ab86f0$d80294d0$ at com> <alpine dot DEB dot 2 dot 10 dot 1506151654100 dot 26683 at digraph dot polyomino dot org dot uk> <001801d0a84c$8c5cd7a0$a51686e0$ at com> <20150616164020 dot GA8970 at domone> <001901d0a85d$60857bd0$21907370$ at com> <20150617053502 dot GA13762 at domone> <001b01d0a911$be986160$3bc92420$ at com> <20150617163433 dot GA27278 at domone> <001e01d0a922$2d78fb70$886af250$ at com>

On Wed, Jun 17, 2015 at 06:22:24PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> > >
> > > __isnormal_inl2_t:	1.25	3.67
> > > __isnormal_inl_t:	4.59	2.89
> > > isnormal_t:	1	1
> > >
> > > So using a dedicated builtin for isnormal is important.
> > >
> > That makes result identical to one of isnan. That its slower is bug in
> > fpclassify which should first check for normal, then do unlikely checks.
> 
> It's about twice as slow as isnan as the isnormal check isn't done efficiently.
> Fpclassify is slower still as it does 3 comparisons before setting FP_NORMAL.
>
Then how do you explain the original data? I couldn't find how you
determined that it's twice as slow, when it takes 1.25, which is faster
than isinf at 1.28:
__isnormal_inl_t:  8.77   7.16
isnormal_t:        3.16   3.17
__isinf_inl_t:     1.92   2.99
__isinf_t:         8.9    6.17
isinf_t:           1.28   1.28
__isnan_inl_t:     1.91   1.92
__isnan_t:         15.28  15.28
isnan_t:           1      1.01
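
To make the point about fpclassify ordering concrete, here is a minimal
sketch of a classifier that tests the normal case first and only then
falls through to the unlikely cases. It is not the glibc or gcc
implementation; the name fpclassify_normal_first is hypothetical, and
the code assumes double is IEEE 754 binary64.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not glibc code. Assumes IEEE 754 binary64. */
static inline int fpclassify_normal_first(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* FP -> integer move */

    uint64_t exp  = (bits >> 52) & 0x7ff;    /* biased exponent */
    uint64_t mant = bits & UINT64_C(0x000fffffffffffff);

    /* exp - 1 wraps around for exp == 0, so a single unsigned compare
       accepts exactly the exponents 1..0x7fe, i.e. normal numbers. */
    if (exp - 1 < 0x7fe)
        return FP_NORMAL;

    /* Unlikely cases come after the common one. */
    if (exp == 0)
        return mant ? FP_SUBNORMAL : FP_ZERO;
    return mant ? FP_NAN : FP_INFINITE;
}
```

With this ordering a caller that only wants isnormal pays for one
comparison on the common path, which is the behaviour argued for above.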


 
> > > > > It's certainly correct, but obviously different microarchitectures will show
> > > > > different results. Note the GLIBC private inlines are not particularly good.
> > > > >
> > > > No, problem is that different benchmarks show different results on same
> > > > architecture. To speed things up run following to test all cases of
> > > > environment. Run attached tf script to get results on arm.
> > >
> > > I tried, but I don't think this is a good benchmark - you're not measuring
> > > the FP->int move for the branched version, and you're comparing the signed
> > > version of isinf vs the builtin which does isinf_ns.
> > >
> > Wilco that isinf is signed is again completely irrelevant. gcc is smart.
> > It get expanded into
> > if (foo ? (bar ? 1 : -1) : 0)
> > that gcc simplifies to
> > if (foo)
> > so it doesn't matter that checking sign would take 100 cycles as its
> > deleted code.
> 
> It matters for the -DT3 test.
> 
Only by increasing code size; it should have zero performance impact.
It still first checks for positive/negative infinity, and the jump
skips the bar check.

> > Also as it could make branched version only slower when it beats builtin
> > then also nonsigned one would beat builtin.
> > 
> > And do you have assembly to show it doesn't measure move or its just
> > your guess? On x64 it definitely measures move and I could add that gcc
> > messes that bit by moving several times. objdump -d
> > on gcc   -DT2 -DBRANCHED  -DI1="__attribute((always_inline))" -
> > DI2="__attribute__((always_inline))" ft.c   -c
> > clearly shows that conversion is done several times.
> 
> Look at what the inner loop generates for T1 (T2/T3 do the same):
> 
> .L27:
>         ldr     x2, [x1]
>         cmp     x4, x2, lsl 1
>         bne     .L29
>         fadd    d0, d0, d1
> .L29:
>         add     x1, x1, 8
>         cmp     x1, x3
>         bne     .L27
> 
> > So will you publish results or not as it would show your builtins in
> > unfavorable ligth?
> 
> I don't see the point until your benchmark measures the right thing. 
> Note my benchmark carefully avoids this gotcha.
>
It looks like gcc on arm optimizes this better than on x64. I added an
addition to force the value into a floating-point register. Then I
needed to add volatile to force gcc to emit better code. Using assembly
would be best, but that makes the benchmark platform-specific. I don't
know whether dropping volatile would help, and I don't want to add arm
assembly without testing it, so add arm assembly yourself if gcc also
produces suboptimal code there.

Those changes give different numbers on x64, which are as follows:

don't inline
conditional add
branched

real	0m1.080s
user	0m1.079s
sys	0m0.000s
builtin

real	0m1.079s
user	0m1.075s
sys	0m0.003s
branch
branched

real	0m1.031s
user	0m1.031s
sys	0m0.000s
builtin

real	0m0.848s
user	0m0.848s
sys	0m0.000s
sum
branched

real	0m1.155s
user	0m1.154s
sys	0m0.000s
builtin

real	0m1.003s
user	0m1.002s
sys	0m0.000s
inline outer call
conditional add
branched

real	0m0.618s
user	0m0.617s
sys	0m0.000s
builtin

real	0m0.771s
user	0m0.771s
sys	0m0.000s
branch
branched

real	0m0.618s
user	0m0.617s
sys	0m0.000s
builtin

real	0m0.618s
user	0m0.617s
sys	0m0.000s
sum
branched

real	0m0.693s
user	0m0.689s
sys	0m0.003s
builtin

real	0m0.694s
user	0m0.693s
sys	0m0.000s
inline inner call
conditional add
branched

real	0m0.800s
user	0m0.800s
sys	0m0.000s
builtin

real	0m0.694s
user	0m0.694s
sys	0m0.000s
branch
branched

real	0m0.618s
user	0m0.614s
sys	0m0.003s
builtin

real	0m0.618s
user	0m0.617s
sys	0m0.000s
sum
branched

real	0m0.695s
user	0m0.694s
sys	0m0.000s
builtin

real	0m0.747s
user	0m0.747s
sys	0m0.000s
tigth loop
conditional add
branched

real	0m0.227s
user	0m0.226s
sys	0m0.000s
builtin

real	0m0.255s
user	0m0.254s
sys	0m0.000s
branch
branched

real	0m0.225s
user	0m0.224s
sys	0m0.000s
builtin

real	0m0.234s
user	0m0.233s
sys	0m0.000s
sum
branched

real	0m0.270s
user	0m0.269s
sys	0m0.000s
builtin

real	0m0.391s
user	0m0.390s
sys	0m0.000s

Attachment: ft.c
Description: Text document

Attachment: tf
Description: Text document
