This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Inline C99 math functions


On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > >
> > > > > __fpclassify_t:	8.76	7.04
> > > > > fpclassify_t:	4.91	5.17
> > > >
> > > > > __isnormal_inl_t:	8.77	7.16
> > > > > isnormal_t:		3.16	3.17
> > > >
> > > > Where did you get inline? I couldn't find it anywhere. Also such big
> > > > number for inline implementation is suspect
> > >
> > > It does (__fpclassify (x) == FP_NORMAL) like math.h which is obviously a bad
> > > idea and the reason for the low performance. Although the GCC isnormal builtin
> > > is not particularly fast, it still beats it by more than a factor of 2.
> > >
> > No, bad idea was not inlining fpclassify, that affects most of performance difference.
> > There is also problem that glibcdev/glibc/sysdeps/ieee754/dbl-64/s_fpclassify.c is bit slow as
> > it tests unlikely cases first but that is secondary.
> 
> Even with the inlined fpclassify (inl2 below), isnormal is slower:
> 
> __isnormal_inl2_t:	1.25	3.67
> __isnormal_inl_t:	4.59	2.89
> isnormal_t:	1	1
> 
> So using a dedicated builtin for isnormal is important.
>
That makes the result identical to that of isnan. That it is slower is a
bug in fpclassify, which should check for normal first and only then do
the unlikely checks.
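
To illustrate the ordering I mean, here is a minimal sketch that tests the
normal case first. It assumes IEEE 754 binary64 doubles; the name
fpclassify_normal_first is made up for this mail and this is not the code
in s_fpclassify.c.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: classify a double, handling the common (normal)
   case with a single comparison before any of the unlikely cases.  */
static inline int fpclassify_normal_first(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* FP->int move */

    uint64_t exp = (bits >> 52) & 0x7ff;     /* biased exponent */

    /* exp - 1 wraps around for exp == 0, so one unsigned comparison
       accepts exactly the exponents 1..0x7fe, i.e. normal numbers.  */
    if (exp - 1 < 0x7fe)
        return FP_NORMAL;

    uint64_t mant = bits & UINT64_C(0x000fffffffffffff);
    if (exp == 0)
        return mant == 0 ? FP_ZERO : FP_SUBNORMAL;
    return mant == 0 ? FP_INFINITE : FP_NAN; /* exp == 0x7ff */
}
```

With this ordering isnormal as (fpclassify (x) == FP_NORMAL) only needs the
first comparison once it is inlined; whether that is actually faster than
the builtin still has to be measured.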

 
> > > It's certainly correct, but obviously different microarchitectures will show
> > > different results. Note the GLIBC private inlines are not particularly good.
> > >
> > No, problem is that different benchmarks show different results on same
> > architecture. To speed things up run following to test all cases of
> > environment. Run attached tf script to get results on arm.
> 
> I tried, but I don't think this is a good benchmark - you're not measuring
> the FP->int move for the branched version, and you're comparing the signed
> version of isinf vs the builtin which does isinf_ns.
> 
Wilco, that isinf is signed is again completely irrelevant. gcc is smart.
It gets expanded into
if (foo ? (bar ? 1 : -1) : 0)
which gcc simplifies to
if (foo)
so it doesn't matter if checking the sign would take 100 cycles, as that
code is deleted.
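
As a sketch of that argument in C (assuming IEEE 754 binary64 doubles; all
function names here are made up for illustration and are not the ones in
my benchmark):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Signed variant: returns 1 for +inf, -1 for -inf, 0 otherwise.  */
static inline int isinf_signed_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Doubling shifts the sign bit out, so both infinities compare equal.  */
    if (2 * bits != UINT64_C(0xffe0000000000000))
        return 0;
    return bits == UINT64_C(0x7ff0000000000000) ? 1 : -1;
}

/* Unsigned variant: returns 1 for either infinity, 0 otherwise.  */
static inline int isinf_unsigned_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return 2 * bits == UINT64_C(0xffe0000000000000);
}

/* Used only as a condition, the 1 / -1 distinction is dead: both
   values are nonzero, so the branch depends on the first test alone.  */
static inline int branch_on_signed(double x)
{
    if (isinf_signed_sketch(x))
        return 1;
    return 0;
}

static inline int branch_on_unsigned(double x)
{
    if (isinf_unsigned_sketch(x))
        return 1;
    return 0;
}

/* Self-check: both branches agree for any input.  */
static inline void check_same_branch(double x)
{
    assert(branch_on_signed(x) == branch_on_unsigned(x));
}
```

Whether the compiler really removes the sign test depends on the
optimization level, so check the generated code rather than trusting the
source.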

Also, the sign check could only make the branched version slower, so when
the signed version beats the builtin, the unsigned one would beat it as well.

And do you have assembly to show that it doesn't measure the move, or is
that just your guess? On x64 it definitely measures the move, and I could
add that gcc messes that up a bit by moving several times. objdump -d
on gcc   -DT2 -DBRANCHED  -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c   -c
clearly shows that the conversion is done several times.

 24c:	48 8b 45 f0          	mov    -0x10(%rbp),%rax
 250:	48 01 d0             	add    %rdx,%rax
 253:	f2 0f 10 00          	movsd  (%rax),%xmm0
 257:	f2 0f 11 45 e8       	movsd  %xmm0,-0x18(%rbp)
 25c:	48 8d 45 c8          	lea    -0x38(%rbp),%rax
 260:	48 89 45 e0          	mov    %rax,-0x20(%rbp)
 264:	f2 0f 10 45 e8       	movsd  -0x18(%rbp),%xmm0
 269:	f2 0f 11 45 d8       	movsd  %xmm0,-0x28(%rbp)
 26e:	f2 0f 10 45 d8       	movsd  -0x28(%rbp),%xmm0
 273:	f2 0f 11 45 c0       	movsd  %xmm0,-0x40(%rbp)
 278:	48 8b 45 c0          	mov    -0x40(%rbp),%rax
 27c:	48 89 45 d0          	mov    %rax,-0x30(%rbp)
 280:	48 8b 45 d0          	mov    -0x30(%rbp),%rax
 284:	48 8d 14 00          	lea    (%rax,%rax,1),%rdx
 288:	48 b8 00 00 00 00 00 	movabs $0xffe0000000000000,%rax
 28f:	00 e0 ff 
 292:	48 39 c2             	cmp    %rax,%rdx
 295:	75 1e                	jne    2b5 <main+0xe3>
 297:	48 b8 00 00 00 00 00 	movabs $0x7ff0000000000000,%rax
 29e:	00 f0 7f 
 2a1:	48 39 45 d0          	cmp    %rax,-0x30(%rbp)
 2a5:	75 07                	jne    2ae <main+0xdc>
 2a7:	b8 01 00 00 00       	mov    $0x1,%eax
 2ac:	eb 0c                	jmp    2ba <main+0xe8>
 2ae:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
 2b3:	eb 05                	jmp    2ba <main+0xe8>
 2b5:	b8 00 00 00 00       	mov    $0x0,%eax
 2ba:	85 c0                	test   %eax,%eax
 2bc:	74 05                	je     2c3 <main+0xf1>


So will you publish the results or not, since they would show your
builtins in an unfavorable light?


> > Which doesn't matter. As gcc optimized unneded checks away you won't do
> > unneeded checks. As using:
> > 
> > __builtin_fpclassify (FP_NAN, FP_INFINITE,           \
> >      FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);
> >   return result == FP_INFINITE || result == FP_NAN;
> > 
> > is slower than:
> > 
> >  return __builtin_isinf (x) ||  __builtin_isnan (x);
> > 
> > Your claim is false, run attached tf2 script to test.
> 
> That's not what I am seeing, using two explicit isinf/isnan calls (test2) is faster
> than inlined fpclassify (test1):
> 
> __fpclassify_test2_t:	1	4.41
> __fpclassify_test1_t:	1.23	4.66
> 
You need to run my benchmark. You get different results depending on
whether that is inside an if or not, so I also want to know the results
of my benchmark.
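
For reference, the two variants under discussion can be sketched like this,
each used both as a plain value and as an if condition. It assumes GCC or a
compatible compiler for the builtins; the function names are made up for
this mail and are not the ones in the tf2 script.

```c
#include <math.h>
#include <stddef.h>

/* Variant 1: classify once, then compare the class.  */
static inline int inf_or_nan_via_fpclassify(double x)
{
    int result = __builtin_fpclassify(FP_NAN, FP_INFINITE, FP_NORMAL,
                                      FP_SUBNORMAL, FP_ZERO, x);
    return result == FP_INFINITE || result == FP_NAN;
}

/* Variant 2: two dedicated builtins.  */
static inline int inf_or_nan_via_builtins(double x)
{
    return __builtin_isinf(x) || __builtin_isnan(x);
}

/* Result used as a value: summed without branching in the source.  */
static inline size_t sum_flags(const double *v, size_t n, int use_fpclassify)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += use_fpclassify ? inf_or_nan_via_fpclassify(v[i])
                                : inf_or_nan_via_builtins(v[i]);
    return total;
}

/* Result used inside an if: the compiler may fold the test into a branch.  */
static inline size_t count_in_if(const double *v, size_t n, int use_fpclassify)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int hit = use_fpclassify ? inf_or_nan_via_fpclassify(v[i])
                                 : inf_or_nan_via_builtins(v[i]);
        if (hit)
            count++;
    }
    return count;
}
```

Both variants return the same answers; only the timing can differ, and that
is what needs to be measured in both contexts.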

