This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] x86-64: Optimize e_expf with FMA [BZ #21912]

From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
To: libc-alpha at sourceware dot org
Date: Wed, 16 Aug 2017 12:00:02 -0300
Subject: Re: [PATCH] x86-64: Optimize e_expf with FMA [BZ #21912]
Authentication-results: sourceware.org; auth=none
References: <6d943008-ffb1-af02-ddf3-f287fe99ff1f@redhat.com> <aedfbd8e-8973-ddff-9bbf-9eac89c1d633@linux.intel.com> <59945D11.3050908@arm.com>


On 16/08/2017 11:56, Szabolcs Nagy wrote:
> On 16/08/17 15:31, Arjan van de Ven wrote:
>> On 8/16/2017 7:04 AM, Carlos O'Donell wrote:
>>> On 08/16/2017 09:34 AM, H.J. Lu wrote:
>>>> FMA optimized e_expf improves performance by more than 50% on Skylake.
>>>>
>>>> Any comments?
>>>
>>> Exactly how much of e_expf-fma.S do you need to achieve that 50% speedup?
>>
>> the core "fast path"
>> (the bit after    /* Main path: here if 2^(-28)<=|x|<125*log(2) */ )
>>
>>
>>>
>>> How does this algorithm compare to what is already implemented for e_expf?
>>
>> I started with the SSE version of that e_expf, turned it into AVX, used FMA where possible and fixed a few
>> glass jaws in the fast path that you hit on skylake.
>>
>> the slow path is more a direct 1:1 translation from SSE to AVX (because mixing SSE and AVX
>> is generally a bad idea)
>>
> 
> based on my benchmarks portable c code can
> easily beat the hand written sse asm
> (i haven't tested with avx+fma though).
> 
> the idea is that the x86 asm has overkill
> precision (very close to 0.5 ulp error, but
> not correctly rounded), we can debate this
> later, but i think the polynomial can be
> reduced and there should not be much difference
> between asm and c performance (only the
> round/convert to int operation is tricky:
> for different targets the optimal code is
> different, but that can be a target specific
> macro hook).
> 
> anyway i posted my code to the arm
> optimized-routines github repo, i'll start
> posting the patches to glibc soon.
> 
> (one of the reasons posting glibc patches is
> difficult is the nonsensical target specific
> asm codes and ifunc resolvers that break when
> i update the generic code in a way that
> bypasses the wrapper function which is another
> source of improvements.)
> 

Yes, the way the generic implementation is included as the default ifunc
variant could use some cleanup.  However, most if not all of these
breakages can be caught with build-many-glibcs.py (it would take time though).
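
For reference, the structure Szabolcs describes above (portable C, a short
polynomial, and the round/convert-to-int step behind a target-specific macro
hook) could look roughly like the sketch below.  This is only an illustration
under stated assumptions, not the code from the optimized-routines repository
or from the patch: the names sketch_expf, sketch_pow2 and SKETCH_ROUND_TOINT
are hypothetical, the coefficients are plain Taylor terms rather than a tuned
fit, and only the main path is handled (no overflow, underflow, NaN or
infinity handling).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical target hook: round a double to the nearest integer.
   A port could override this with its preferred instruction sequence;
   this generic default is plain C.  */
#ifndef SKETCH_ROUND_TOINT
# define SKETCH_ROUND_TOINT(z) \
  ((int64_t) ((z) + ((z) >= 0.0 ? 0.5 : -0.5)))
#endif

/* Build 2^k as a double directly from its exponent bits.
   Valid only while k + 1023 stays in the normal exponent range.  */
static double
sketch_pow2 (int64_t k)
{
  uint64_t bits = (uint64_t) (k + 1023) << 52;
  double d;
  memcpy (&d, &bits, sizeof d);
  return d;
}

/* Sketch of expf for the main path only (assumed here: |x| < 86).  */
float
sketch_expf (float x)
{
  static const double inv_ln2 = 1.4426950408889634;
  static const double ln2 = 0.6931471805599453;

  assert (x > -86.0f && x < 86.0f);

  /* Range reduction: x = k*ln2 + r, with |r| <= ln2/2.  */
  double xd = x;
  int64_t k = SKETCH_ROUND_TOINT (xd * inv_ln2);
  double r = xd - (double) k * ln2;

  /* Degree-6 Taylor polynomial for exp(r), evaluated in Horner form.
     Working in double keeps rounding error well below float precision.  */
  double p = 1.0 + r * (1.0 + r * (1.0 / 2 + r * (1.0 / 6
             + r * (1.0 / 24 + r * (1.0 / 120 + r * (1.0 / 720))))));

  /* Reconstruct: exp(x) = 2^k * exp(r).  */
  return (float) (p * sketch_pow2 (k));
}
```

In a layout like this, only SKETCH_ROUND_TOINT would need a per-target
definition; the polynomial degree is the knob for trading precision against
speed.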

