This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Optimized generic expf and exp2f


On 06/09/17 14:41, Arjan van de Ven wrote:
> On 9/6/2017 6:16 AM, Wilco Dijkstra wrote:
>> Arjan van de Ven wrote:
>>>
>>> I'm seeing a 16% throughput increase (not 1.5x) but still impressive.
>>
>> Was that using the expf trace input or something else? And with wrapper?
>>
>>> I do see different numerical answers between the two (I had to disable
>>> the code in my bench that detects differences) and sampling a few
>>> it seems that the C code is a little bit less accurate in places,
>>> likely a simpler polynomial.
>>> (for example for  20.636783599853515625    as input)
>>
>> It's still way more accurate than necessary. The only reason is to
>> minimize ULP error for non-nearest rounding modes. If you don't
>> care about worst-case ULP for non-standard rounding modes, the
>> polynomial can be further simplified within 1ULP max error in round
>> to nearest.
> 
> interesting; it takes 2 independent FP adds and a compare (in C) to detect nearest rounding
> being in effect (which in time can overlap with the float->double conversion)
> so if there's an option to reduce the algorithm by more than that for a fast
> path...
> 
> (also, some CPUs (like newer Intel) support an instruction prefix encoding to force
> rounding modes on a FP instruction independent of the global rounding mode,
> which at some point maybe should be a gcc pragma or attribute or something,
> and then used in such C code)
> 
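
For reference, one way such a check could look in C is sketched below. This is only an illustration of "two independent adds and a compare", not necessarily the sequence Arjan has in mind; the function name and the constant 0x1p-30f are made up for the example. The idea: 1 + tiny rounds back to 1 in every mode except upward, and 1 - tiny rounds back to 1 in every mode except downward and toward zero, so the two sums are equal only in round to nearest. The volatile qualifiers are there to stop the compiler from folding the adds at compile time or keeping excess precision.

```c
/* Hypothetical helper: returns 1 if the current rounding mode is
   round-to-nearest, 0 otherwise. */
static int is_round_to_nearest(void)
{
    volatile float one = 1.0f;
    volatile float tiny = 0x1p-30f;   /* far below half an ulp of 1.0f */

    /* Two independent adds; neither depends on the other's result. */
    volatile float up = one + tiny;   /* > 1 only when rounding upward */
    volatile float down = one - tiny; /* < 1 only when rounding down or to zero */

    /* One compare: both sums collapse to 1.0f only in round to nearest. */
    return up == down;
}
```

Whether this is cheap enough to pay for a simpler fast path would have to be measured.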

i don't think reducing the polynomial (from order 3 to order 2)
is possible without a bigger lookup table. if less accuracy is
enough, then reducing the table size is possible though:

poly order / table len / ulp error / non-nearest ulp error (rounded)
2          / 64        / 0.61      /
2          / 128       / 0.51      /
2          / 256       / 0.502     /
3          / 8         / 0.91      / > 10
3          / 16        / 0.526     / 2
3          / 32        / 0.502     / 1
3          / 64        / 0.5001    / 1
4          / 8         / 0.54      /
4          / 16        / 0.501     /
4          / 32        / 0.50004   /
4          / 64        / 0.5       /

the c code uses order=3/table=32, the x86_64 asm uses order=4/table=64
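
To make the two knobs in the table concrete, here is a minimal sketch of the table plus polynomial scheme for exp2f with order=3/table=32. It is not the code from the patch: the names are invented, the polynomial uses plain Taylor coefficients instead of tuned ones, the table is filled by repeated multiplication instead of precomputed constants, there is no overflow/underflow or special-case handling (assume roughly |x| < 126), and the lazy table init is not thread safe. It should not be expected to reproduce the ulp numbers above.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 5
#define TABLE_LEN (1 << TABLE_BITS)     /* table len = 32 */

static double tab[TABLE_LEN];           /* tab[i] ~= 2^(i/32) */
static int tab_ready;

static void init_table(void)
{
    const double step = 1.0218971486541166;  /* 2^(1/32) */
    tab[0] = 1.0;
    for (int i = 1; i < TABLE_LEN; i++)
        tab[i] = tab[i - 1] * step;     /* a real table would be precomputed */
    tab_ready = 1;
}

/* Hypothetical sketch: 2^x = 2^e * 2^(i/32) * 2^r with |r| <= 1/64. */
float sketch_exp2f(float x)
{
    if (!tab_ready)
        init_table();

    double xd = (double)x;
    double z = xd * TABLE_LEN;
    int k = (int)(z + (z >= 0 ? 0.5 : -0.5));   /* k = round(32 * x) */
    double r = xd - (double)k / TABLE_LEN;      /* reduced argument */

    int i = k & (TABLE_LEN - 1);                /* table index, 0..31 */
    int e = (k - i) / TABLE_LEN;                /* integer exponent */

    /* Order 3 polynomial for 2^r = exp(r * ln2), Taylor coefficients. */
    const double ln2 = 0.6931471805599453;
    double t = r * ln2;
    double p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0)));

    /* Build 2^e directly from the IEEE double exponent field. */
    uint64_t bits = (uint64_t)(1023 + e) << 52;
    double scale;
    memcpy(&scale, &bits, sizeof scale);

    return (float)(scale * tab[i] * p);
}
```

Doubling the table length halves the range of r, which is why a lower order polynomial needs a bigger table to reach the same error.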

