On 07/31/2017 05:30 PM, Patrick McGehearty wrote:
On 7/31/2017 3:12 PM, Carlos O'Donell wrote:
On 07/31/2017 03:47 PM, David Miller wrote:
From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
Date: Mon, 31 Jul 2017 15:39:29 -0400
This PATCH is intended to improve exp() and expf() performance on Sparc.
These changes will only be active on Sparc platforms and only for
those platforms that support HWCAP_SPARC_CRYPTO (niagara4 and later).
Can you explain which instructions exactly help make the compiled
C code for exp() and expf() faster instead of being vague like
this?
Wouldn't the new C code you are adding be faster on other CPUs as
well, even without gcc generating instructions for Niagara 4 and
later?
... I would also like to see the results of the glibc microbenchmark
*before* and *after* the patches. We have *specific* microbenchmarks
for lots of math functions.
I'm assuming you are referring to the results of running
"make bench". I found some exp results in benchtests/bench.out
on my test machine (a single-core VM on a Sparc S7 running at 4.3GHz):
My apologies, I realize I was being too harsh in my last email.
We look forward to more posts from Oracle to help out with SPARC.
Yes, this is exactly what you need to do: run `make bench` to get
data before and after the changes.
ieee754 (before):
  "exp": {
    "": {
      "duration": 4.34656e+10,
      "iterations": 8.224e+06,
      "max": 16550.3,
      "min": 400.426,
      "mean": 5285.21
    },
new sparc code (after):
  "exp": {
    "": {
      "duration": 4.25365e+10,
      "iterations": 6.07034e+08,
      "max": 183.446,
      "min": 27.095,
      "mean": 70.0726
    },
I have to say that the ratio of 5285/70 = 75x speedup seems way
too optimistic for my new code. I have not investigated the reason
for the apparently super slow max value.
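
As a sanity check on the arithmetic, the "mean" field in each entry appears to be simply "duration" divided by "iterations", and the quoted speedup is the ratio of the two means. The sketch below recomputes both from the numbers quoted above. The function names are made up for illustration and are not part of glibc's benchtests, and the unit of "duration" is taken to be whatever the bench harness reports.

```c
#include <assert.h>
#include <stdio.h>

/* Mean cost per call, in the same unit as "duration" in bench.out. */
double mean_per_iteration(double duration, double iterations)
{
    assert(iterations > 0.0);
    return duration / iterations;
}

/* How many times faster the "after" code is than the "before" code. */
double speedup(double before_mean, double after_mean)
{
    assert(after_mean > 0.0);
    return before_mean / after_mean;
}

/* Recompute the means and their ratio from the two exp entries quoted above. */
void report_exp_speedup(void)
{
    double before = mean_per_iteration(4.34656e+10, 8.224e+06);
    double after = mean_per_iteration(4.25365e+10, 6.07034e+08);

    printf("before mean: %.2f\n", before);
    printf("after mean:  %.2f\n", after);
    printf("speedup:     %.1fx\n", speedup(before, after));
}
```

Calling report_exp_speedup() from main() reproduces the two "mean" values to within rounding and a ratio of roughly 75x. The same bench.out data also shows the run count going from 8.224e+06 to 6.07034e+08 iterations for a similar total duration.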
It looks like you've made significant performance gains for the input
values being tested by the microbenchmark.
Could one reason for the 75x be that you've eliminated a slow path
that used to go through multi-precision arithmetic?
I wrote my own standalone tests, which timed a variety of values of x
with a repeat factor of 500 (i.e. the computation for each value was timed 500 times).
A sample of results:
x =  10    ieee754 exp(x) = 172 nsec; new exp(x) = 37 nsec (or 51 nsec without t4 optimizations)
x = -10    ieee754 exp(x) = 172 nsec; new exp(x) = 37 nsec
x = -0.9   ieee754 exp(x) = 172 nsec; new exp(x) = 19 nsec
x =  0.9   ieee754 exp(x) = 172 nsec; new exp(x) = 19 nsec
x =  0     ieee754 exp(x) = 116 nsec; new exp(x) =  8 nsec
The ieee754 expf() takes around 200 nsec per call.
The new expf() typically takes around 12-13 nsec per call.
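
For anyone who wants to reproduce this kind of per-input measurement, here is a minimal sketch of a standalone timing loop. It is not the test program used for the numbers above: the function names are hypothetical, it assumes a POSIX clock_gettime() with CLOCK_MONOTONIC, and it times one loop of repeated calls rather than each call individually.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical baseline: does no math, so timing it estimates the
   overhead of the loop and the indirect call. */
double baseline_identity(double x)
{
    return x;
}

/* Average wall-clock nanoseconds per call of fn(x) over `repeats` calls. */
double time_per_call_ns(double (*fn)(double), double x, int repeats)
{
    assert(repeats > 0);

    volatile double arg = x;   /* reloaded on every iteration */
    volatile double sink = 0;  /* keeps the result from being discarded */
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < repeats; i++)
        sink = fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void) sink;

    double elapsed_ns = (double) (end.tv_sec - start.tv_sec) * 1e9
                        + (double) (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / repeats;
}
```

Passing exp from <math.h> as fn (and linking with -lm), with x = 10 and repeats = 500, would approximate the setup described above. Subtracting the figure measured for baseline_identity gives a rough correction for loop overhead. With only 500 calls the clock resolution matters, so several runs per input are advisable.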
Perhaps we need more indicative inputs added to benchtests/exp-inputs?