This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Fwd: Re: [PATCH] Remove unnecessary IFUNC dispatch for __memset_chk.


[I *was* going to attach a nice graph of this data to this message, but
apparently the mailing list won't let me do that.]

On 08/11/2015 05:32 AM, Ondřej Bílka wrote:
> It's actually very easy to see the impact of PLT bypassing
> there. For memset you have the problem that calloc(1024) could be
> considerably faster; you need only read the benchtests below. As
> builtin_memset got compiled into jmp memset@plt, it shows that the overhead
> isn't noticeable. Same with memcpy, which gets called in realloc with big
> arguments. I could dig up more cases.
> 
>                                     builtin_memset   simple_memset   __memset_sse2   __memset_avx2
> 
> Length      1, alignment 0, c -65:         15.4062         7.17188         10.9375         10.2031
> Length      2, alignment 0, c -65:         13.5            8.89062         11.3125          9.64062
> Length      4, alignment 0, c -65:         12.9844        12.0938          11.0312          8.84375
> Length      8, alignment 0, c -65:         11.7344        15.5469          10.6094          7.64062
> Length     16, alignment 0, c -65:         15.0781        23.9688          10.3281         10.4219
> Length     32, alignment 0, c -65:         14.7031        37.4219           9.57812        10.7031
> Length     64, alignment 0, c -65:         14             80.8438           9.6875          9.21875
> Length    128, alignment 0, c -65:         18.1094       137.812           15.3125         19.0781
> Length    256, alignment 0, c -65:         15.2656       272.141           21.2656         11.7812
> Length    512, alignment 0, c -65:         19.2656       502.469           34.3594         18.6562
> Length   1024, alignment 0, c -65:         32.7188       940.766           63.6719         31.2812
> Length   2048, alignment 0, c -65:         61.7188      1880.83           121              60.7812
> Length   4096, alignment 0, c -65:        118.172       3718.7            239.469         118.641
> Length   8192, alignment 0, c -65:        255.141       7373.38           469.422         252.125
> Length  16384, alignment 0, c -65:        484.359      15742.4           1478.39          481.812
> Length  32768, alignment 0, c -65:        990.562      29551.1           1978.39          966.047
> Length  65536, alignment 0, c -65:       6163.86       64354.7           5663.06         5779.97
> Length 131072, alignment 0, c -65:      12244.4       129994            11414.3         11640.7

If I understood you correctly, the difference between builtin_memset and
__memset_avx2 should be exactly the PLT overhead, and the other two
columns are there only for comparison.  I would not describe a difference
of approximately four microseconds on short calls to memset as "not
noticeable".  (Are these numbers in fact microseconds?)

A microbenchmark cannot address the question of whether having both the
SSE2 and AVX2 implementations of memset in the cache measurably harms
*overall* performance.

I observe that AVX2 only starts to be a consistent win over SSE2 at about
256 bytes.  Very small memsets should of course be inlined at the call
site, but I wonder if a unified implementation that doesn't bother with
AVX2 for fewer than 256 bytes, and internally tests the CPU features for
larger blocks, would wind up being better overall.  (HJ just posted
patches that would make testing the CPU features on every single call
quite cheap.)  If that turns out to be the case for memcpy and memmove as
well, maybe this entire IFUNC mess could just be junked.
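To make the idea concrete, here is a minimal sketch in C of such a
unified dispatcher, assuming (per HJ's patches mentioned above) that a
per-call CPU-feature test is cheap.  The name unified_memset, the
256-byte threshold, and the byte-loop stand-ins for the SSE2/AVX2
variants are all hypothetical illustrations, not glibc internals:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the SSE2 memset variant (a plain byte loop here,
   purely for illustration). */
static void *memset_sse2_path(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char) c;
    return dst;
}

/* Stand-in for the AVX2 memset variant. */
static void *memset_avx2_path(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char) c;
    return dst;
}

/* Hypothetical unified memset: no IFUNC dispatch at relocation time;
   instead, a size cutoff plus a cheap per-call feature test. */
void *unified_memset(void *dst, int c, size_t n)
{
    /* Below ~256 bytes AVX2 shows no consistent win in the numbers
       above, so take the SSE2 path unconditionally and skip the
       feature test entirely. */
    if (n < 256)
        return memset_sse2_path(dst, c, n);

#if defined(__x86_64__) || defined(__i386__)
    /* For larger blocks, test the CPU features at call time.
       __builtin_cpu_supports is a GCC/Clang builtin on x86. */
    if (__builtin_cpu_supports("avx2"))
        return memset_avx2_path(dst, c, n);
#endif
    return memset_sse2_path(dst, c, n);
}
```

Whether this actually beats IFUNC dispatch would of course have to be
settled by benchmarks, not by the sketch.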

zw


