This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
[I *was* going to attach a nice graph of this data to this message, but apparently the mailing list won't let me do that.]

On 08/11/2015 05:32 AM, Ondřej Bílka wrote:
> Its actually very easy to see impact of plt bypassing there.  For
> memset you have problem that calloc(1024) could be considerably
> faster, you need just read benchtests below.  As builtin_memset got
> compliled into jmp memset@plt it shows that overhead isn't noticable.
> Same with memcpy which gets called in realloc with big arguments.  I
> could dig more cases.
>
>                                    builtin_memset  simple_memset  __memset_sse2  __memset_avx2
> Length      1, alignment 0, c -65:       15.4062        7.17188        10.9375        10.2031
> Length      2, alignment 0, c -65:       13.5           8.89062        11.3125         9.64062
> Length      4, alignment 0, c -65:       12.9844       12.0938         11.0312         8.84375
> Length      8, alignment 0, c -65:       11.7344       15.5469         10.6094         7.64062
> Length     16, alignment 0, c -65:       15.0781       23.9688         10.3281        10.4219
> Length     32, alignment 0, c -65:       14.7031       37.4219          9.57812       10.7031
> Length     64, alignment 0, c -65:       14            80.8438          9.6875         9.21875
> Length    128, alignment 0, c -65:       18.1094      137.812          15.3125        19.0781
> Length    256, alignment 0, c -65:       15.2656      272.141          21.2656        11.7812
> Length    512, alignment 0, c -65:       19.2656      502.469          34.3594        18.6562
> Length   1024, alignment 0, c -65:       32.7188      940.766          63.6719        31.2812
> Length   2048, alignment 0, c -65:       61.7188     1880.83          121             60.7812
> Length   4096, alignment 0, c -65:      118.172      3718.7           239.469        118.641
> Length   8192, alignment 0, c -65:      255.141      7373.38          469.422        252.125
> Length  16384, alignment 0, c -65:      484.359     15742.4          1478.39         481.812
> Length  32768, alignment 0, c -65:      990.562     29551.1          1978.39         966.047
> Length  65536, alignment 0, c -65:     6163.86     64354.7           5663.06        5779.97
> Length 131072, alignment 0, c -65:    12244.4     129994            11414.3        11640.7

If I understood you correctly, the difference between builtin_memset and __memset_avx2 should be exactly the PLT overhead, and the other two are just in there for comparison.
I would not call a difference of approximately four microseconds on short calls to memset "isn't noticeable".  (Are these numbers microseconds?)

A microbenchmark cannot address the question of whether having both the SSE2 and AVX2 implementations of memset in the cache measurably harms *overall* performance.  I observe that AVX only starts to be a consistent win vs SSE2 at about 256 bytes.

Very small memsets should of course be being inlined, but I wonder if a unified implementation that doesn't bother with AVX2 for fewer than 256 bytes, and internally tests the CPU features for larger blocks, would wind up being better overall.  (HJ just posted patches that would make testing the CPU features every single time quite cheap.)  If that turns out to be the case for memcpy and memmove as well, maybe this entire IFUNC mess could just be junked.

zw
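The shape of the unified implementation being floated above can be sketched as follows. This is a hypothetical illustration, not glibc code: the 256-byte threshold comes from the observation in this thread, the helper names are made up, and the bodies are portable placeholders standing in for the hand-written SSE2/AVX2 loops. GCC's `__builtin_cpu_supports` is used as an example of a cheap, cached CPU-feature test of the kind HJ's patches concern.

```c
/* Hypothetical size-first memset dispatch: no IFUNC, one common path
   for small sizes, and a cheap feature test only for large blocks.
   Threshold and helper names are assumptions for illustration.  */
#include <stddef.h>

/* Portable stand-in; a real version would be a tuned scalar/SSE2 loop. */
static void *memset_small(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}

/* Placeholder bodies standing in for vectorized implementations.  */
static void *memset_large_avx2(void *dst, int c, size_t n)
{
    return memset_small(dst, c, n);
}

static void *memset_large_sse2(void *dst, int c, size_t n)
{
    return memset_small(dst, c, n);
}

void *unified_memset(void *dst, int c, size_t n)
{
    if (n < 256)
        /* Below ~256 bytes AVX2 showed no consistent win in the
           numbers above, so skip the feature test entirely.  */
        return memset_small(dst, c, n);

    /* For large blocks the feature test is amortized over the fill;
       __builtin_cpu_supports reads a cached CPUID result.  */
    if (__builtin_cpu_supports("avx2"))
        return memset_large_avx2(dst, c, n);
    return memset_large_sse2(dst, c, n);
}
```

The point of the size-first branch is that the common small-size case pays neither the IFUNC indirection nor any feature check, while the check that remains is negligible relative to filling 256+ bytes.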
Attachment:
signature.asc
Description: OpenPGP digital signature