

Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S


On Fri, Aug 28, 2015 at 06:05:53AM -0700, H.J. Lu wrote:
> For x86-64 memcpy/mempcpy, we choose the best implementation by the
> order:
> 
> 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
> 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
> 3. __memcpy_sse2 if SSSE3 isn't available.
> 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
> 5. __memcpy_ssse3
> 
> In libc.a and ld.so, we choose __memcpy_sse2_unaligned which is optimized
> for current Intel and AMD x86-64 processors.
> 
> OK for master?
> 
This patch has several unrelated changes. The first is moving files to create the
new default, which looks OK but produces a large diff that hides the other changes.

The second is mempcpy support. I also had some patches that add it along with a
better memcpy; I could resend those. As it stands this one looks reasonable, but we
could do better: since mempcpy is rarely used, it should just set up the return
value and then jump into the shared memcpy code.
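
To make that concrete, here is a minimal C sketch of the idea. It is not the glibc
code (where this is done in assembly by setting up the return register and jumping
into the shared memcpy body), and mempcpy_sketch is just an illustrative name:

#include <string.h>

/* Sketch only: mempcpy differs from memcpy solely in returning
   dst + n, so it can compute that value up front and reuse memcpy
   for the actual copy.  In the assembly implementations the
   equivalent is to set up the return register and jump into the
   memcpy body so the copy loops are shared.  */
static void *
mempcpy_sketch (void *dst, const void *src, size_t n)
{
  return (char *) memcpy (dst, src, n) + n;
}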

The third is the ifunc selection. The problem is that what you do here is wrong. I
had this comment on my todo list: fix ssse3 memcpy and remove the ifunc hack.
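
For reference, the selection order quoted above amounts to a resolver along these
lines. This is only an illustrative C sketch: the real selector is written in
assembly in sysdeps/x86_64/multiarch/memcpy.S, and the cpu_has_* helpers are
hypothetical stand-ins for the CPU feature bits, not glibc interfaces.

#include <stddef.h>

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Candidate implementations (only declared here).  */
extern void *__memcpy_avx_unaligned (void *, const void *, size_t);
extern void *__memcpy_sse2_unaligned (void *, const void *, size_t);
extern void *__memcpy_sse2 (void *, const void *, size_t);
extern void *__memcpy_ssse3_back (void *, const void *, size_t);
extern void *__memcpy_ssse3 (void *, const void *, size_t);

/* Hypothetical feature tests standing in for the real feature bits.  */
extern int cpu_has_avx_fast_unaligned_load (void);
extern int cpu_has_fast_unaligned_load (void);
extern int cpu_has_ssse3 (void);
extern int cpu_has_fast_copy_backward (void);

/* Pick an implementation following the quoted order 1-5.  */
static memcpy_fn
select_memcpy (void)
{
  if (cpu_has_avx_fast_unaligned_load ())
    return __memcpy_avx_unaligned;
  if (cpu_has_fast_unaligned_load ())
    return __memcpy_sse2_unaligned;
  if (!cpu_has_ssse3 ())
    return __memcpy_sse2;
  if (cpu_has_fast_copy_backward ())
    return __memcpy_ssse3_back;
  return __memcpy_ssse3;
}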

There were some problems on Atom that I don't recall, but when I look at its graphs,
an sse2 implementation looks better up to around 400 bytes:
http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_gcc/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_rand/result.html

When I tested it on core2, memcpy_ssse3 was slower than even memcpy_sse2 on around
half of the applications, so I wrote a separate patch to fix that performance
regression. So this is more theoretical: while ssse3 is faster on longer inputs,
sse2 and sse2_unaligned are faster on shorter inputs. Changing it now would help
applications that mainly use long inputs but harm the others.

The __memcpy/mempcpy should just be deleted: we set that bit only for i3/i5/i7,
where we also set Fast_Unaligned_Load, so these are never used. memmove is such a
case as well, once we take into account my patch that implements an unaligned
memmove.

Finally, when there is no ssse3, memcpy_sse2_unaligned is again faster, because the
memcpy_sse2 name is a lie: it doesn't do sse2 moves, only 8-byte ones, which makes
it around 30% slower on larger inputs on the phenomII I tested, and also slower on
a gcc workload. Again, I wrote a patch that fixes that by adding a variant that does
sse2 loads/stores with shifts (sketched below); then we could drop the sse2 default.
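
For anyone unfamiliar with the technique, "sse2 loads/stores with shifts" means
doing the copy in 16-byte SSE2 registers and reading a misaligned source with two
aligned loads combined by byte shifts. The snippet below is only a sketch of that
idea, hard-coded for a source that sits 4 bytes past a 16-byte boundary; a real
implementation dispatches on the actual misalignment (0..15), since the shift
counts must be compile-time constants, and handles the buffer ends separately.

#include <emmintrin.h>
#include <stdint.h>

/* Sketch only: emulate one unaligned 16-byte load from SRC, assumed
   to be exactly 4 bytes past a 16-byte boundary, using two aligned
   loads combined with byte shifts.  It touches bytes up to src + 27,
   so a real copy loop must handle the tail separately.  */
static __m128i
load16_shifted_by_4 (const uint8_t *src)
{
  const __m128i *base = (const __m128i *) (src - 4);  /* 16-byte aligned */
  __m128i lo = _mm_load_si128 (base);       /* bytes src-4 .. src+11  */
  __m128i hi = _mm_load_si128 (base + 1);   /* bytes src+12 .. src+27 */
  return _mm_or_si128 (_mm_srli_si128 (lo, 4),    /* src    .. src+11 */
                       _mm_slli_si128 (hi, 12));  /* src+12 .. src+15 */
}

The copy loop then writes the combined value to the (aligned) destination with
ordinary 16-byte stores.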

I did a quick retest here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old.html
using this profiler, where for simplicity I used the memmove variants of the ssse3 routines:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old290815.tar.bz2


