This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S


On Sat, Aug 29, 2015 at 12:52 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Fri, Aug 28, 2015 at 06:05:53AM -0700, H.J. Lu wrote:
>> For x86-64 memcpy/mempcpy, we choose the best implementation in the
>> following order:
>>
>> 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
>> 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
>> 3. __memcpy_sse2 if SSSE3 isn't available.
>> 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
>> 5. __memcpy_ssse3
>>
>> In libc.a and ld.so, we choose __memcpy_sse2_unaligned which is optimized
>> for current Intel and AMD x86-64 processors.
>>
>> OK for master?
>>
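To make the quoted selection order concrete, here is a minimal C sketch of the
resolver logic. The cpu_has_* helpers and select_memcpy are hypothetical
stand-ins for glibc's real CPU feature tests, and the actual selector in
sysdeps/x86_64/multiarch/memcpy.S is written in assembly, so this only
illustrates the order, not the real code:

  #include <stddef.h>

  /* Hypothetical feature-test helpers standing in for glibc's real
     CPU-feature checks.  */
  extern int cpu_has_avx_fast_unaligned_load (void);
  extern int cpu_has_fast_unaligned_load (void);
  extern int cpu_has_ssse3 (void);
  extern int cpu_has_fast_copy_backward (void);

  extern void *__memcpy_avx_unaligned (void *, const void *, size_t);
  extern void *__memcpy_sse2_unaligned (void *, const void *, size_t);
  extern void *__memcpy_sse2 (void *, const void *, size_t);
  extern void *__memcpy_ssse3_back (void *, const void *, size_t);
  extern void *__memcpy_ssse3 (void *, const void *, size_t);

  typedef void *(*memcpy_fn) (void *, const void *, size_t);

  /* Pick memcpy in the order listed in the quoted mail.  */
  static memcpy_fn
  select_memcpy (void)
  {
    if (cpu_has_avx_fast_unaligned_load ())   /* 1 */
      return __memcpy_avx_unaligned;
    if (cpu_has_fast_unaligned_load ())       /* 2 */
      return __memcpy_sse2_unaligned;
    if (!cpu_has_ssse3 ())                    /* 3 */
      return __memcpy_sse2;
    if (cpu_has_fast_copy_backward ())        /* 4 */
      return __memcpy_ssse3_back;
    return __memcpy_ssse3;                    /* 5 */
  }
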
> This has several unrelated changes. The first is moving files around to
> establish the new default, which looks OK but produces a large diff that
> hides the other changes.
>
> The second is mempcpy support. I also had some patches that add it along
> with a better memcpy; I could resend those. This version looks reasonable,
> but we could do better: since mempcpy is rarely used, it should just set up
> its return value and jump to a shared entry point in memcpy (sketched
> below).
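As I read the suggestion above, the mempcpy entry point differs from memcpy
only in returning dest + n, so it can set up that return value and jump into
the shared copy code instead of duplicating it. A minimal C sketch of the
relationship (my_mempcpy is an illustrative name, not an actual glibc entry
point; the real optimization would be done at the assembly level):

  #include <stddef.h>
  #include <string.h>

  /* Illustrative only: mempcpy is memcpy with a dest + n return value.
     The suggestion in the mail is to do this at the assembly level, with
     the mempcpy entry point setting up dest + n and then jumping into
     memcpy's shared copy loop, rather than keeping a second full copy of
     the code.  */
  static void *
  my_mempcpy (void *dest, const void *src, size_t n)
  {
    memcpy (dest, src, n);
    return (char *) dest + n;
  }
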
>
> The third is the ifunc selection. The problem is that what you do here is
> wrong. I had a comment on my todo list: fix the ssse3 memcpy and remove the
> ifunc hack.
>
> There were some problems on Atom that I don't recall, but when I look at
> the graphs, the sse2 implementation looks better up to around 400 bytes:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_gcc/result.html
> http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_rand/result.html
>
> When I tested it on Core 2, memcpy_ssse3 was slower than even memcpy_sse2
> on around half of the applications, so I wrote a separate patch to fix that
> performance regression. This is somewhat theoretical: while ssse3 is faster
> on longer inputs, sse2 and sse2_unaligned are faster on shorter ones, so
> changing the selection now would help applications that mainly do long
> copies but harm others.
>
> The __memcpy/__mempcpy_ssse3_back selection should just be deleted: we set
> the Fast_Copy_Backward bit only on i3/i5/i7, where Fast_Unaligned_Load is
> also set, so those variants are never used. The same applies to memmove
> once my patch that implements an unaligned memmove is checked in.
>
> Finally, when there is no ssse3, memcpy_sse2_unaligned is again the faster
> one, because the memcpy_sse2 name is a lie: it doesn't do sse2 moves, only
> 8-byte ones, which makes it around 30% slower on larger inputs on the
> Phenom II I tested, and also slower on the gcc workload. Again, I wrote a
> patch that fixes this by adding a variant that does sse2 loads/stores with
> shifts; then we could drop the sse2 default.
>
> I did a quick retest here:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old.html
> with this profiler, where for simplicity I used the memmove variants of the ssse3 routines:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old290815.tar.bz2
>
>

I will hold off my memcpy patch until the SSSE3 issue is sorted out.

Thanks.


-- 
H.J.

