This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 31 Aug 2015 08:41:44 -0700
- Subject: Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S
- Authentication-results: sourceware.org; auth=none
- References: <20150828130553 dot GA14875 at gmail dot com> <20150829075238 dot GB8463 at domone>
On Sat, Aug 29, 2015 at 12:52 AM, Ondřej Bílka <email@example.com> wrote:
> On Fri, Aug 28, 2015 at 06:05:53AM -0700, H.J. Lu wrote:
>> For x86-64 memcpy/mempcpy, we choose the best implementation in the
>> following order:
>> 1. __memcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
>> 2. __memcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set.
>> 3. __memcpy_sse2 if SSSE3 isn't available.
>> 4. __memcpy_ssse3_back if the Fast_Copy_Backward bit is set.
>> 5. __memcpy_ssse3 otherwise.
>> In libc.a and ld.so, we choose __memcpy_sse2_unaligned, which is optimized
>> for current Intel and AMD x86-64 processors.
>> OK for master?
> This has several unrelated changes. First is moving files to make a new
> default, which looks OK but produces a large diff that hides the other
> changes.
> Second is mempcpy support. I also had some patches that do it, along with a
> better memcpy; I could resend these. As it stands this one looks reasonable,
> but we could do better: since mempcpy is rarely used, it should just set up
> the return value and then jump into memcpy.
> Third is the ifunc selection. The problem is that what you do here is
> wrong. I had a comment on my todo list: fix the ssse3 memcpy and remove the
> ifunc hack.
> There were some problems on atom that I don't recall, but when I look at
> its graph, an sse2 implementation looks better until around 400 bytes.
> When I tested it on around half of the applications on core2, memcpy_ssse3
> was slower than even memcpy_sse2, so I wrote a separate patch to fix that
> performance regression.
> So this is more theoretical: while ssse3 is faster on longer inputs, sse2
> and sse2_unaligned are faster on shorter inputs. Changing it now would help
> applications that use mainly long inputs but harm others.
> The __memcpy/mempcpy cases there should just be deleted: we set that bit
> only for i3/i5/i7, where we also set Fast_Unaligned_Load, so these are
> never used. memmove is a similar case, as can be seen when checking my
> patch that implements the unaligned variant.
> Finally, when there is no ssse3, memcpy_sse2_unaligned is again faster, as
> the memcpy_sse2 name is a lie: it doesn't do sse2 moves, only 8-byte ones,
> which makes it around 30% slower on larger inputs on the Phenom II that I
> tested, and also slower on a gcc workload. Again, I wrote a patch that
> fixes that by adding a variant that does sse2 loads/stores with shifts.
> Then we could drop the sse2 default.
> I did a quick retest here with this profiler, where for simplicity I used
> the memmove variants of the ssse3 routines.
I will hold off my memcpy patch until the SSSE3 issue is sorted out.