This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] [BZ #18880] Wrong selector in x86_64/multiarch/memcpy.S


On Mon, Aug 31, 2015 at 08:41:44AM -0700, H.J. Lu wrote:
> > On Sat, Aug 29, 2015 at 12:52 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Fri, Aug 28, 2015 at 06:05:53AM -0700, H.J. Lu wrote:
> >> For x86-64 memcpy/mempcpy, we choose the best implementation in the
> >> following order:
> >>
> >> 1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
> >> 2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
> >> 3. __memcpy_sse2 if SSSE3 isn't available.
> >> 4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
> >> 5. __memcpy_ssse3
> >>
> >> In libc.a and ld.so, we choose __memcpy_sse2_unaligned, which is
> >> optimized for current Intel and AMD x86-64 processors.
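
For reference, the priority order above expressed as a C sketch.  The
real selector is assembly in sysdeps/x86_64/multiarch/memcpy.S, and the
placeholder flags below stand in for glibc's cpu-feature bits, so this
is illustrative only:

  #include <stddef.h>

  typedef void *memcpy_fn (void *, const void *, size_t);

  extern memcpy_fn __memcpy_avx_unaligned, __memcpy_sse2_unaligned,
                   __memcpy_sse2, __memcpy_ssse3_back, __memcpy_ssse3;

  /* Placeholders; glibc reads these from its cpu-feature bits.  */
  static int avx_fast_unaligned_load, fast_unaligned_load;
  static int ssse3, fast_copy_backward;

  static memcpy_fn *
  select_memcpy (void)
  {
    if (avx_fast_unaligned_load)
      return __memcpy_avx_unaligned;
    if (fast_unaligned_load)
      return __memcpy_sse2_unaligned;
    if (!ssse3)
      return __memcpy_sse2;
    if (fast_copy_backward)
      return __memcpy_ssse3_back;
    return __memcpy_ssse3;
  }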
> >>
> >> OK for master?
> >>
> > This patch has several unrelated changes. The first is moving files
> > to make the new default, which looks OK but produces a large diff
> > that hides the other changes.
> >
> > The second is mempcpy support. I also had some patches that do this
> > along with a better memcpy; I could resend them. This one looks
> > reasonable, but we could do better: as mempcpy is rarely used, it
> > should just set up the return value and then jump into the shared
> > memcpy body.
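
Semantically mempcpy is just memcpy with a different return value, so
at the assembly level the mempcpy entry can compute dest + n into the
return register and jump into the shared memcpy body instead of
duplicating it.  A minimal C sketch of the semantics only (my_mempcpy
is a hypothetical name):

  #include <string.h>

  /* In assembly this is one lea of dest + n into the return register
     followed by a jump into the memcpy loop, which must not clobber
     that register.  */
  void *
  my_mempcpy (void *dest, const void *src, size_t n)
  {
    return (char *) memcpy (dest, src, n) + n;
  }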
> >
> > The third is the ifunc selection. The problem is that what you do
> > here is wrong. On my todo list I have the entry: fix the ssse3
> > memcpy and remove the ifunc hack.
> >
> > There were some problems on Atom that I don't recall, but when I
> > look at the graphs an sse2 implementation looks better up to around
> > 400 bytes:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_gcc/result.html
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/atom/memcpy_profile/results_rand/result.html
> >
> > When I tested it, on around half of the applications on core2
> > memcpy_ssse3 was slower than even memcpy_sse2, so I wrote a separate
> > patch to fix that performance regression. This is therefore mostly
> > theoretical: while ssse3 is faster on longer inputs, sse2 and
> > sse2_unaligned are faster on shorter ones. Changing it now would
> > help applications that mainly use long inputs but harm the others.
> >
> > The __memcpy_ssse3_back/__mempcpy_ssse3_back variants should just
> > be deleted: we set Fast_Copy_Backward only for i3/i5/i7, where we
> > also set Fast_Unaligned_Load, so they are never selected. The same
> > goes for memmove once you consider my patch that implements an
> > unaligned memmove.
> >
> > Finally, when there is no ssse3, memcpy_sse2_unaligned is again
> > faster, as the memcpy_sse2 name lies: it doesn't do sse2 moves, only
> > 8-byte ones, which makes it around 30% slower on larger inputs on
> > the Phenom II that I tested, and also slower on a gcc workload.
> > Again, I wrote a patch that fixes this by adding a variant that does
> > sse2 loads/stores with shifts; then we could drop the sse2 default.
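
To illustrate "sse2 loads/stores with shifts": instead of copying 8
bytes at a time, do aligned 16-byte loads from the source and stitch
each aligned store together from two neighbouring loads with byte
shifts.  A minimal intrinsics sketch for one fixed misalignment of 4
bytes, assuming n is a positive multiple of 16 and dst is 16-byte
aligned; a real implementation would dispatch on the offset, handle
heads and tails, and live in assembly, so none of this is the actual
patch:

  #include <emmintrin.h>  /* SSE2 intrinsics */
  #include <stddef.h>
  #include <stdint.h>

  /* src is assumed to be 4 bytes past a 16-byte boundary.  The loads
     are 16-byte aligned and therefore never cross a page boundary, so
     reading slightly past src + n cannot fault.  */
  static void
  copy_shifted_by_4 (uint8_t *dst, const uint8_t *src, size_t n)
  {
    const __m128i *s = (const __m128i *) (src - 4);  /* aligned */
    __m128i prev = _mm_load_si128 (s);
    for (size_t i = 0; i < n; i += 16)
      {
        __m128i next = _mm_load_si128 (++s);
        /* Bytes 4..15 of prev followed by bytes 0..3 of next.  */
        __m128i out = _mm_or_si128 (_mm_srli_si128 (prev, 4),
                                    _mm_slli_si128 (next, 12));
        _mm_store_si128 ((__m128i *) (dst + i), out);
        prev = next;
      }
  }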
> >
> > I did a quick retest here:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old.html
> > using this profiler, where for simplicity I used the memmove variants
> > of the ssse3 routines:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_old290815.tar.bz2
> >
> >
> 
> I will hold off on my memcpy patch until the SSSE3 issue is sorted out.
> 
> Thanks.
> 
Thanks. I am back from vacation, so I will send what I have. I had some
ideas to improve it further, but it turned out that conditional moves
don't work for memcpy.

