This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] faster memcpy on x64.
- From: Andreas Jaeger <aj at suse dot com>
- To: libc-alpha at sourceware dot org, "H.J. Lu" <hjl dot tools at gmail dot com>
- Date: Thu, 09 May 2013 16:27:29 +0200
- Subject: Re: [PATCH] faster memcpy on x64.
- References: <20130427221620 dot GA16537 at domone dot kolej dot mff dot cuni dot cz>
Intel and AMD developers, do you have any feedback on the performance of
this patch? Please provide it by Monday the 13th; otherwise, we have
waited long enough on this one and I think it can go in now.
Ondrej, consider this approved (with the small change below) and commit
it on the 14th unless somebody vetoes.
On 04/28/2013 12:16 AM, Ondřej Bílka wrote:
Hi,
I spent the last few weeks analyzing memcpy and memset, and I have
better implementations than the current ones. This patch covers memcpy.
Benchmark results are at
http://kam.mff.cuni.cz/~ondra/memcpy_profile.html
or archived at
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
I tried to adapt this implementation for memmove and found that the
additional cost is close to zero when the buffers do not overlap.
So this implementation can also be aliased to memmove.
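The aliasing works because the only extra cost over memcpy is a cheap
overlap check that picks the copy direction. A minimal portable C sketch
of that idea (byte-at-a-time loops standing in for the actual assembly;
the function name is hypothetical):

```c
#include <stddef.h>

/* Sketch: a memmove built on a forward memcpy-style loop.  When dst
   is below src or the ranges do not overlap, the forward copy is
   safe; only dst landing inside [src, src+n) forces a backward copy.  */
static void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d == s || n == 0)
        return dst;

    if (d < s || d >= s + n) {
        /* Non-overlapping or dst before src: forward copy, the same
           code path a memcpy would take.  */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst overlaps the tail of src: copy backwards.  */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

In the non-overlapping case the branch is perfectly predictable, which
is why the measured extra cost is close to zero.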
The important part there is the test of memcpy in a hooked gcc, which
shows a small but real speedup. memcpy_new 1) is faster on newer
processors, while memcpy_new_small is faster on slower ones.
Could we test 2) on a wider range of use cases and report the results?
Here we run into the fact that strings in practice are small, so we
mostly pay the latency of fetching the data.
The microbenchmarks tests look much better.
The main speedup comes from avoiding computed loops and simplifying the
control flow for better speculative execution. 1)
This gives a 20% speedup for 32-1000 byte strings.
Second, the loop I use is on most architectures asymptotically faster
than the gcc one for data in the L1, L2, or L3 cache.
When the data is in main memory, memory becomes the bottleneck and the
choice of implementation can gain at most 1%.
I also tested an AVX version, which is slower on current processors
because it is faster to load the high and low halves separately.
1) Except on core2 and athlon, where I need an even simpler control flow
(memcpy_new_small) to get that speedup.
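A rough illustration of what avoiding a computed loop buys: instead of a
jump table indexed by length, short copies can be handled with a couple
of possibly-overlapping unaligned loads and simple branches. This
portable C sketch uses fixed-size 8-byte memcpy calls (which compile to
single unaligned loads/stores on x86-64, much like the SSE2 movdqu
instructions in the patch); the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy any n in 8..16 bytes with exactly two 8-byte
   loads/stores.  The second pair overlaps the first when n < 16,
   so no length-indexed computed jump is needed, and the control
   flow stays trivially predictable for speculative execution.  */
static void copy_8_to_16(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);
    memcpy(&tail, src + n - 8, 8);   /* overlaps head when n < 16 */
    memcpy(dst, &head, 8);
    memcpy(dst + n - 8, &tail, 8);
}
```

The real implementation applies the same overlapping-load trick with
16-byte SSE2 registers across several size classes.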
I attached the file from which I generated this patch. There are a few
mistakes made by gcc; I could post a diff against the vanilla version.
I have not tried to optimize for atom yet, so I keep the ifunc for it.
This passes the testsuite. OK for 2.18?
Ondra
1) File variant/memcpy_new_small.s in 2)
2) http://kam.mff.cuni.cz/~ondra/memcpy_profile27_04_13.tar.bz2
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: New file.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add
__memcpy_sse2_unaligned ifunc selection.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
Add memcpy-sse2-unaligned.S.
sysdeps/x86_64/multiarch/ifunc-impl-list.c: __memcpy_sse2_unaligned.
The last entry should be:
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add: __memcpy_sse2_unaligned.
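For context, the ifunc selection these entries refer to picks an
implementation once at load time via a resolver function. A toy C
version of the mechanism (GCC/Clang on Linux ELF; the names and the
feature check are hypothetical stand-ins, not glibc's actual selector):

```c
#include <string.h>

/* Two candidate implementations; glibc's real selector chooses
   between variants such as __memcpy_ssse3 and
   __memcpy_sse2_unaligned based on CPU features.  */
static void *memcpy_generic(void *d, const void *s, size_t n)
{ return memcpy(d, s, n); }

static void *memcpy_unaligned(void *d, const void *s, size_t n)
{ return memcpy(d, s, n); }

/* The resolver runs once during relocation and returns the chosen
   implementation.  A real resolver would test CPUID feature bits.  */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
    int cpu_has_fast_unaligned = 1;  /* stand-in for a CPUID check */
    return cpu_has_fast_unaligned ? memcpy_unaligned : memcpy_generic;
}

/* All callers bind to my_memcpy; the dynamic linker patches in the
   resolver's choice, so there is no per-call dispatch cost.  */
void *my_memcpy(void *d, const void *s, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));
```

The ifunc-impl-list.c change simply registers the new variant so the
testsuite exercises every implementation, not just the selected one.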
thanks,
Andreas
--
Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126