This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



New x86-64 memcpy


I implemented a new version of memcpy for x86-64 that provides an overall performance improvement over the current one on both AMD and Intel processors.

It uses several algorithms, each tuned for a specific range of block sizes and chosen with the sizes of the cache subsystems in mind.  Among the techniques used are repeated string instructions, software prefetching and streaming stores.
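
To illustrate the idea of choosing an algorithm by block size, here is a minimal sketch in C.  It is not the code in memcpy.S: the strategy names, the 64-byte cutoff and the "half of the cache" limits are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical strategy names; they mirror the techniques named in the
   message, not the labels used in memcpy.S.  */
enum copy_strategy
{
  COPY_SMALL,       /* plain loads and stores */
  COPY_REP_STRING,  /* repeated string instructions (rep movs) */
  COPY_PREFETCH,    /* copy loop with software prefetching */
  COPY_STREAMING    /* streaming (non-temporal) stores */
};

/* Pick a strategy for a block of N bytes.  The 64-byte cutoff and the
   "half of the cache" limits are assumptions made for illustration.  */
enum copy_strategy
choose_strategy (size_t n, size_t l1d_size, size_t l2_size)
{
  if (n < 64)
    return COPY_SMALL;
  if (n <= l1d_size / 2)
    return COPY_REP_STRING;
  if (n <= l2_size / 2)
    return COPY_PREFETCH;
  return COPY_STREAMING;
}
```

With a 64KB L1 data cache and a 1MB L2 cache, this sketch would send a 4KB block to the string-instruction path and an 8MB block to the streaming-store path.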

Because it uses several algorithms depending on the block size, the code is fairly long.  Since ld.so doesn't really need that many, a specialized version with only a handful of the most worthwhile algorithms is built for ld.so.

In addition to the source-code patches, I attached the data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz Core2 with DDR2-533.  The file memcpy-opteron-old.txt has the original output of string/test-memcpy on the Athlon64 system and the file memcpy-opteron-new.txt has the output using the new routine.  The files memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results for the Core2 system.

I also plotted the performance of the new routine relative to the old one (where a ratio of 1 stands for performance parity and >1 for performance improvement) in memcpy-opteron-new-memcpy-opteron-old.png for the Athlon64 system and in memcpy-core2-new-memcpy-core2-old.png for the Core2 system.  

Because Core2's time-stamp counter is driven by the front-side bus clock, it may not be incremented while a tiny block is being copied.  The count difference is then zero, which makes performance measurements for such blocks problematic.
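
One common way around a zero count difference is to time a batch of copies and divide by the batch size.  The sketch below shows that approach; it is not what string/test-memcpy does.  The function name cycles_per_copy and its parameters are hypothetical, and __rdtsc is the compiler intrinsic from <x86intrin.h> (GCC and Clang on x86).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

/* Hypothetical helper: average time-stamp counter ticks per copy of LEN
   bytes, measured over ITERATIONS back-to-back copies so that the
   counter has time to advance even if it ticks slowly.  */
double
cycles_per_copy (void *dst, const void *src, size_t len,
                 unsigned int iterations)
{
  if (iterations == 0)
    return 0.0;

  uint64_t start = __rdtsc ();
  for (unsigned int i = 0; i < iterations; i++)
    {
      memcpy (dst, src, len);
      /* Compiler barrier so the copies are not optimized away.  */
      __asm__ volatile ("" : : "r" (dst) : "memory");
    }
  uint64_t stop = __rdtsc ();

  return (double) (stop - start) / iterations;
}
```

The result is an average, so it hides per-call variation, and it includes the loop overhead; for tiny blocks that overhead can be a noticeable part of the figure.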

I separated the source-code patches into two, one containing the changes to memcpy.S et al, memcpy.diff:

2007-02-16 Evandro Menezes <evandro.menezes@amd.com>

	* sysdeps/x86_64/memcpy.S: new code to handle more block size ranges.
	* sysdeps/x86_64/mempcpy.S: modified macro definition.

And another with the additions to detect the sizes of the caches, rtld.diff:

2007-02-16 Evandro Menezes <evandro.menezes@amd.com>

	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: included new file.
	* sysdeps/x86_64/dl-machine.h: added code to detect cache sizes.
	* sysdeps/x86_64/dl-procinfo.c: new file.
	* sysdeps/x86_64/elf/rtld-global-offsets.sym: ditto.
	* sysdeps/x86_64/Makefile: added rtld-global-offsets.sym.
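
As a rough illustration of how cache sizes can be read at run time, here is a sketch in C using the CPUID extended leaves.  It is not the code in rtld.diff; the function names are hypothetical, and it assumes the documented layout of leaves 0x80000005 (ECX bits 31:24, L1 data cache size in KB, AMD) and 0x80000006 (ECX bits 31:16, L2 cache size in KB).  __get_cpuid comes from the <cpuid.h> header shipped with GCC and Clang.

```c
#include <cpuid.h>
#include <stddef.h>
#include <stdint.h>

/* L1 data cache size in bytes from ECX of leaf 0x80000005
   (bits 31:24 hold the size in KB).  */
size_t
l1d_size_from_ecx (uint32_t ecx)
{
  return (size_t) (ecx >> 24) * 1024;
}

/* L2 cache size in bytes from ECX of leaf 0x80000006
   (bits 31:16 hold the size in KB).  */
size_t
l2_size_from_ecx (uint32_t ecx)
{
  return (size_t) (ecx >> 16) * 1024;
}

/* Hypothetical helper: store both sizes and return 1 if the extended
   leaves are available, otherwise return 0.  */
int
query_cache_sizes (size_t *l1d, size_t *l2)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid (0x80000005, &eax, &ebx, &ecx, &edx))
    return 0;
  *l1d = l1d_size_from_ecx (ecx);

  if (!__get_cpuid (0x80000006, &eax, &ebx, &ecx, &edx))
    return 0;
  *l2 = l2_size_from_ecx (ecx);

  return 1;
}
```

On Intel processors leaf 0x80000005 is reserved, so the L1 size would come back as zero there and a caller would need another source for it, such as the cache descriptors of leaf 2.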

-- 
_______________________________________________________
Evandro Menezes               AMD            Austin, TX

Attachments:
	memcpy-opteron-new-memcpy-opteron-old.png
	memcpy-core2-new-memcpy-core2-old.png
	rtld.diff.bz2
	memcpy.diff.bz2
	test-memcpy.tar.bz2

