This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: New x86-64 memcpy
- From: Rene Rebe <rene at exactcode dot de>
- To: libc-alpha at sourceware dot org
- Cc: "Menezes, Evandro" <evandro dot menezes at amd dot com>, "Meissner, Michael" <michael dot meissner at amd dot com>, "H. J. Lu" <hjl at lucon dot org>
- Date: Wed, 21 Feb 2007 16:23:24 +0100
- Subject: Re: New x86-64 memcpy
- References: <1449F58C868D8D4E9C72945771150BDF0173A222@SAUSEXMB1.amd.com>
On Saturday 17 February 2007 00:38:46 Menezes, Evandro wrote:
> I implemented a new version of memcpy for x86-64 that provides an overall performance improvement over the current one on both AMD and Intel processors.
>
> It has several algorithms tuned for specific block size ranges, taking the sizes of the cache subsystems into account. For instance, it makes use of repeated string instructions, software prefetching and streaming stores.
>
> As it uses several algorithms depending on the block size, the code is fairly long. But since ld.so doesn't really need as many algorithms, a specialized version with only a handful of worthwhile algorithms is built for ld.so.
>
> In addition to the source-code patches, I also attached the resulting data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz Core2 with DDR2-533. The file memcpy-opteron-old.txt has the original output of string/test-memcpy on the Athlon64 system and the file memcpy-opteron-new.txt the output using the new routine. The files memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results but on the Core2 system.
>
> I also plotted the performance of the new routine relative to the old one (where a ratio of 1 stands for performance parity and >1 for performance improvement) in memcpy-opteron-new-memcpy-opteron-old.png for the Athlon64 system and in memcpy-core2-new-memcpy-core2-old.png for the Core2 system.
>
> Because Core2's time-stamp counter is driven by the front-side bus clock, for some tiny blocks it may not be incremented, resulting in a count difference of zero, making such performance measurements problematic.
>
> I separated the source-code patches into two, one containing the changes to memcpy.S et al, memcpy.diff:
>
> 2007-02-16 Evandro Menezes <evandro.menezes@amd.com>
>
> * sysdeps/x86_64/memcpy.S: new code to handle more block size ranges.
> * sysdeps/x86_64/mempcpy.S: modified macro definition.
>
> And another with the additions to detect the sizes of the caches, rtld.diff:
>
> 2007-02-16 Evandro Menezes <evandro.menezes@amd.com>
>
> * sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: included new file.
> * sysdeps/x86_64/dl-machine.h: added code to detect cache sizes.
> * sysdeps/x86_64/dl-procinfo.c: new file.
> * sysdeps/x86_64/elf/rtld-global-offsets.sym: ditto.
> * sysdeps/x86_64/Makefile: added rtld-global-offsets.sym.
>
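For readers skimming the thread, the idea of choosing a copy algorithm by block size relative to the cache sizes might be sketched roughly as below. This is not the code from memcpy.diff, which is written in assembly. The names `pick_strategy`, `struct cache_info` and the `COPY_*` values are hypothetical, and the 64-byte cutoff and the half-cache thresholds are illustrative assumptions, not the values used by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strategy names, loosely following the techniques
   mentioned in the quoted message. */
enum copy_strategy {
    COPY_TINY,       /* short blocks: simple load/store sequence */
    COPY_IN_CACHE,   /* fits comfortably in L1: e.g. repeated string instructions */
    COPY_PREFETCH,   /* fits comfortably in L2: e.g. software prefetching */
    COPY_STREAMING   /* larger than the caches: e.g. streaming stores */
};

/* Hypothetical container for cache sizes detected at startup, in bytes. */
struct cache_info {
    size_t l1_data;
    size_t l2;
};

/* Pick a strategy for a block of n bytes.
   All thresholds here are assumptions chosen for illustration only. */
enum copy_strategy pick_strategy(size_t n, const struct cache_info *ci)
{
    assert(ci != NULL && ci->l1_data <= ci->l2);

    if (n < 64)
        return COPY_TINY;
    if (n <= ci->l1_data / 2)
        return COPY_IN_CACHE;
    if (n <= ci->l2 / 2)
        return COPY_PREFETCH;
    return COPY_STREAMING;
}
```

With a 64 KiB L1 data cache and a 1 MiB L2, this sketch would classify a 4 KiB copy as in-cache and a 4 MiB copy as streaming. A reduced build for ld.so could presumably keep only a subset of these cases.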
Apart from some whitespace issues, +1 from me.
Is there still any reason to hold this off?
Yours,
--
René Rebe - ExactCODE GmbH - Europe, Germany, Berlin
http://exactcode.de | http://t2-project.org | http://rene.rebe.name
+49 (0)30 / 255 897 45