This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: New x86-64 memcpy

From: Rene Rebe <rene at exactcode dot de>
To: libc-alpha at sourceware dot org
Cc: "Menezes, Evandro" <evandro dot menezes at amd dot com>, "Meissner, Michael" <michael dot meissner at amd dot com>, "H. J. Lu" <hjl at lucon dot org>
Date: Wed, 21 Feb 2007 16:23:24 +0100
Subject: Re: New x86-64 memcpy
References: <1449F58C868D8D4E9C72945771150BDF0173A222@SAUSEXMB1.amd.com>

On Saturday 17 February 2007 00:38:46 Menezes, Evandro wrote:
> I implemented a new version of memcpy for x86-64 that provides an overall performance improvement over the current one on both AMD and Intel processors.
> 
> It has several algorithms tuned for specific block size ranges, considering the sizes of the cache subsystems.  For instance, making use of repeated string instructions, software prefetching and streaming stores.
> 
> As it uses several algorithms depending on the block size, the code is fairly long.  But given that ld.so doesn't really need as many algorithms, at build-time a specialized version for ld.so has only a handful of worthy algorithms.
> 
> In addition to the source-code patches, I also attached the resulting data obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz Core2 with DDR2-533.  The file memcpy-opteron-old.txt has the original output of string/test-memcpy on the Athlon64 system and the file memcpy-opteron-new.txt the output using the new routine.  The files memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results but on the Core2 system.  
> 
> I also plotted the performance of the new routine relative to the old one (where a ratio of 1 stands for performance parity and >1 for performance improvement) in memcpy-opteron-new-memcpy-opteron-old.png for the Athlon64 system and in memcpy-core2-new-memcpy-core2-old.png for the Core2 system.  
> 
> Because Core2's time-stamp counter is driven by the front-side bus clock, for some tiny blocks it may not be incremented, resulting in a count difference of zero, making such performance measurements problematic.
> 
> I separated the source-code patches in two, one containing the changes to memcpy.S et al, memcpy.diff:
> 
> 2007-02-16 Evandro Menezes <evandro.menezes@amd.com>
> 
> 	* sysdeps/x86_64/memcpy.S: new code to handle more block size ranges.
> 	* sysdeps/x86_64/mempcpy.S: modified macro definition.
> 
> And another with the additions to detect the sizes of the caches, rtld.diff:
> 
> 2007-02-16 Evandro Menezes <evandro.menezes@amd.com>
> 
> 	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: included new file.
> 	* sysdeps/x86_64/dl-machine.h: added code to detect cache sizes.
> 	* sysdeps/x86_64/dl-procinfo.c: new file.
> 	* sysdeps/x86_64/elf/rtld-global-offsets.sym: ditto.
> 	* sysdeps/x86_64/Makefile: added rtld-global-offsets.sym.
> 
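
For readers following along, the size-based dispatch described in the quoted message could be sketched roughly as below. This is not taken from the patch: the names pick_strategy and sketch_memcpy, the 64-byte cutoff, the half-of-cache thresholds, and the mapping of each technique to a size range are all assumptions for illustration. The real code paths live in sysdeps/x86_64/memcpy.S, and the cache sizes are passed in here as plain parameters rather than detected at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strategy names; which technique the patch uses for which
   size range is an assumption of this sketch. */
enum copy_strategy {
    COPY_SMALL,      /* tiny blocks: plain loads and stores */
    COPY_REP_STRING, /* blocks that fit in L1: repeated string instructions */
    COPY_PREFETCH,   /* blocks that fit in L2: software prefetching */
    COPY_STREAMING   /* blocks larger than the caches: streaming stores */
};

/* Pick a strategy from the block size and the (assumed known) cache sizes.
   All thresholds are illustrative, not the patch's tuned values. */
enum copy_strategy pick_strategy(size_t n, size_t l1_size, size_t l2_size)
{
    if (n < 64)
        return COPY_SMALL;
    if (n <= l1_size / 2)
        return COPY_REP_STRING;
    if (n <= l2_size / 2)
        return COPY_PREFETCH;
    return COPY_STREAMING;
}

/* Dispatch wrapper.  Every branch falls back to the C library's memcpy,
   standing in for the specialized assembly loops. */
void *sketch_memcpy(void *dst, const void *src, size_t n,
                    size_t l1_size, size_t l2_size)
{
    if (n == 0)
        return dst;
    assert(dst != NULL && src != NULL);

    switch (pick_strategy(n, l1_size, l2_size)) {
    case COPY_SMALL:
    case COPY_REP_STRING:
    case COPY_PREFETCH:
    case COPY_STREAMING:
        break; /* a real implementation would branch to distinct loops here */
    }
    return memcpy(dst, src, n);
}
```

With an assumed 64 KiB L1 and 1 MiB L2, a 4 KiB copy would select COPY_REP_STRING and a 4 MiB copy would select COPY_STREAMING. A reduced build such as the one mentioned for ld.so would presumably keep only a few of these branches.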

Apart from some whitespace issues, +1 from me.

Is there still any reason to hold this off?

Yours,

-- 
  René Rebe - ExactCODE GmbH - Europe, Germany, Berlin
  http://exactcode.de | http://t2-project.org | http://rene.rebe.name
  +49 (0)30 / 255 897 45

References:
- New x86-64 memcpy
  - From: Menezes, Evandro
