This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Hi, Jakub.

> > They had 8 bytes each in order to allow direct comparisons with the count
> > in a register without having to load the value. Even if in memcpy they
> > can be used as 4-byte variables, I have other routines that would benefit
> > from them being 8 bytes long.
>
> In the last round of routines you sent I haven't seen that, but sure, if
> some var has justification for being 64-bit, so be it. The important
> thing is just (%rip) addressing.

Got it. It actually made the patch much leaner, as it doesn't touch the RTLD stuff anymore.

> > I guess that using the red zone is better. As the routine has several
> > exit points to improve performance, after each one new CFI directives
> > would have to be added, which complicates maintaining the code.
>
> Even with red zone you need some CFI directives (which say where
> %r12/%r13/%r14 have been saved or cfi_restore for them), but don't need
> any CFA adjustments.

I chose to use the red zone with the CFI directives.

> > I'll double-check that RDI has the expected value always. Otherwise, I'll
> > just use an entry in the red zone.
>
> I believe so. L(1{,a,b,c,d,loop}) always increment %rdi by the size they
> stored into (%rdi). All other ret's are preceded by jnz L(1), which relies
> on %rdi pointing after the last byte stored.

Indeed. The tail code is a tad harder to read though.

Again, in addition to the source-code patches, I have attached the results obtained on a 2.4GHz Athlon64 with DDR2-800 RAM and on a 3GHz Core2 with DDR2-533. The file memcpy-opteron-old.txt has the original output of string/test-memcpy on the Athlon64 system, and the file memcpy-opteron-new.txt has the output using the new routine. The files memcpy-core2-old.txt and memcpy-core2-new.txt contain the same results for the Core2 system.
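To make the %rdi invariant from the quoted exchange concrete, here is a minimal C sketch. It is not the assembly from the attached patch: the name sketch_mempcpy and the 8-byte block size are assumptions chosen for illustration. It only shows the property being discussed, namely that every store advances the destination pointer by exactly the amount written, so the pointer ends up just past the last byte stored and can be returned directly, as a mempcpy-style routine would.

```c
#include <stddef.h>

/* Hypothetical illustration only, not the routine from the patch.
   Copies n bytes and returns a pointer just past the last byte written. */
void *sketch_mempcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Block loop: after each 8-byte block, d advances by exactly 8,
       mirroring "increment %rdi by the size stored into (%rdi)". */
    while (n >= 8) {
        for (int i = 0; i < 8; i++)
            d[i] = s[i];
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: the same invariant holds one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }

    /* d now points just past the last byte stored. */
    return d;
}
```

Because the invariant holds on every path, each exit point can return the pointer as-is without recomputing dst + n.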
I also plotted the performance of the new routine relative to the old one (where a ratio of 1 stands for performance parity and >1 for performance improvement) in movs-opteron-new-movs-opteron-old.png for the Athlon64 system and in movs-core2-new-movs-core2-old.png for the Core2 system.

2007-05-04  Evandro Menezes  <evandro.menezes@amd.com>

	* sysdeps/x86_64/memcpy.S: New code to handle more block size
	ranges.
	* sysdeps/x86_64/mempcpy.S: Modified macro definition.
	* sysdeps/unix/sysv/linux/x86_64/sysconf.c: Moved code to detect
	cache sizes...
	* sysdeps/x86_64/cacheinfo.c: ... here.
	* sysdeps/x86_64/Makefile: Added cacheinfo.c.

Could you please review it?

Thanks,

-- 
_______________________________________________________
Evandro Menezes
AMD
Austin, TX
Attachment:
movs-core2-new-movs-core2-old-ratio.png
Description: movs-core2-new-movs-core2-old-ratio.png
Attachment:
movs-opteronf-new-movs-opteronf-old-ratio.png
Description: movs-opteronf-new-movs-opteronf-old-ratio.png
Attachment:
memcpy.diff.bz2
Description: memcpy.diff.bz2