This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: PATCH: Optimized memset for x86-64


Hi H.J, Ulrich,

>>I also plan to post results on the AMD Barcelona processor soon.
>> I plan to fix the issues pointed out by Ulrich in AMD's previous
>> submission and add an AMD path that addresses the performance issues
>> noted above.

I have tested the memset posted by H.J on AMD's Barcelona processor. I
have also cleaned up the memset tuned for AMD Barcelona as per the
review done by Ulrich as submitted at
http://sources.redhat.com/ml/libc-alpha/2007-08/msg00054.html (attached
patch 001-memset-amd.diff) and bootstrapped it on AMD64. I compared the
performance of H.J's memset, AMD's memset and the original memset
currently in glibc on Barcelona; the data is attached in
memset_perf_data_comp.txt.

In order to come up with a blended memset for x86-64, it would be useful
to discuss the performance on AMD and Intel hardware and agree on the
design decisions of the common code path.

H.J's memset uses an integer jmp table for blocks of 2 to 144 bytes.
This is a common blended code path for all x86-64 processors. Looking at
the performance in this range, here are some of the key observations
from memset_perf_data_comp.txt:

- At 1 byte, AMD's memset is ~23% slower than H.J's or the original.
- Between 2B and 43B, H.J's memset and AMD's memset are on par.
- Between 64B and 128B, H.J's memset is 8% to 21% slower at most block
  sizes.
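
To make the jmp table idea concrete for the discussion, here is a
minimal C sketch of size-indexed dispatch for small blocks. It is not
taken from either patch: the name small_memset and the 8 byte limit
SMALL_MAX are made up for illustration (the table discussed above covers
2 to 144 bytes and is written in assembly). A dense switch like this is
usually, though not necessarily, compiled into a jump table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit for this sketch; the table in the mail spans 2..144 bytes. */
enum { SMALL_MAX = 8 };

/* Set n <= SMALL_MAX bytes by jumping to the case for n and falling
   through the remaining stores, so no loop counter is needed. */
static void *small_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;

    assert(n <= SMALL_MAX);
    switch (n) {
    case 8: p[7] = b; /* fall through */
    case 7: p[6] = b; /* fall through */
    case 6: p[5] = b; /* fall through */
    case 5: p[4] = b; /* fall through */
    case 4: p[3] = b; /* fall through */
    case 3: p[2] = b; /* fall through */
    case 2: p[1] = b; /* fall through */
    case 1: p[0] = b; /* fall through */
    case 0: break;
    }
    return dst;
}
```

The same table can serve as the epilogue of a larger memset, since the
leftover tail after the main loop is again a small block.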

For a block larger than 144 bytes, H.J's memset aligns the block to 16
bytes and handles the prologue with another integer jmp table. The
prologue is a common blended code path for all x86-64 processors. After
being aligned, blocks larger than 144 bytes can follow an SSE code path
or an integer code path based on what the sysconfig indicates for a
given x86-64 processor. Currently any AMD processor follows the integer
code path and that is the AMD recommended path for memset. Any block
larger than 144 bytes will also reuse the 2 byte to 144 byte jmp table
for the epilogue, irrespective of the x86-64 processor. So the
alignment, prologue and epilogue code are common blended code paths. On
the other hand, AMD's memset aligns a block only if it is at least 512
bytes, and then only to 8 bytes.
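
As a reference point for the alignment discussion, the following C
sketch shows how a block can be split into an unaligned prologue, an
aligned body and an epilogue. The struct and function names are
hypothetical, and treating the epilogue as the remainder modulo the
alignment is an assumption of this sketch; in H.J's memset the epilogue
goes through the 2 byte to 144 byte jmp table and may be longer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a block is divided up. */
struct memset_split {
    size_t prologue; /* bytes stored before the address is aligned */
    size_t body;     /* aligned bytes, a multiple of the alignment */
    size_t epilogue; /* leftover tail bytes */
};

/* align must be a power of two (16 for H.J's path, 8 for AMD's). */
static struct memset_split split_for_alignment(uintptr_t addr, size_t n,
                                               size_t align)
{
    struct memset_split s;
    size_t misalign = (size_t)(addr & (align - 1));
    size_t rest;

    s.prologue = misalign ? align - misalign : 0;
    if (s.prologue > n)
        s.prologue = n;          /* block ends before the boundary */
    rest = n - s.prologue;
    s.epilogue = rest & (align - 1);
    s.body = rest - s.epilogue;
    return s;
}
```

For example, a 200 byte block starting at an address ending in 0x3
splits into a 13 byte prologue, a 176 byte body and an 11 byte epilogue
with 16 byte alignment.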

For blocks larger than 144 bytes, AMD plans to do some analysis to
understand whether the early 16 byte alignment and/or the prologue
and/or epilogue code are contributing to any slowdown in those blocks
or whether AMD's memset needs to be improved.

H.J: Can you clarify how the 144 byte boundary was chosen to end the
integer jmp table and align blocks?
 
For the non-SSE2 code path beyond 144 bytes, we would like to integrate
the code used in AMD's memset (including any improvements we make) that
gives us a measurable speedup on Barcelona.
For example, the use of rep stos between 8KB and 48KB.
Another improvement for us is at blocks larger than the largest cache
size (L2 or L3 if available) (when NOT_IN_GLIBC is defined) or half the
largest cache size (when NOT_IN_GLIBC is not defined). In this range the
sub block that is smaller than the full or half cache size is set with
rep stos and the remaining sub block is set with movnti.
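
To spell out the large block split, here is a small C sketch of one
possible reading of the scheme: the first threshold bytes go to rep stos
and the rest to movnti. The names large_plan and plan_large, and passing
NOT_IN_GLIBC as a runtime flag, are assumptions made for illustration;
the attached patch may compute the split differently.

```c
#include <stddef.h>

/* Hypothetical plan for a block above the cache-size threshold. */
struct large_plan {
    int applies;           /* 0 if n is not above the threshold */
    size_t rep_stos_bytes; /* sub block set with rep stos */
    size_t movnti_bytes;   /* remaining sub block set with movnti */
};

/* largest_cache is the L2 size, or the L3 size if one is present.
   not_in_glibc stands in for whether NOT_IN_GLIBC is defined. */
static struct large_plan plan_large(size_t n, size_t largest_cache,
                                    int not_in_glibc)
{
    size_t threshold = not_in_glibc ? largest_cache : largest_cache / 2;
    struct large_plan p = { 0, 0, 0 };

    if (n > threshold) {
        p.applies = 1;
        p.rep_stos_bytes = threshold;    /* the part that fits in cache */
        p.movnti_bytes = n - threshold;  /* the part that bypasses cache */
    }
    return p;
}
```

With a 2MB largest cache, a 3MB block would be split 2MB/1MB when
NOT_IN_GLIBC is defined and 1MB/2MB when it is not.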

I would appreciate any feedback from the list.

Thanks,
Harsha

Attachment: 001-memset-amd.diff
Description: 001-memset-amd.diff

Attachment: memset_perf_data_comp.txt
Description: memset_perf_data_comp.txt

