This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
RE: PATCH: Optimized memset for x86-64
- From: "Jagasia, Harsha" <harsha dot jagasia at amd dot com>
- To: "H.J. Lu" <hjl at lucon dot org>
- Cc: libc-alpha at sourceware dot org, drepper at redhat dot com
- Date: Fri, 14 Dec 2007 16:14:10 -0600
- Subject: RE: PATCH: Optimized memset for x86-64
Hi,
>>
>> I have tested the memset posted by H.J on AMD's K8 processor. The graph
>> is attached. The baseline is the original routine in glibc. The
>> performance of H.J's memset is plotted as a percentage of the
>> performance of the original memset.
>>
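
As a rough illustration of the comparison described above, a minimal C sketch of such a measurement could look like the following. It is not the harness used for the attached graph: the names candidate_fill, time_fill and relative_percent are hypothetical, the byte loop merely stands in for a new routine, and the metric (baseline time divided by candidate time, times 100) is an assumption about how "percentage of the original" is computed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so the baseline (libc memset) and a candidate
   can be timed by the same function. */
typedef void *(*fill_fn)(void *, int, size_t);

/* Hypothetical candidate: a plain byte loop standing in for a new routine. */
void *candidate_fill(void *dst, int c, size_t n)
{
    /* volatile keeps the compiler from folding the loop into a memset call */
    volatile unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)c;
    return dst;
}

/* Wall-clock seconds for `iters` fills of `n` bytes starting at `dst`.
   Pass a deliberately offset pointer to measure a misaligned case. */
double time_fill(fill_fn fn, void *dst, size_t n, int iters)
{
    fill_fn volatile f = fn;  /* volatile pointer: the calls cannot be elided */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        f(dst, i & 0xff, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Candidate performance as a percentage of the baseline (assumed metric):
   100 means parity, above 100 means the candidate is faster. */
double relative_percent(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0 && candidate_s > 0.0);
    return 100.0 * baseline_s / candidate_s;
}
```

One would call time_fill once with memset and once with the candidate for each block size and alignment of interest, then pass the two times to relative_percent. Real measurements would also need warm-up runs and repeated trials, which this sketch leaves out.
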
>> Here are some of the key observations:
>>
>> - For small blocks (up to 115 bytes), H.J's memset is at par with the
>> original memset.
>>
>
>Can you clarify what you meant by "at par"? Up to 100 bytes, the new one
>is much faster, by up to 50%.
>
Sorry for not replying earlier. I meant that it is at par or faster.
>> - For medium block sizes (between 116 bytes and the largest cache size),
>> there are several misaligned and aligned blocks that underperform the
>> original memset by 10% to 20%.
>>
>> I plan to investigate why medium blocks perform poorly and will report
>> on this soon.
>>
>> - For very large block sizes (larger than the largest cache size), the
>> performance is at par. The relative improvement seen between 128KB and
>> 512KB is because the original memset is underutilizing the cache by
>> doing streaming stores too early.
>>
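
To make the point about streaming stores concrete, here is a minimal sketch of a size-based switch between ordinary stores and non-temporal (streaming) stores. It is not the code of either memset under discussion: fill_with_threshold, use_streaming and STREAM_THRESHOLD are hypothetical names, and the 1 MiB cutoff is an assumed value, not one taken from glibc or from the measurements. The idea is only that a cutoff set below the cache size sends blocks that would still fit in cache around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed cutoff for illustration only; a real routine would derive it
   from the cache sizes of the processor it runs on. */
#define STREAM_THRESHOLD ((size_t)1 << 20)

/* Nonzero when a block of n bytes should bypass the cache. */
int use_streaming(size_t n, size_t threshold)
{
    return n >= threshold;
}

void *fill_with_threshold(void *dst, int c, size_t n, size_t threshold)
{
#if defined(__SSE2__)
    if (use_streaming(n, threshold)) {
        unsigned char *p = dst;

        /* Ordinary stores up to the next 16-byte boundary. */
        size_t head = (size_t)(-(uintptr_t)p & 15);
        if (head > n)
            head = n;
        memset(p, c, head);
        p += head;
        n -= head;

        /* Non-temporal 16-byte stores for the aligned middle part. */
        __m128i v = _mm_set1_epi8((char)c);
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128((__m128i *)p, v);
        _mm_sfence();  /* order the streaming stores before later stores */

        /* Ordinary stores for the remaining tail (fewer than 16 bytes). */
        memset(p, c, n);
        return dst;
    }
#else
    (void)threshold;  /* no SSE2: always use the regular path */
#endif
    return memset(dst, c, n);
}
```

A caller would normally pass STREAM_THRESHOLD as the last argument. Lowering it makes the routine switch to streaming stores for smaller blocks, which is the kind of "too early" switch described above.
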
>> As is, H.J's routine hurts performance significantly on K8 for medium
>> blocks. I also plan to post results on the AMD Barcelona processor soon.
>> I plan to fix the issues pointed out by Ulrich in AMD's previous
>> submission and add an AMD path that addresses the performance issues
>> noted above.
>
>We are investigating misaligned and medium blocks. We can compare
>performance later.
I posted some data on Barcelona as well in a follow-on thread.
>
>
>H.J.
>