This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: PATCH: Optimized memset for x86-64
- From: "H.J. Lu" <hjl at lucon dot org>
- To: "Jagasia, Harsha" <harsha dot jagasia at amd dot com>
- Cc: libc-alpha at sourceware dot org, drepper at redhat dot com
- Date: Mon, 26 Nov 2007 15:25:33 -0800
- Subject: Re: PATCH: Optimized memset for x86-64
- References: <D5B24B5251882048AD03DDFA431BB790012F0504@SAUSEXMB3.amd.com>
On Mon, Nov 26, 2007 at 03:15:54PM -0600, Jagasia, Harsha wrote:
> Hi,
>
> I have tested the memset posted by H.J on AMD's K8 processor. The graph
> is attached. The baseline is the original routine in glibc. The
> performance of H.J's memset is plotted as a percentage of the
> performance of the original memset.
>
> Here are some of the key observations:
>
> - For small blocks (up to 115 bytes), H.J's memset is at par with the
> original memset.
>
Can you clarify what you meant by "at par"? For blocks up to 100 bytes,
the new one is much faster, by up to 50%.
> - For medium block sizes (between 116 and the largest cache size), there
> are several misaligned and aligned blocks that underperform the
> original memset by 10% to 20%.
>
> I plan to investigate why medium blocks perform poorly and will report
> on this soon.
>
> - For very large block sizes (larger than largest cache size), the
> performance is at par. The relative improvement seen between 128KB and
> 512KB is because the original memset is underutilizing the cache by
> doing streaming stores too early.
>
> As is, H.J's routine hurts performance significantly on K8 for medium
> blocks. I also plan to post results on the AMD Barcelona processor soon.
> I plan to fix the issues pointed out by Ulrich in AMD's previous
> submission and add an AMD path that addresses the performance issues
> noted above.
We are investigating misaligned and medium blocks. We can compare
performance later.
H.J.
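
For reference, the kind of comparison discussed in this thread (a candidate
memset reported as a percentage of a baseline, over several block sizes and
alignments) might be sketched roughly as below. This is not the harness used
by either poster. The names memset_fn, align64, relative_perf_pct,
fills_correctly and time_memset are hypothetical, the 64-byte alignment is an
assumed cache-line size, and clock() is a coarse timer, so any numbers it
produces are only indicative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any routine with memset's signature. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Round a pointer up to the next 64-byte boundary (assumed cache-line size). */
static unsigned char *align64(unsigned char *p)
{
    return (unsigned char *)(((uintptr_t)p + 63) & ~(uintptr_t)63);
}

/* Candidate performance as a percentage of the baseline:
   100 means "at par", 150 means 50% faster than the baseline. */
static double relative_perf_pct(double baseline_secs, double candidate_secs)
{
    return 100.0 * baseline_secs / candidate_secs;
}

/* Check that fn fills exactly `size` bytes starting `misalign` bytes past a
   64-byte boundary and leaves the neighbouring bytes alone. Returns 1 if ok. */
static int fills_correctly(memset_fn fn, size_t size, size_t misalign)
{
    size_t total = size + misalign + 64 + 1;
    unsigned char *raw = malloc(total);
    if (raw == NULL)
        return 0;
    memset(raw, 0xAA, total);               /* background pattern */
    unsigned char *dst = align64(raw) + misalign;
    fn(dst, 0x5C, size);

    int ok = 1;
    for (size_t i = 0; i < size; i++)
        if (dst[i] != 0x5C)
            ok = 0;
    if (dst[size] != 0xAA)                  /* byte just past the block */
        ok = 0;
    if (dst > raw && dst[-1] != 0xAA)       /* byte just before the block */
        ok = 0;
    free(raw);
    return ok;
}

/* Time `iters` calls of fn on a `size`-byte block that starts `misalign`
   bytes past a 64-byte boundary. Returns CPU seconds, or -1.0 if the
   allocation fails. */
static double time_memset(memset_fn fn, size_t size, size_t misalign, long iters)
{
    assert(iters > 0);
    unsigned char *raw = malloc(size + misalign + 64);
    if (raw == NULL)
        return -1.0;
    unsigned char *dst = align64(raw) + misalign;
    volatile unsigned char sink = 0;        /* discourages removing the stores */

    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        fn(dst, (int)(i & 0xFF), size);
        if (size > 0)
            sink = dst[size - 1];
    }
    clock_t stop = clock();

    (void)sink;
    free(raw);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To compare two routines, one would call time_memset once for each with the
same size, misalign and iters, then pass the two times to relative_perf_pct,
repeating for small, medium and larger-than-cache sizes. Running
fills_correctly first guards against benchmarking a routine that is fast but
wrong.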