This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [PATCH]: Performance improve on ARM memset
- From: Phil Blundell <philb at gnu dot org>
- To: Min Zhang <mzhang at mvista dot com>
- Cc: libc-ports at sources dot redhat dot com
- Date: Wed, 29 Oct 2008 09:03:11 +0000
- Subject: Re: [PATCH]: Performance improve on ARM memset
- References: <4907A78A.30204@mvista.com>
On Tue, 2008-10-28 at 17:00 -0700, Min Zhang wrote:
> This patch improves the execution time of memset. I tested it with the
> "time" shell utility on the following test program; the patch reduced
> execution time by 50%. I also sanity-tested memset with lengths from 0
> to 1000 bytes to make sure it sets neither more nor fewer bytes than
> requested.
>
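> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
>     char *p = malloc(4096);
>     if (p == NULL)
>         return 1;
>     /* Clear the same 4096-byte buffer 100000 times. */
>     for (int i = 0; i < 100000; i++) {
>         memset(p, 0, 4096);
>     }
>     free(p);
>     return 0;
> }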
>
> Note: This patch partly undoes
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/ports/sysdeps/arm/memset.S.diff?r1=1.4&r2=1.5&cvsroot=glibc
> by reverting "str" back to the more efficient block store "stm" instruction.
> I am not sure of the reason behind the rev 1.5 change.
What CPU are you benchmarking on? I think the reason for the rev 1.5
change was that, on some processors (particularly StrongARM and/or
XScale), a two-word STM is slower than two STRs under most common
circumstances. If I remember right, STM will be 50% slower than STR+STR
on XScale if the writes hit in the cache.
The two circumstances I can think of where your change might be a win
are:
- CPUs with no icache, where reducing the number of i-fetches is
important (presumably not the case for you); or
- CPUs whose dcache allocates only on reads (most ARMs are like this)
and where STM gives you better external bus utilisation than STR+STR in
the case of a miss (I'm not sure offhand on which processors this is
true).
Can you try re-benchmarking your change against cached data to see what
happens there?
p.