This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.

Re: [PATCH]: Performance improve on ARM memset

From: Phil Blundell <philb at gnu dot org>
To: Min Zhang <mzhang at mvista dot com>
Cc: libc-ports at sources dot redhat dot com
Date: Wed, 29 Oct 2008 09:03:11 +0000
Subject: Re: [PATCH]: Performance improve on ARM memset
References: <4907A78A.30204@mvista.com>

On Tue, 2008-10-28 at 17:00 -0700, Min Zhang wrote:
> This patch improves the execution time of memset.  Timed with the "time"
> shell utility on the following test program, the patch reduced execution
> time by 50%.  I also sanity-tested memset with lengths from 0 to 1000
> bytes to make sure it sets neither too many nor too few bytes.
> 
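> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
>        char *p = malloc(4096);
>        if (p == NULL)
>                return 1;
>        for (int i = 0; i < 100000; i++) {
>                memset(p, 0, 4096);  /* the call being timed */
>        }
>        free(p);
>        return 0;
> }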
> 
> Note: This patch more or less undoes
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/ports/sysdeps/arm/memset.S.diff?r1=1.4&r2=1.5&cvsroot=glibc
> by reverting "str" back to the more efficient block store "stm" instruction.
> I am not sure of the reason behind the rev 1.5 change.

What CPU are you benchmarking on?  I think the reason for the rev 1.5
change was that, on some processors (particularly StrongARM and/or
XScale), a two-word STM is slower than two STRs under most common
circumstances.  If I remember right, STM is 50% slower than STR+STR
on XScale if the writes hit in the cache.

The two circumstances I can think of where your change might be a win
are:

- CPUs with no icache, where reducing the number of i-fetches is
important (presumably not the case for you); or

- CPUs whose dcache allocates only on reads (most ARMs are like this)
and where STM gives you better external bus utilisation than STR+STR in
the case of a miss (I'm not sure offhand on which processors this is
true).

Can you try re-benchmarking your change against cached data to see what
happens there?

p.

Follow-Ups:
- Re: [PATCH]: Performance improve on ARM memset
  - From: Min Zhang

References:
- [PATCH]: Performance improve on ARM memset
  - From: Min Zhang
