This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH]: Performance improve on ARM memset


On Tue, 2008-10-28 at 17:00 -0700, Min Zhang wrote:
> This patch improves the execution time of memset.  Tested with the "time"
> shell utility on the following test program. The patch reduced execution
> time by 50%. I also sanity-tested memset with lengths from 0 bytes to
> 1000 bytes, to make sure it doesn't set too many or too few bytes.
> 
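> #include <stdlib.h>
> #include <string.h>
> 
> int main(void)
> {
>        char* p = malloc(4096);
>        /* Clear the same 4096-byte buffer 100000 times. */
>        for (int i=0; i<100000; i++) {
>                memset(p, 0, 4096);
>        }
>        free(p);
>        return 0;
> }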
> 
> Note: This patch more or less undoes
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/ports/sysdeps/arm/memset.S.diff?r1=1.4&r2=1.5&cvsroot=glibc 
> by reverting "str" back to the more efficient multiple-store "stm"
> instruction. I am not sure of the reason behind the rev 1.5 change.

What CPU are you benchmarking on?  I think the reason for the rev 1.5
change was that, on some processors (particularly StrongARM and/or
XScale), a two-word STM is slower than two STRs under most common
circumstances.  If I remember right, STM will be 50% slower than STR+STR
on XScale if the writes hit in the cache.

The two circumstances I can think of where your change might be a win
are:

- CPUs with no icache, where reducing the number of i-fetches is
important (presumably not the case for you); or

- CPUs whose dcache allocates only on reads (most ARMs are like this)
and where STM gives you better external bus utilisation than STR+STR in
the case of a miss (I'm not sure offhand on which processors this is
true).

Can you try re-benchmarking your change against cached data to see what
happens there?
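
A minimal sketch of such a comparison, assuming a read-allocate dcache as
described above: reading the buffer once before the timed loop should pull
it into the cache, so the stores then hit, while skipping that step leaves
them missing. The names time_memset, memset_fn and MEMSET_BENCH_MAIN, the
4096-byte size and the pass count are illustrative assumptions, not part of
the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call memset through a volatile function pointer so the compiler cannot
   inline or drop the calls; we want to time the library routine itself. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;

/* Time `passes` memsets of `size` bytes and return CPU seconds, or a
   negative value if allocation fails.  If prime_cache is nonzero, the
   buffer is read once first; on a read-allocate dcache (an assumption
   about the target CPU) that should make the timed stores hit. */
static double time_memset(size_t size, long passes, int prime_cache)
{
    assert(size > 0 && passes > 0);

    unsigned char *p = malloc(size);
    if (p == NULL)
        return -1.0;

    /* Touch every page once so page faults are not part of the timing. */
    memset_fn(p, 1, size);

    if (prime_cache) {
        volatile unsigned char sink = 0;
        for (size_t i = 0; i < size; i++)
            sink ^= p[i];          /* reads allocate lines in the dcache */
    }

    clock_t t0 = clock();
    for (long i = 0; i < passes; i++)
        memset_fn(p, 0, size);
    clock_t t1 = clock();

    free(p);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

#ifdef MEMSET_BENCH_MAIN   /* build with -DMEMSET_BENCH_MAIN for a driver */
int main(void)
{
    /* 4096 bytes is assumed to fit in the dcache of the CPU under test. */
    double unprimed = time_memset(4096, 100000, 0);
    double primed   = time_memset(4096, 100000, 1);

    printf("not read first: %.3f s\n", unprimed);
    printf("read first:     %.3f s\n", primed);
    return 0;
}
#endif
```

Running this once with the rev 1.5 memset and once with the patched one
would give four numbers to compare. On a CPU whose dcache also allocates on
writes the two cases should look alike, so the result only separates hit
from miss where the read-allocate assumption holds.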

p.


