This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH]: Performance improvement on ARM memset


Phil Blundell wrote:
On Tue, 2008-10-28 at 17:00 -0700, Min Zhang wrote:
This patch improves the execution time of memset. I tested it with the "time" shell utility on the following test program; the patch reduced execution time by 50%. I also sanity-tested memset with lengths from 0 to 1000 bytes, just to make sure it doesn't set any extra or missing bytes.

#include <stdlib.h>
#include <string.h>

int main(void)
{
       char *p = malloc(4096);
       for (int i = 0; i < 100000; i++) {
               memset(p, 0, 4096);
       }
       return 0;
}

Note: This patch essentially undoes
http://sources.redhat.com/cgi-bin/cvsweb.cgi/ports/sysdeps/arm/memset.S.diff?r1=1.4&r2=1.5&cvsroot=glibc by reverting the "str" stores back to the more efficient "stm" block-store instruction. I am not sure of the reason behind the rev 1.5 change.
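
For readers unfamiliar with the two idioms, here is a minimal illustration of the difference, written as GCC inline asm for an ARM target. This is my own sketch, not the actual sysdeps/arm/memset.S code; the function names and the register pinning are just for exposition.

#include <stdint.h>

/* Eight bytes of fill value stored as two single-word stores
   (the rev 1.5 style). */
static inline void store8_with_str(uint32_t *p, uint32_t v)
{
        __asm__ volatile ("str %1, [%0]\n\t"
                          "str %1, [%0, #4]"
                          : /* no outputs */
                          : "r" (p), "r" (v)
                          : "memory");
}

/* Eight bytes of fill value stored as one two-register block store
   (the style this patch restores).  v is duplicated into r4/r5 so the
   STM register list stays in ascending order. */
static inline void store8_with_stm(uint32_t *p, uint32_t v)
{
        register uint32_t v1 __asm__ ("r4") = v;
        register uint32_t v2 __asm__ ("r5") = v;
        __asm__ volatile ("stmia %0, {%1, %2}"
                          : /* no outputs */
                          : "r" (p), "r" (v1), "r" (v2)
                          : "memory");
}

Which of the two forms is faster depends on the core's store pipeline and cache behaviour, which is what the discussion below is about.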

What CPU are you benchmarking on? I think the reason for the rev 1.5
change was that, on some processors (particularly StrongARM and/or
XScale), a two-word STM is slower than two STRs under most common
circumstances. If I remember right, STM will be 50% slower than STR+STR
on XScale if the writes hit in the cache.


I am using an ARMv6 core on an OMAP2430 board. I believe the architecture reference is http://www.arm.com/miscPDFs/14128.pdf. Here is the /proc/cpuinfo output:

Processor       : ARMv6-compatible processor rev 6 (v6l)
BogoMIPS        : 329.31
Features        : swp half thumb fastmult vfp edsp java
CPU implementer : 0x41
CPU architecture: 6TEJ
CPU variant     : 0x0
CPU part        : 0xb36
CPU revision    : 6
Cache type      : write-back
Cache clean     : cp15 c7 ops
Cache lockdown  : format C
Cache format    : Harvard
I size          : 32768
I assoc         : 4
I line length   : 32
I sets          : 256
D size          : 32768
D assoc         : 4
D line length   : 32
D sets          : 256

Hardware        : OMAP2430 sdp2430 board
Revision        : 24300224
Serial          : 0000000000000000


The two circumstances I can think of where your change might be a win
are:

- cpus with no icache, where reducing the number of i-fetches is
important (presumably not the case for you); or

- cpus whose dcache allocates only on reads (most ARMs are like this)
and where STM gives you better external bus utilisation than STR+STR in
the case of a miss (I'm not sure offhand on what processors this is
true).

Can you try re-benchmarking your change against cached data to see what
happens there?

p.

Same result: STM is faster. I reduced the length to memset(p, 0, 128), assuming 128 bytes is small enough to stay in the dcache when I run it in a tight loop. I also tried a much bigger length, memset(p, 0, 32K*16), assuming none of it will fit in the 32 KB dcache; STM is still faster in both cases.
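
For completeness, here is a minimal sketch of the cached vs. uncached timing runs described above. The harness, names and iteration counts are my own illustration (assuming a POSIX clock_gettime()), not the exact setup behind the numbers quoted in this thread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` calls of memset(p, 0, len) and return elapsed seconds. */
static double time_memset(char *p, size_t len, int iters)
{
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
                memset(p, 0, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        size_t small = 128;            /* easily stays in the 32 KB dcache */
        size_t large = 32 * 1024 * 16; /* 512 KB: far larger than the dcache */
        char *p = malloc(large);
        if (p == NULL)
                return 1;

        printf("cached   (%zu bytes): %f s\n", small, time_memset(p, small, 1000000));
        printf("uncached (%zu bytes): %f s\n", large, time_memset(p, large, 10000));

        free(p);
        return 0;
}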

