This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?

From: Ondřej Bílka <neleai at seznam dot cz>
To: Wilco Dijkstra <wdijkstr at arm dot com>
Cc: 'Richard Earnshaw' <Richard dot Earnshaw at foss dot arm dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Tue, 9 Jun 2015 10:53:23 +0200
Subject: Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
Authentication-results: sourceware.org; auth=none
References: <A610E03AD50BFC4D95529A36D37FA55E769B14FEFF at GEORGE dot Emea dot Arm dot com> <000901d09ecd$5dc2b4b0$19481e10$ at com>

On Thu, Jun 04, 2015 at 02:50:07PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> > > On 25/05/15 12:45, Ondřej Bílka wrote:
> > > > Replaces it with strcpy. One could argue that opposite way to replace
> > > > strcpy with stpcpy is faster.
> > > >
> > > > Reason is register pressure. Strcpy needs extra register to save return
> > > > value while stpcpy has return value already in register used for writing
> > > > terminating zero.
> > >
> > >
> > > Depends on your architecture.  On aarch64 we have plenty of spare
> > > registers, so strcpy simply copies the destination register into a
> > > scratch.  It then doesn't have to carefully calculate the return value
> > > at the end of the function (making the tail code simpler - there are
> > > multiple return statements, but only one entry point).
> > >
> > Thats correct, main saving you get is from return value is first register, that
> > forces needing extra copy which is suboptimal.
> 
> No you don't need an extra copy. The current AArch64 strcpy code doesn't do it, 
> neither does my new strlen code, memcpy, memset or memmove. 
> There is however an overhead in returning the last byte for stpcpy.
> 
> > I don't have data how strcpy and stpcpy mix and want to know if few
> > extra cycles are worth it when these aren't called exactly often, I will
> > try to think how test these.
> 
> The usual problem of knowing whether all targets define assembler versions of
> stpcpy applies - so I don't think it is a good idea to change all strcpy into
> stpcpy in general. The only useful case is strcpy(x,y)+strlen(x) which could 
> potentially give a major speedup.
> 
Then it is a situation where the decision depends on implementation details,
as on some architectures you could save some cycles with stpcpy itself.
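
To make the strcpy(x,y)+strlen(x) case quoted above concrete, here is a
minimal sketch of the two forms. The function names and the same_end
self-check are hypothetical and only for illustration; it assumes a libc
that provides stpcpy (POSIX.1-2008 or _GNU_SOURCE).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>

/* Two-pass form: strcpy writes the string, then strlen rescans it. */
char *copy_end_two_pass(char *dst, const char *src) {
  return strcpy(dst, src) + strlen(dst);
}

/* One-pass form: stpcpy already returns a pointer to the terminating zero. */
char *copy_end_one_pass(char *dst, const char *src) {
  return stpcpy(dst, src);
}

/* Hypothetical self-check: both forms yield the same end offset. */
int same_end(const char *src) {
  char a[64], b[64];
  assert(strlen(src) < sizeof a);
  return copy_end_two_pass(a, src) - a == copy_end_one_pass(b, src) - b;
}
```

Whether the one-pass form is actually faster still depends on how well
stpcpy is implemented on the target, which is the point under discussion.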

As for useful cases, on the gcc thread I said that gcc could use the
available length to convert strchr to memchr and make similar
optimizations, so strcpy will be called more.
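
As a rough sketch of that kind of conversion, assuming the length is
available from the end pointer that stpcpy returns: the names
copy_and_find and matches_strchr are hypothetical, and this is what a
compiler could do in principle, not what gcc does today.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>

/* Copy src into dst and search the copy for c.  The end pointer from
   stpcpy gives the length, so memchr can replace strchr.  Searching
   len + 1 bytes keeps strchr's behaviour of matching the terminating
   zero when c is 0. */
char *copy_and_find(char *dst, const char *src, int c) {
  char *end = stpcpy(dst, src);
  size_t len = (size_t)(end - dst);
  return memchr(dst, c, len + 1);
}

/* Hypothetical self-check: the result must agree with plain strchr. */
int matches_strchr(const char *src, int c) {
  char buf[64];
  assert(strlen(src) < sizeof buf);
  return copy_and_find(buf, src, c) == strchr(buf, c);
}
```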

As for the cache issues I mentioned, so far I have measured mostly noise.
I know that overall stpcpy is often called five times less than strcpy, so
the potential is there, but it depends on the actual savings when strcpy
costs a cycle less.
Data about strcpy and stpcpy calls when running make on zlib with Debian
gcc-5 follow:

./summary_strcpy 

calls 52218
average n:   71.0    n <= 0:   7.4% n <= 4:  37.1% n <= 8:  52.8% n <= 16:  69.7% n <= 24:  77.4% n <= 32:  81.8% n <= 48:  86.6% n <= 64:  91.4%
s aligned to 4 bytes:  87.1%  8 bytes:  86.0% 16 bytes:   7.4%
average *s access cache latency    2.5    l <= 8:  95.3% l <= 16:  99.5% l <= 32:  99.9% l <= 64:  99.9% l <= 128:  99.9%
s2 aligned to 4 bytes:  84.9%  8 bytes:  79.9% 16 bytes:  10.5%
s-s2 aligned to 4 bytes:  78.4%  8 bytes:  72.7% 16 bytes:  67.8%
average *s2 access cache latency    1.7    l <= 8:  95.5% l <= 16:  99.4% l <= 32:  99.9% l <= 64:  99.9% l <= 128:  99.9%

./summary_stpcpy 

calls 4950
average n:    7.5    n <= 0:   1.7% n <= 4:  76.9% n <= 8:  77.2% n <= 16:  79.1% n <= 24:  87.6% n <= 32:  95.5% n <= 48:  95.5% n <= 64: 100.0%
s aligned to 4 bytes:  24.7%  8 bytes:  24.7% 16 bytes:  24.7%
average *s access cache latency    0.5    l <= 8:  99.4% l <= 16:  99.9% l <= 32:  99.9% l <= 64:  99.9% l <= 128: 100.0%
s2 aligned to 4 bytes:  95.2%  8 bytes:  24.7% 16 bytes:  24.7%
s-s2 aligned to 4 bytes:  24.7%  8 bytes:  24.7% 16 bytes:  24.7%
average *s2 access cache latency    0.5    l <= 8:  98.9% l <= 16:  99.9% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0%

I originally tried to use perf record make and look at the number of icache
misses with the following wrapper preloaded. First, I could not get perf to
record recursively, as it only records make but not the commands it
invokes. Second, for make itself there is a lot of noise in icache misses,
so it needs to be measured more carefully.

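#define _GNU_SOURCE
#include <dlfcn.h>

/* Pointer to the next strcpy in symbol lookup order (normally libc's). */
char *(*strp)(char *, const char *);

/* Resolve the real strcpy when the preloaded library is loaded. */
void __attribute__((constructor)) foo(void) {
  strp = (char *(*)(char *, const char *)) dlsym(RTLD_NEXT, "strcpy");
}

/* Interposed strcpy: forward to the real one, then return the destination. */
char *strcpy(char *x, const char *y) {
  strp(x, y);
  return x;
}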

Follow-Ups:
- RE: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  - From: Wilco Dijkstra

References:
- RE: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  - From: Wilco Dijkstra
