
Gcc builtin review: memcpy, mempcpy, memmove.


Andrew, as promised, here is the review and what you would need to fix for
x64 gcc to remove the ugly gcc hacks and replace them with nice headers :-)

I will take the functions in order from string.h and comment on each.

Lose all hope: these are impossible to optimize purely with gcc.

These are the hardest functions to optimize. An optimal implementation
depends on the cpu and the size, and on whether the inputs are likely to
be in L1, L2, or L3 cache; the relationship is quite chaotic.

With gcc, unless you start doing per-cpu function cloning and optimizing
for each cpu (which would be a good idea if you could convince maintainers
to live with the size bloat), there will always be some cpu where you
lose.

That is because for two candidate loops A and B, A can be 50% faster than
B on cpu 1 while B is 50% faster on cpu 2.
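
If you did go the cloning route, gcc already has a mechanism for it. A
minimal sketch, assuming gcc's target_clones attribute (the function and
the fixed size here are made up for illustration):

#include <string.h>

/* gcc emits one clone per listed target plus an ifunc resolver that
   picks one at load time -- exactly the size bloat maintainers would
   have to accept.  */
__attribute__ ((target_clones ("avx2", "sse2", "default")))
void copy_block(char *dst, const char *src)
{
  memcpy(dst, src, 256);
}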

The main weakness on current cpus lies in the 100-1000 byte range. In the
following benchmark, the -march=native gcc expansion is outperformed by at
least 10% by the obsolete memcpy_sse2 from glibc-2.13 on sandy_bridge and
haswell.

As memcpy_sse2 uses only 8-byte loads, that speaks for itself; the newer
implementations are faster still.
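
To make that concrete, the hot loop of memcpy_sse2 boils down to roughly
the following C sketch (illustrative only; the real code is assembly and
handles alignment and the tail more carefully):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void copy8(char *dst, const char *src, size_t n)
{
  size_t i;
  /* One 8-byte load and one 8-byte store per iteration.  */
  for (i = 0; i + 8 <= n; i += 8)
    {
      uint64_t w;
      memcpy(&w, src + i, 8);
      memcpy(dst + i, &w, 8);
    }
  /* Byte tail.  */
  for (; i < n; i++)
    dst[i] = src[i];
}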

A benchmark to check the intrinsic follows. Compile a.c with
-mstringop-strategy=libcall and you will see a performance improvement. Of
course, compile a.c and main.c separately so the memcpy cannot be
optimized out.

a.c (constant size):

#include <string.h>

void *memcpy2(char *c, char *s)
{
  return memcpy(c, s, 200);
}

a.c, variant where the size is not a compile-time constant:

#include <string.h>

extern int n;

void *memcpy2(char *c, char *s)
{
  return memcpy(c, s, n);
}

main.c:

#include <stdlib.h>

void *memcpy2(char *c, char *s);

int n = 200;

int main(void)
{
  int i;
  char u[40000], v[40000];
  for (i = 0; i < 1000000; i++)
    memcpy2(u + 16 * ((unsigned int) rand ()) % (1 << 10),
            v + 16 * ((unsigned int) rand ()) % (1 << 10));
  return 0;
}
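
For instance (exact numbers will differ; the iteration count in main.c may
need tuning on your machine):

gcc -O2 -march=native -c a.c
gcc -O2 main.c a.o -o bench && time ./bench
gcc -O2 -march=native -mstringop-strategy=libcall -c a.c
gcc -O2 main.c a.o -o bench && time ./bench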

So far we have dealt with suboptimal strategy selection.

Then you have the problem that the optimum also depends on profile
feedback. Even if you fixed the previous selection, it would still be
wrong when you know that the input is mainly in L3 cache. For some reason
rep movsq is by far the best implementation for these inputs on almost any
intel cpu.
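
For reference, the rep movs strategy is essentially the following (a
hypothetical helper, not glibc's actual code; assumes x86-64, where the
ABI guarantees the direction flag is clear):

#include <stddef.h>

static void *copy_rep_movs(void *dst, const void *src, size_t n)
{
  void *ret = dst;
  size_t q = n / 8, r = n % 8;
  /* Copy 8-byte chunks, then the byte tail.  */
  __asm__ volatile ("rep movsq"
                    : "+D" (dst), "+S" (src), "+c" (q) : : "memory");
  __asm__ volatile ("rep movsb"
                    : "+D" (dst), "+S" (src), "+c" (r) : : "memory");
  return ret;
}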

The next problem is code size; I filed a bug about that. Gcc does too
excessive an expansion:

memcpy(c, s, 120);

gets expanded to a 125-byte sequence of movs.

While it may improve overall performance, most of these expansions are
wrong, as they lie on a cold path and could easily introduce an extra
60-cycle penalty for fetching the instructions.
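
A made-up example of the pattern I mean (names hypothetical): the copy
sits on an error path that almost never runs, so inlining ~125 bytes of
movs there buys nothing and only costs icache:

#include <string.h>

void report_error(char *log, const char *msg, int failed)
{
  if (__builtin_expect(failed, 0))  /* cold path */
    memcpy(log, msg, 120);          /* still expanded inline */
}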

An ugly problem is that up to a certain size gcc needs to expand these for
optimizations to work: one wants to be able to determine that b[1] is 'b'
in the following fragment and do constant propagation of copied
structures. The question is beyond what size that stops making sense.

char b[3], *a = "abc";
memcpy(b, a, 3);
return b[1];  /* should fold to 'b' */
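
You can check where gcc manages the fold with a tree dump, for example:

gcc -O2 -c -fdump-tree-optimized frag.c

and look for return 98; in the dump (98 is 'b'; frag.c is just a
placeholder name).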


