This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] improve generic strlen.


On Tue, Dec 23, 2014 at 02:40:44PM -0500, Rich Felker wrote:
> On Tue, Dec 23, 2014 at 07:57:14PM +0100, Ondřej Bílka wrote:
> > Hi
> > 
> > As I promised earlier, here is a working version of the improved
> > strlen. For long strings on x64 it is twice as fast as the current C
> > version, or 50% slower when there is a 1/128 probability of character
> > 128.
> > 
> > For simplicity it assumes unaligned loads. For aligned ones I would
> > need to add a define to mask the first bytes (a sketch of this masking
> > follows the quoted text).
> > 
> > Could you try this on ARM or another arch to see what improvement is
> > possible?
> > 
> > Also, while the header is the most important part, it needs
> > platform-specific optimizations, so if this is an improvement the next
> > step is to take the generated assembly and fix gcc's mistakes.
> 
> FYI, the fastest strlen I've found on 32-bit x86 (and perhaps other
> 32-bit archs; I don't remember) is this:
> 
> #include <stddef.h>	/* for size_t */
> 
> /* Byte-at-a-time strlen, unrolled 16x: strictly conforming C, with no
>    word-sized loads and no out-of-bounds reads.  */
> size_t unrolled_strlen(const char *s) {
> 	const char *p = s;
> 	for (;;) {
> 		if (!p[0]) return p-s;
> 		if (!p[1]) return p-s+1;	
> 		if (!p[2]) return p-s+2;
> 		if (!p[3]) return p-s+3;
> 		if (!p[4]) return p-s+4;
> 		if (!p[5]) return p-s+5;
> 		if (!p[6]) return p-s+6;
> 		if (!p[7]) return p-s+7;
> 		if (!p[8]) return p-s+8;
> 		if (!p[9]) return p-s+9;
> 		if (!p[10]) return p-s+10;
> 		if (!p[11]) return p-s+11;
> 		if (!p[12]) return p-s+12;
> 		if (!p[13]) return p-s+13;
> 		if (!p[14]) return p-s+14;
> 		if (!p[15]) return p-s+15;
> 		p += 16;
> 	}
> }
> 
> The nice thing is that it really is pure C -- no UB hacks like reading
> past the end of the string and no aliasing violations. Sadly it can't
> keep up with larger-than-32-bit vectorization, but it's still pretty
> close. Also, sadly, GCC cannot do the unrolling properly automatically;
> it generates some very slow code with the loop overhead repeated
> between every load-and-branch.
> 
> On my Atoms it runs at 1.15 cycles per byte. I don't have the figures
> for other CPUs handy.
> 
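On the aligned-loads point above: the first-bytes masking could look
roughly like this (an illustration assuming little endian and 8-byte
words, not the actual define):

#include <stdint.h>
#include <string.h>

/* Round s down to a word boundary, load the whole word, then force
   the bytes before s to 0xff so no zero-byte test can fire on them.
   The bytes before s lie in the same aligned word, so the wide load
   stays inside one word, the usual strlen trick.  */
static inline uint64_t
first_word_masked (const char *s)
{
  const char *aligned = (const char *) ((uintptr_t) s & ~(uintptr_t) 7);
  uint64_t w;
  memcpy (&w, aligned, sizeof w);	/* aligned load */
  unsigned shift = 8U * (unsigned) ((uintptr_t) s & 7);
  return w | ~(~(uint64_t) 0 << shift);
}
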
How does that compare to what I posted when you change vec_t to the
following?

typedef uint32_t vec_t;
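
With 32-bit vectors the supporting constants would look roughly as
below (my sketch; the patch has its own definitions). Note that the
fast test fires on 0x80 bytes as well as on 0x00, which is why a 1/128
probability of character 128 shows up in the timings above:

#include <stdint.h>

typedef uint32_t vec_t;	/* as above */
#define ONES      ((vec_t) 0x01010101U)
#define HIGH_BITS ((vec_t) 0x80808080U)

/* Fast "maybe a zero byte" test: a byte of 0x00 fires
   (0x00 ^ 0xff == 0xff), but so does 0x80 (0x80 ^ 0x7f == 0xff),
   so a hit needs an exact recheck.  */
static inline vec_t
maybe_zero (vec_t v)
{
  return (v ^ (v - ONES)) & HIGH_BITS;
}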


Also there is a second trick that helps on some workloads but needs
runtime profiling, as it works only sometimes: change the loop's fast
check to the following heuristic:


  while (1)
    {
      s += ITER_SIZE;

      s0 = LOAD (s);
      s1 = LOAD (s + VEC_SIZE);
      /* AND pairs of vectors: x & 0 == 0, so a zero byte in either
         operand survives into the AND, and one mask computation covers
         two loads.  The AND can also fabricate 0x00 or 0x80 bytes
         (e.g. 0xf0 & 0x0f == 0x00), which is why it pays off only on
         some workloads.  */
      s0 &= LOAD (s + 2 * VEC_SIZE);
      s1 &= LOAD (s + 3 * VEC_SIZE);
      m0 = s0 ^ (s0 - ONES);
      m1 = s1 ^ (s1 - ONES);
      m = (m0 | m1) & HIGH_BITS;

      if (m)
        /* possible zero byte or false positive: exact recheck follows */
