This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] improve generic strlen.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Rich Felker <dalias at libc dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 6 Jan 2015 19:16:04 +0100
- Subject: Re: [RFC] improve generic strlen.
- Authentication-results: sourceware.org; auth=none
- References: <20141223185714 dot GA7889 at domone> <20141223194044 dot GE4574 at brightrain dot aerifal dot cx>
On Tue, Dec 23, 2014 at 02:40:44PM -0500, Rich Felker wrote:
> On Tue, Dec 23, 2014 at 07:57:14PM +0100, Ondřej Bílka wrote:
> > Hi
> >
> > As I promised earlier, here is a working version of the improved strlen.
> > For long strings on x64 it's twice as fast as the current C version, or 50%
> > slower when there is a 1/128 probability of character 128.
> >
> > For simplicity it assumes unaligned loads. For aligned ones I would need
> > to add a define to mask the first bytes.
> >
> > Could you try this on arm or other arch to see what improvement is
> > possible?
> >
> > Also, while the header is the most important part, it needs platform-specific
> > optimizations, so if this is an improvement the next step is to take the
> > assembly and fix gcc's mistakes.
>
> FYI, the fastest strlen I've found on 32-bit x86 (and perhaps other
> 32-bit archs; I don't remember) is this:
>
> size_t unrolled_strlen(const char *s) {
>     const char *p = s;
>     for (;;) {
>         if (!p[0]) return p-s;
>         if (!p[1]) return p-s+1;
>         if (!p[2]) return p-s+2;
>         if (!p[3]) return p-s+3;
>         if (!p[4]) return p-s+4;
>         if (!p[5]) return p-s+5;
>         if (!p[6]) return p-s+6;
>         if (!p[7]) return p-s+7;
>         if (!p[8]) return p-s+8;
>         if (!p[9]) return p-s+9;
>         if (!p[10]) return p-s+10;
>         if (!p[11]) return p-s+11;
>         if (!p[12]) return p-s+12;
>         if (!p[13]) return p-s+13;
>         if (!p[14]) return p-s+14;
>         if (!p[15]) return p-s+15;
>         p += 16;
>     }
> }
>
> The nice thing is that it really is pure C -- no UB hacks like reading
> past the end of the string or aliasing violations. Sadly it can't keep
> up with larger-than-32-bit vectorization though, but it's still pretty
> close. Also sadly GCC cannot do the unrolling properly automatically;
> it generates some very slow code with loop overhead repeated between
> every load-and-branch.
>
> On my atoms it runs at 1.15 cycles per byte. I don't have the figures
> for other cpus handy.
>
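The quoted routine is self-contained C and can be compiled and sanity-checked on its own. The function body below is copied verbatim from the quote; the test strings in any harness around it are arbitrary:

```c
#include <stddef.h>

/* Rich's unrolled byte-at-a-time strlen, copied from the quote above.
   Pure C: it returns at the first NUL it sees, so it never reads past
   the end of the string. */
size_t unrolled_strlen(const char *s)
{
    const char *p = s;
    for (;;) {
        if (!p[0]) return p-s;
        if (!p[1]) return p-s+1;
        if (!p[2]) return p-s+2;
        if (!p[3]) return p-s+3;
        if (!p[4]) return p-s+4;
        if (!p[5]) return p-s+5;
        if (!p[6]) return p-s+6;
        if (!p[7]) return p-s+7;
        if (!p[8]) return p-s+8;
        if (!p[9]) return p-s+9;
        if (!p[10]) return p-s+10;
        if (!p[11]) return p-s+11;
        if (!p[12]) return p-s+12;
        if (!p[13]) return p-s+13;
        if (!p[14]) return p-s+14;
        if (!p[15]) return p-s+15;
        p += 16;
    }
}
```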
How does that compare to what I posted when you change vec_t to the following?
typedef uint32_t vec_t;
Also there is a second trick that helps on some workloads, but it needs
runtime profiling as it only works sometimes: change the fast check in
the loop to the following heuristic:
  while (1)
    {
      s += ITER_SIZE;
      s0 = LOAD (s);
      s1 = LOAD (s + VEC_SIZE);
      s0 &= LOAD (s + 2 * VEC_SIZE);
      s1 &= LOAD (s + 3 * VEC_SIZE);
      m0 = s0 ^ (s0 - ONES);
      m1 = s1 ^ (s1 - ONES);
      m = (m0 | m1) & HIGH_BITS;
      if (m)
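For comparison, the word-at-a-time idea behind a 32-bit vec_t can be sketched as a complete standalone strlen. This is a sketch under stated assumptions, not the posted patch: it uses the exact zero-byte test (v - ONES) & ~v & HIGH_BITS instead of the cheaper v ^ (v - ONES) fast check above (which also fires on byte 0x80 and needs a re-check), it aligns first and loads through memcpy, and it resolves the terminating byte with a plain byte scan to stay endian-neutral:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES      ((uint32_t)0x01010101u)
#define HIGH_BITS ((uint32_t)0x80808080u)

/* Sketch of a generic 32-bit word-at-a-time strlen (assumed names, not
   the posted code).  Note: once aligned, it reads the whole word that
   contains the NUL, so like the generic strlen it can read up to three
   bytes past the terminator -- conventional, but not strictly
   conforming C. */
size_t word_strlen(const char *s)
{
    const char *p = s;

    /* Handle leading bytes until p is word-aligned. */
    while ((uintptr_t)p % sizeof(uint32_t)) {
        if (!*p)
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);            /* aligned 4-byte load */
        if ((v - ONES) & ~v & HIGH_BITS)    /* some byte of v is zero */
            break;
        p += sizeof(uint32_t);
    }

    /* Locate the zero byte with a byte scan rather than depending on
       endianness-specific bit tricks. */
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

Bytes preceding the first NUL are nonzero, so subtraction borrows cannot produce a false high bit before the terminator, and the break always lands on the word containing it.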