This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH v3] faster strlen on x64

From: Ondřej Bílka <neleai at seznam dot cz>
To: Dmitrieva Liubov <liubov dot dmitrieva at gmail dot com>
Cc: libc-alpha at sourceware dot org
Date: Wed, 6 Feb 2013 00:44:15 +0100
Subject: Re: [PATCH v3] faster strlen on x64
References: <20130131095215.GA31998@domone.kolej.mff.cuni.cz><CAHjhQ913DnUCSbkSXswd=C-k39L02cNmgEQbwh9PMSh1JkGnvA@mail.gmail.com>

On Thu, Jan 31, 2013 at 03:40:44PM +0400, Dmitrieva Liubov wrote:
> Looks good to me.
> I don't see format issues for this version.
> 
> Do you have strnlen performance data as your patch impacts strnlen also?
> 
> Can you please extract short performance review like average gain for
> AMD, Atom, SNB, IVX, Haswell in %.
> 
Best would be to look at the graphs at
http://kam.mff.cuni.cz/~ondra/benchmark_string/strlen_profile.html

Real performance depends on many factors. I describe them below, but it
will not be brief.

My implementation takes profiling information into account (see
http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/result.html for my
workload). An important property there is that most strings are at most
80 bytes long.
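
To make the kind of profile data referred to above concrete, here is a
minimal sketch of a length histogram that a strlen hook could feed. It is
not the code behind the linked profile. The 16-byte bucket width, the
bucket count, and the names len_hist, len_hist_record and
len_hist_fraction_below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: 16-byte buckets; the last bucket collects
   every string of 256 bytes or more. */
#define BUCKET_BYTES 16
#define NUM_BUCKETS  17

struct len_hist {
    unsigned long count[NUM_BUCKETS];
    unsigned long total;
};

/* Record the length of one string, as a strlen hook might do. */
static void len_hist_record(struct len_hist *h, const char *s)
{
    assert(h != NULL && s != NULL);
    size_t bucket = strlen(s) / BUCKET_BYTES;
    if (bucket >= NUM_BUCKETS)
        bucket = NUM_BUCKETS - 1;
    h->count[bucket]++;
    h->total++;
}

/* Fraction of recorded strings strictly shorter than `limit` bytes.
   `limit` must be a multiple of BUCKET_BYTES and at most 256. */
static double len_hist_fraction_below(const struct len_hist *h, size_t limit)
{
    assert(limit % BUCKET_BYTES == 0);
    assert(limit <= (NUM_BUCKETS - 1) * BUCKET_BYTES);
    if (h->total == 0)
        return 0.0;
    unsigned long below = 0;
    for (size_t b = 0; b < limit / BUCKET_BYTES; b++)
        below += h->count[b];
    return (double)below / (double)h->total;
}
```

With such a histogram, a statement like "most strings are short" becomes
a single number, for example len_hist_fraction_below(&h, 80).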

When data are in L1 cache, my implementation is around 20% faster for
strings up to 64 bytes long. For 64-256 byte strings my implementation is
sometimes slightly slower than the pminub one. This is because entering
the loop incurs a penalty. It repays us at 16 blocks/256 bytes, where the
pminub implementation hits its loop and we become around 50 cycles faster.

When reading data from main memory, my implementation is slower for the
first 32 bytes; after that the implementations are close to each other.
This is caused by my implementation reading more bytes at once than
pminub, which allows my implementation to run faster in practical cases.
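
For readers unfamiliar with the pminub technique mentioned above, the
following is a minimal sketch in C with SSE2 intrinsics. It is neither the
code from the patch nor the existing glibc assembly. The function name
first_zero_in_64 and the 64-byte block size are assumptions for
illustration, and __builtin_ctzll assumes GCC or Clang. The byte-wise
unsigned minimum (pminub) of four 16-byte blocks has a zero byte exactly
when one of the blocks has one, so a single compare covers 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; x86 and x86_64 only */

/* Return the index of the first zero byte among the 64 bytes at p,
   or 64 if there is none.  p must be 16-byte aligned. */
static size_t first_zero_in_64(const unsigned char *p)
{
    assert(((uintptr_t)p & 15) == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i a = _mm_load_si128((const __m128i *)(p));
    __m128i b = _mm_load_si128((const __m128i *)(p + 16));
    __m128i c = _mm_load_si128((const __m128i *)(p + 32));
    __m128i d = _mm_load_si128((const __m128i *)(p + 48));

    /* pminub: a lane of the minimum is zero iff that lane is zero
       in at least one of the four blocks. */
    __m128i m = _mm_min_epu8(_mm_min_epu8(a, b), _mm_min_epu8(c, d));
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(m, zero)) == 0)
        return 64;

    /* A zero byte exists: build a 64-bit mask with bit i set iff
       p[i] == 0, then find the lowest set bit. */
    uint64_t mask =
          (uint64_t)(uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a, zero))
        | (uint64_t)(uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(b, zero)) << 16
        | (uint64_t)(uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(c, zero)) << 32
        | (uint64_t)(uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(d, zero)) << 48;
    return (size_t)__builtin_ctzll(mask);
}
```

A full strlen additionally has to handle the unaligned start of the
string and page boundaries, which this sketch leaves out.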

When almost all strings are at most 16 bytes long I could add an
additional test which speeds this case up. However, it also slows the
rest of the implementation down. In the workloads I observed, larger
strings make this worse in total. I cannot exploit it without doing
profile-guided optimization.

This is illustrated by profiling gcc: up to 16 bytes the pminub
implementation is somewhat faster on older architectures, then my
implementation gets significant savings for 16-48 byte strings. The gcc
benchmark does not provide enough data for larger strings.
Note that this is slower than the rand benchmark; the probable cause is
cache misses.

For large strings speed depends pretty much only on data placement in
caches. 

For data in L1 cache my implementation is slightly faster on all
architectures except Phenom II, where it is asymptotically faster.

For L2 and further caches my implementation is near the optimum. I updated
my older benchmark at http://kam.mff.cuni.cz/~ondra/benchmark_string/
(strlen part). It compares my implementation with a loop which only loads
the string into registers and does nothing else. My implementation gets
close to this bound, and to improve it more we need a faster loading
strategy.
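
The idea of comparing against a load-only loop can be sketched as
follows. This is not the benchmark at the URL above. The names
scan_load_only, scan_strlen and seconds_per_call are hypothetical, the
loop uses 8-byte words rather than SSE registers, and clock() is a coarse
timer chosen only to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef size_t (*scan_fn)(const char *buf, size_t n);

/* Load-only baseline: read n bytes (n a multiple of 8) as 8-byte
   words and OR them together so the loads cannot be discarded. */
static size_t scan_load_only(const char *buf, size_t n)
{
    assert(n % 8 == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);  /* alignment-safe load */
        acc |= w;
    }
    return (size_t)acc;
}

/* The function under test, wrapped to share the same signature. */
static size_t scan_strlen(const char *buf, size_t n)
{
    (void)n;
    return strlen(buf);
}

/* Average CPU seconds per call over `reps` calls.  The volatile
   function pointer and sink keep the compiler from hoisting or
   removing the calls. */
static double seconds_per_call(scan_fn fn, const char *buf, size_t n, int reps)
{
    assert(reps > 0);
    scan_fn volatile f = fn;
    volatile size_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        sink += f(buf, n);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Running seconds_per_call with both functions on buffers sized for L1, L2
and main memory shows how far a strlen is from the cost of merely reading
the data.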

> --
> Liubov Dmitrieva
> Software Engineer
> Intel Corporation
> 
> 2013/1/31 Ondřej Bílka <neleai@seznam.cz>:
> > Hi,
> >
> > After testing by Liuba I prepared the final version of my patch
> > (attached and on the neleai/strlen branch).
> >
> > I used hooking to examine the behaviour of implementations in the
> > wild; it can be downloaded from
> > http://kam.mff.cuni.cz/~ondra/strlen_profile.tar.bz2
> > (Run ./benchmarks for unit tests, read TODO as it is not complete.)
> >
> > No additional failures on x64.
> >
> > Uses of strlen_* in strcat are inlined for now, optimizations will come
> > after I deal with strcpy.
> >
> > It could also be used in the linker; I split this functionality into
> > an additional patch.
> >
> > Ondra
> >
> > 2013-01-31  Ondrej Bilka  <neleai@seznam.cz>
> >
> >         * sysdeps/x86_64/strlen.S: Replace with new SSE2 based
> >         implementation which is faster on all x86_64 architectures.
> >         Tested on AMD, Intel Nehalem, Atom, SNB, IVB, Haswell.
> >         * sysdeps/x86_64/strnlen.S: Likewise.
> >
> >         * sysdeps/x86_64/multiarch/Makefile (sysdep_routines):
> >         Remove all multiarch strlen and strnlen versions.
> >         * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update.
> >         Remove strlen and strnlen related parts.
> >
> >         * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Update.
> >         Inline strlen part.
> >         * sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
> >
> >         * sysdeps/x86_64/multiarch/strlen.S: Remove.
> >         * sysdeps/x86_64/multiarch/strlen-sse2-no-bsf.S: Remove.
> >         * sysdeps/x86_64/multiarch/strlen-sse2-pminub.S: Remove.
> >         * sysdeps/x86_64/multiarch/rtld-strlen.S: Remove.
> >         * sysdeps/x86_64/multiarch/strlen-sse4.S: Remove.
> >         * sysdeps/x86_64/multiarch/strnlen.S: Remove.
> >         * sysdeps/x86_64/multiarch/strnlen-sse2-no-bsf.S: Remove.

Follow-Ups:
- Re: [PATCH v3] faster strlen on x64
  - From: Carlos O'Donell
