This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Proposal to handle __strstr_sse42 and friends issue on x86


On Thu, Jan 09, 2014 at 11:58:56PM +0000, Nix wrote:
> On 14 Dec 2013, OndÅej BÃlka said:
> 
> > On Wed, Dec 11, 2013 at 10:55:01AM +1000, Allan McRae wrote:
> >> would likely remove any advantage of the sse42 routine (not tested...),
> >> and there are proposals to remove the sse42 routines for both x86 and
> >> x86_64 due to quadratic complexity anyway [3,4].
> 
> Please. Half the GNU tools end up replacing strstr() with gnulib's
> replacement strstr anyuway because of this. And they're right to do so.
>
Already done
 
> > sse42 routines are quite ineffective in that regard, with plain sse2 you
> > can get around five times faster. I planned to add a version that avoids
> > unaligned loads for older processors.
> 
> I'd say it's not worth bothering with any of this unless it implements
> the same algorithm as the C strstr(), rather than implementing something
> with quadratic slowdown in really fast assembler. It doesn't matter if
> strstr() is an imperceptible little bit faster on tiny needle / haystack
> combinations if it slows down quadratically on the big ones 

You are wrong here, its linear time, there is quite standard buy-or-rent
technique for that. You just maintain number of comparisons and switch
to two way algorithm when greater than twice number of bytes processed. 

That is exactly same discussion as here
https://www.sourceware.org/ml/libc-alpha/2013-08/msg00455.html
, please read archives before posting duplicate posts.


> where its
> performance hit is in any case most noticeable.

Also false, a biggest hit is when user splits file to lines and then
calls strstr on each line, a startup cost forms majority of time spend,
in my tests around 80% in time is spend in precomputing needle.

>  (Do we even know the
> distribution of needle / haystack sizes on real systems? A preloaded
> wrapper could tell us...)
> 
Already done, here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/strstr_profile250813.tar.bz2

Most programs have same pattern as gcc which is caused by using a strstr
in linker:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strstr_profile/results_gcc/result.html

There 85% of input have haystack smaller than 16 bytes so please explain
how you could get improvement from algorithm.

> > You can also use this one you just improve performance 15 times instead
> > 30 if you expanded unaligned loads into aligned ones.
> 
> A 15-fold improvement is peanuts compared to the speedups you get from a
> better algorithm 

You quoted this bit off context, that comment compares my sse2 version
versus glibc two-way version. For most worloads this holds, for
pathological it switches to two-way algorithm and speed would be about
same. With suitably choosen constant for buy-or-rent switch my algorithm
would always be at most twice of time it takes a two-way algorithm.

> -- and the generic code has a better algorithm than the
> SSE4.2 code.

This is irelevant as we are not talking about SSE4.2 algorithm here.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]