This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #12100] strstr with unaligned loads.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Rich Felker <dalias at aerifal dot cx>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 27 Aug 2013 14:16:41 +0200
- Subject: Re: [PATCH][BZ #12100] strstr with unaligned loads.
- Authentication-results: sourceware.org; auth=none
- References: <20130825090758 dot GA23686 at domone dot kolej dot mff dot cuni dot cz> <20130825155406 dot GV20515 at brightrain dot aerifal dot cx> <20130826070925 dot GA4292 at domone dot kolej dot mff dot cuni dot cz> <20130826162446 dot GZ20515 at brightrain dot aerifal dot cx> <20130827020414 dot GA8611 at domone dot kolej dot mff dot cuni dot cz> <20130827022042 dot GH20515 at brightrain dot aerifal dot cx> <20130827083048 dot GA12761 at domone dot kolej dot mff dot cuni dot cz> <20130827084500 dot GL20515 at brightrain dot aerifal dot cx>
On Tue, Aug 27, 2013 at 04:45:00AM -0400, Rich Felker wrote:
> On Tue, Aug 27, 2013 at 10:30:48AM +0200, Ondřej Bílka wrote:
> > > winning on random testcases is meaningless since it has quadratic
> > > performance in the worst-case.
> > > For strstr random input is always the
> > > easy case.
> >
> > How good is it on binary alphabet?
>
> Two-way is very good on binary alphabet, and on 4-letter alphabet
> (e.g. DNA). The bad character table optimization is rather useless on
> small alphabets, of course, but that's not a core part of the
> algorithm, just an optimization that helps achieve sublinear
> performance in many real-world usages (while maintaining worst-case
> linear performance).
>
> Note that having an optimized search for the next occurrence of the
> first character or two is also rather useless in a binary alphabet...
>
A binary alphabet tends to be the worst case for optimized implementations.
Most tricks do not apply there: most shifts tend to be 1 or 2 bytes, and
branches tend to be mispredicted. The performance impact of changing the
alphabet from binary to ternary in two-way is big.
> > > > > presumably because musl always uses the bad-character table). There
> > > > > may be additional logic considerations to integrating any optimized
> > > > > "next pair" search with the bad character table; I have not checked
> > > > > this.
> > > > >
> > > > Bad character is also heuristic that is not needed when pair heuristic
> > > > is used as it would only slow it down.
> > >
> > > Bad character table changes the typical runtime from O(n) to O(n/m).
> > > This is HUGE.
> >
> > You are making the basic mistake of arguing that some algorithm is
> > better in its best case.
>
> My above comments were comparing two-way with bad character table to
> plain two-way. Bad character table does not always help (it just helps
> in many typical real-world cases), but it never hurts more than a
> small O(m) (m = needle length) startup cost. In my implementation, I
> combine this with the operation of computing the needle length. Of
> course strlen is a faster O(m) operation by several times, so one
> could argue that the bad character table still "costs too much", but
> empirically I found my version of two-way (in musl) outperforming
> glibc's version of two-way (not the SSE code in glibc) on many
> reasonable inputs where the needle was 16-31 bytes.
>
> > > > > On Sun, Aug 25, 2013 at 11:07:58AM +0200, Ondřej Bílka wrote:
> > > > > > This could be solved by the standard buy-or-rent technique, which I
> > > > > > employ: for any position p there can be only p+64n+512 comparisons
> > > > > > before I switch to the two-way algorithm, which gives linear complexity.
> > >
> > > That's a reasonable solution, but I think it would be even better to
> > > integrate your SSE tricks with two-way.
> > >
> > It's not worth doing from my perspective, as the factorization step
> > costs around 20 cycles per needle byte. In that time I could often
> > determine three times over that there is no match.
>
> I think we'd need to see more empirical data to determine this. Your
> switchover threshold is not exactly cheap. If the intent is just to
> give good performance for "typical" (e.g. English text) uses of
> strstr, and otherwise only avoid pathologically bad performance that
> could be used for DoS, then your approach may be reasonable. On the
Intent was to get the best possible performance for typical cases and
reasonable performance for all cases.
> other hand, if you want to support arbitrary reasonable uses of strstr
> (like DNA searches in an entire genome), it's probably not adequate.
> But that's just my semi-educated guess. A real determination would
> require empirical measurement.
>
You cannot beat a specialized algorithm (e.g. DNA search) without supplying
contextual information to the algorithm. When you decide to switch based
only on data from the current run, you will slow the common case down.
A profile-guided optimization is possible, but it would need added
infrastructure to save the context info.