This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: 'Rich Felker' <dalias at libc dot org>, libc-alpha at sourceware dot org
- Date: Mon, 5 Jan 2015 19:29:37 +0100
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- Authentication-results: sourceware.org; auth=none
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <E1Y34Yu-0004LC-KH at fencepost dot gnu dot org> <A610E03AD50BFC4D95529A36D37FA55E38C7897324 at GEORGE dot Emea dot Arm dot com> <20141222225044 dot GU4574 at brightrain dot aerifal dot cx> <A610E03AD50BFC4D95529A36D37FA55E38C7897326 at GEORGE dot Emea dot Arm dot com> <20141223002521 dot GV4574 at brightrain dot aerifal dot cx> <001701d028fd$f72a5d10$e57f1730$ at com>
On Mon, Jan 05, 2015 at 03:40:42PM -0000, Wilco Dijkstra wrote:
> > Rich Felker wrote:
> > On Mon, Dec 22, 2014 at 11:53:39PM +0000, Wilco Dijkstra wrote:
> > > Rich Felker wrote:
> > > > It's not quite clear to me from your reply, but I get the impression
> > > > you're comparing an ASCII-optimized strlen to a non-optimized one,
> > > > rather than comparing it to the alternate optimization that also works
> > > > for non-ASCII bytes. This is a standard implementation trade-off, not
> > > > Aarch64-specific, and in general the right answer is to use the
> > > > slightly more expensive code that works for all bytes rather than the
> > > > version that has to take a slow-path when it gets a false-positive nul
> > > > terminator on non-ASCII bytes.
> > >
> > > No the comparison is between the existing optimized version (which
> > > already processes 16 characters per iteration) and an even more optimized
> > > version. Some of the speedup is due to my trick to process ASCII strings faster.
> > > Non-ascii strings are still at least as fast as the original version (usually much
> > > faster), so it's definitely not a slow path, but rather a fast path plus an
> > > extremely fast path.
> >
> > Can you clarify how you're doing this? Since you can't know a priori
> > whether the data will be ASCII or not, I don't understand how you can
> > first use a test that gives the wrong result for non-ASCII and then a
> > secondary test to get the right result without hurting performance for
> > non-ASCII.
>
> I'll post my patch later, but the fast check is a subset of the slower,
> more accurate check, so if the fast check fails it has executed just 3
> extra instructions compared to doing only the accurate check. If the
> accurate check detects no zeroes, I use a 2nd loop as a fallback for
> non-ASCII strings. This only uses the accurate check so that you don't
> pay this overhead more than once per call.
>
> So if we get mostly ASCII strings we always use the fastest case.
> If we get mostly non-ASCII strings we only pay the cost of ~1 cycle for
> the additional instructions (which is more than made up for by various
> other improvements, so we're never worse off than the previous version).
>
> That means the only hard case for the branch predictor is if there are
> a similar number of ASCII and non-ASCII strings and they happen randomly.
> If the figure of >98% strings being ASCII
> (http://sourceware.org/ml/libc-alpha/2014-12/msg00879.html) is true for
> GLIBC string functions too then all is well.
>
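The scheme Wilco describes above (a cheap check that is a subset of the accurate zero-byte check, plus a fallback loop for non-ASCII data) can be sketched in scalar C roughly as follows. This is only an illustration of the idea, not the actual AArch64 patch; the name `sketch_strlen`, the 8-byte word size, and the classic `(v - 0x01..) & ~v & 0x80..` zero-byte test are assumptions. As in real strlen implementations, aligned word reads may run a few bytes past the terminator.

```c
#include <stdint.h>
#include <stddef.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

static size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Byte-wise prologue until the pointer is 8-byte aligned. */
    while ((uintptr_t)p % 8 != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const uint64_t *w = (const uint64_t *)(const void *)p;
    for (;;) {
        uint64_t v = *w;
        /* Fast check: fires on a zero byte, but also (falsely) on some
           non-ASCII bytes.  One cheap test covers the common ASCII case. */
        if ((v - ONES) & HIGHS) {
            /* Accurate check: fires only on a genuine zero byte. */
            if ((v - ONES) & ~v & HIGHS)
                break;
            /* False positive, so the data is non-ASCII: fall back to a
               loop that uses only the accurate check, paying the switch
               cost at most once per call. */
            do
                v = *++w;
            while (!((v - ONES) & ~v & HIGHS));
            break;
        }
        w++;
    }

    /* Locate the zero byte inside the final word (endianness-neutral). */
    p = (const char *)w;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

Note how the accurate check is just the fast check plus the `& ~v` term, which is why a fast-check false positive costs only a few extra instructions before the fallback takes over.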
That is a misunderstanding of the statistics; Rich is wrong here, as this
could harm performance in more than just theoretical worst-case scenarios.
There are two relevant factors. First, most strings are short.
Second, there are not two kinds of string, pure ASCII and pure non-ASCII;
rather, each workload has some probability that a given character is
non-ASCII. Say you have a language where every 64th character is accented.
That makes the switch an unpredictable branch, so you will pay a big
overhead. If you work with a file whose lines are under 80 characters, it
is unlikely that a line contains two accents.
> > Of course the standard filesystem paths are the same, but the result
> > of things like strlen("/usr/local/bin") is not the important case.
> > What's much more likely to matter is string operations encountered
> > processing symbol tables/relocations, parsing source code or
> > text-based data, etc. These are cases where you're going to encounter
> > a lot of ASCII-only strings regardless of the user's language.
>
> A fast parser/scanner typically does not use any library functions.
> Last time I wrote a scanner, I computed the length, hash and whether a
> character was a valid identifier char using just 4 instructions per
> character.
>
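The kind of scanner inner loop Wilco mentions might look roughly like the sketch below (hypothetical, not his code): a 256-entry table classifies identifier characters while the length and a simple multiplicative hash are accumulated, a handful of operations per character (load, table lookup, multiply-add, increment).

```c
#include <stddef.h>

/* Table marking [A-Za-z0-9_] as identifier characters. */
static unsigned char is_ident[256];

static void init_ident_table(void)
{
    for (int c = 'a'; c <= 'z'; c++) is_ident[c] = 1;
    for (int c = 'A'; c <= 'Z'; c++) is_ident[c] = 1;
    for (int c = '0'; c <= '9'; c++) is_ident[c] = 1;
    is_ident['_'] = 1;
}

/* Scan one identifier starting at s: returns its length and stores a
   simple multiplicative hash of its bytes in *hash_out. */
static size_t scan_ident(const char *s, unsigned *hash_out)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned h = 0;
    while (is_ident[*p]) {   /* table lookup doubles as the loop test */
        h = h * 31 + *p;     /* one multiply-add per character */
        p++;
    }
    *hash_out = h;
    return (size_t)(p - (const unsigned char *)s);
}
```

Such a loop never calls strlen or strcmp at all, which is Wilco's point: a tuned scanner gets length, hash, and classification in one pass.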
If one uses a dedicated parser it should not be a problem. I am more
worried about use cases like reading config files, code that the
programmer does not write with performance in mind.