This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [RFC] Statistics of non-ASCII characters in strings

From: Rich Felker <dalias at libc dot org>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: Wilco Dijkstra <wdijkstr at arm dot com>, libc-alpha at sourceware dot org
Date: Tue, 23 Dec 2014 14:25:16 -0500
Subject: Re: [RFC] Statistics of non-ASCII characters in strings
Authentication-results: sourceware.org; auth=none
References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <20141223104421 dot GA17643 at domone>

On Tue, Dec 23, 2014 at 11:44:21AM +0100, Ondřej Bílka wrote:
> On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> > Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> > it's feasible to make many string functions faster if they are predominantly ASCII by using a
> > different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> > would be well worth optimizing for it.
> > 
> I do not know, as it depends on the encoding; you could collect that
> percentage from the strcoll benchmark.
> 
> For string functions, the ASCII/non-ASCII percentage alone is not enough;
> a more refined statistic will tell you much more.
> 
> For strlen you only need to know the probability of byte 128, which is
> quite small in practice.

If it's that optimization, note that in UTF-8 all characters of the
bit form xxxx000000xxxxxx contain byte 128. There actually aren't many
languages made up entirely of such characters; apparently only
Burmese/Myanmar. Otherwise it's mostly punctuation and a small portion
of CJK characters. Of course characters where 128 is the low byte also
appear once every 64 positions throughout Unicode, and there are
non-PUA characters.
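
To make the byte-128 cases concrete, here is a minimal C sketch that
reports whether the UTF-8 encoding of a code point contains the byte 0x80.
The function name utf8_contains_0x80 is made up for illustration, and the
sketch assumes the input is a scalar value (it does not reject surrogates).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the UTF-8 encoding of cp contains 0x80.
   Lead bytes are never 0x80, so only continuation bytes are examined. */
bool utf8_contains_0x80(uint32_t cp)
{
    if (cp < 0x80)
        return false;           /* encoded as a single byte 0x00..0x7F */
    if (cp > 0x10FFFF)
        return false;           /* outside the Unicode range */

    /* Number of continuation bytes: 1, 2 or 3. */
    int ncont = cp < 0x800 ? 1 : cp < 0x10000 ? 2 : 3;

    for (int i = 0; i < ncont; i++) {
        /* Each continuation byte is 0x80 | (one 6-bit group of cp),
           so it equals 0x80 exactly when that group is all zeros. */
        if (((cp >> (6 * i)) & 0x3F) == 0)
            return true;
    }
    return false;
}
```

For example, this returns true for U+1000 (E1 80 80) and U+2019
(E2 80 99), and false for U+00E9 (C3 A9).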

Still I think you'd risk making things slower with this optimization.
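
For reference, a minimal sketch of one word-at-a-time check whose only
false positive is byte 128. I am assuming this is the kind of check under
discussion; the thread does not spell out the formula. All names here are
hypothetical, and sketch_strnlen takes an explicit buffer size only so that
the 8-byte loads stay inside the buffer in portable C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 (ONES * 0x7F)      /* 0x7F in every byte */
#define HIGH (ONES * 0x80)      /* 0x80 in every byte */

/* Cheap check: nonzero if some byte of x is 0x00 or 0x80.
   The per-byte sum is at most 0xFE, so no carry crosses a byte. */
uint64_t maybe_zero_byte(uint64_t x)
{
    return ~((x & LOW7) + LOW7) & HIGH;
}

/* Exact check: nonzero only if some byte of x is 0x00. */
uint64_t has_zero_byte(uint64_t x)
{
    return maybe_zero_byte(x) & ~x;
}

/* Length of the string in s, or cap if no terminator is found in the
   first cap bytes. All cap bytes must be readable. */
size_t sketch_strnlen(const char *s, size_t cap)
{
    size_t i = 0;

    while (i + 8 <= cap) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);
        if (maybe_zero_byte(w)) {
            /* Slow path: a real terminator, or a false hit on 0x80. */
            for (size_t j = 0; j < 8; j++)
                if (s[i + j] == '\0')
                    return i + j;
        }
        i += 8;
    }
    while (i < cap && s[i] != '\0')
        i++;
    return i;
}
```

Every word containing a byte 0x80 takes the slow path without finding a
terminator, which is where text such as Burmese would pay for the cheaper
fast path.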

> For strchr it's trickier, as you need to know the x/x+128 pair probability
> along with 0/128. Here the fact that x varies is an advantage: for most
> pairs that ratio is small, so the weighted average will be limited.
> You cannot have 11 characters each occurring with 10% probability.

Is there any trivial transformation so that the affected byte would be
255 rather than 128? I ask because byte 255 will never appear in UTF-8
so it would not matter except for non-text strings (which are still a
valid usage of string functions) or people running legacy encodings
(and IMO these should be deprecated and not considered a performance
criterion).
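
The claim about byte 255 is easy to check by brute force. The sketch below
encodes every Unicode scalar value and records the largest byte value
produced; utf8_encode and max_utf8_byte are names made up for this
illustration, not functions from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: writes the UTF-8 form of cp into out and returns
   the number of bytes, or 0 if cp is a surrogate or out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Largest byte value that occurs in the encoding of any scalar value. */
unsigned max_utf8_byte(void)
{
    unsigned max = 0;
    unsigned char buf[4];

    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; i++)
            if (buf[i] > max)
                max = buf[i];
    }
    return max;
}
```

max_utf8_byte() returns 0xF4, the lead byte used for U+100000 through
U+10FFFF, so bytes 0xF5 through 0xFF (including 255) never occur in
well-formed UTF-8.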

> 
> The ASCII/non-ASCII ratio would help to estimate strcasecmp
> performance. Here the implementation already assumes that it's dealing
> with ASCII; when it needs to convert non-ASCII it will be slow no matter
> what you do.
> 
> I have a generic C implementation of strlen using that trick; I will send
> it.

I'd be interested in seeing it.

Rich

References:
- [RFC] Statistics of non-ASCII characters in strings
  - From: Wilco Dijkstra
- Re: [RFC] Statistics of non-ASCII characters in strings
  - From: Ondřej Bílka