This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] Statistics of non-ASCII characters in strings
- From: Rich Felker <dalias at libc dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Wilco Dijkstra <wdijkstr at arm dot com>, libc-alpha at sourceware dot org
- Date: Tue, 23 Dec 2014 14:25:16 -0500
- Subject: Re: [RFC] Statistics of non-ASCII characters in strings
- Authentication-results: sourceware.org; auth=none
- References: <001401d01df6$0f7cc5a0$2e7650e0$ at com> <20141223104421 dot GA17643 at domone>
On Tue, Dec 23, 2014 at 11:44:21AM +0100, Ondřej Bílka wrote:
> On Mon, Dec 22, 2014 at 02:46:24PM -0000, Wilco Dijkstra wrote:
> > Does anyone have statistics of how often strings contain non-ASCII characters? I'm asking because
> > it's feasible to make many string functions faster if they are predominantly ASCII by using a
> > different check for the null byte. So if say 80-90% of strings in strcpy/strlen are ASCII then it
> > would be well worth optimizing for it.
> >
> I do not know, as it depends on the encoding; you could collect that
> percentage from the strcoll benchmark.
>
> For string functions, just the ASCII/non-ASCII percentage is not enough;
> more refined statistics will tell you much more.
>
> For strlen you only need to know the probability of byte 128, which is
> quite small in practice.
If it's that optimization, note that in UTF-8 all characters of the
bit form xxxx000000xxxxxx contain byte 128. There actually aren't many
languages made up entirely of such characters; apparently only
Burmese/Myanmar. Otherwise it's mostly punctuation and a small portion
of CJK characters. Of course, characters where 128 is the low byte also
appear once every 64 positions throughout Unicode, and these include
non-PUA characters.
Still I think you'd risk making things slower with this optimization.
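
To make the byte-128 point concrete, here is a minimal sketch in C. The
inexact check shown (has_zero_or_128) is only an assumption about what
"that optimization" looks like: it is one word-at-a-time test whose sole
false positive is byte 0x80, not necessarily the one proposed in this
thread. The helper names (load8, utf8_encode3) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Load 8 bytes into a word; byte order does not matter for these tests. */
static uint64_t load8(const void *p)
{
    uint64_t x;
    memcpy(&x, p, sizeof x);
    return x;
}

/* Classic exact test: nonzero iff some byte of x is 0. */
static uint64_t has_zero_exact(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Assumed inexact test: nonzero iff some byte of x is 0 or 0x80.
   A caller would confirm a hit byte by byte before trusting it. */
static uint64_t has_zero_or_128(uint64_t x)
{
    return ((x - ONES) ^ x) & HIGHS;
}

/* UTF-8 encoding of a code point in U+0800..U+FFFF (three bytes).
   The middle byte is 0x80 exactly when bits 6..11 of cp are zero,
   i.e. when cp has the bit form xxxx000000xxxxxx. */
static void utf8_encode3(unsigned cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

With this, U+1000 (MYANMAR LETTER KA) encodes as E1 80 80 and U+2014
(EM DASH) as E2 80 94, so both trip the inexact test, while U+4E2D
encodes as E4 B8 AD and does not.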
> For strchr it's more tricky, as you need to know the x/x+128 pair
> probability along with 0/128. Here the fact that x varies is an advantage:
> for most pairs that ratio is small, so the weighted average will be limited.
> You cannot have 11 characters each occurring with 10% probability.
Is there any trivial transformation so that the affected byte would be
255 rather than 128? I ask because byte 255 will never appear in UTF-8
so it would not matter except for non-text strings (which are still a
valid usage of string functions) or people running legacy encodings
(and IMO these should be deprecated and not considered a performance
criterion).
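
For reference, the x / x+128 pairing in the quoted strchr paragraph can be
sketched in C as follows. This assumes the same kind of inexact word test
(false positive only for byte 0x80); after XORing with the search byte c,
that false positive moves to c ^ 0x80, which is c + 128 modulo 256. The
function names are hypothetical and this is not the implementation
mentioned in the thread.

```c
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Assumed inexact test: nonzero iff some byte of x is 0 or 0x80. */
static uint64_t zero_or_128(uint64_t x)
{
    return ((x - ONES) ^ x) & HIGHS;
}

/* Nonzero iff some byte of the 8-byte word at p is c or c ^ 0x80.
   XOR with c in every byte lane turns c into 0 and c ^ 0x80 into 0x80. */
static uint64_t maybe_contains(const unsigned char *p, unsigned char c)
{
    uint64_t x;
    memcpy(&x, p, sizeof x);
    return zero_or_128(x ^ (ONES * c));
}

/* Exact confirmation for a flagged word: index of the first c, or -1. */
static int find_exact(const unsigned char *p, unsigned char c)
{
    for (int i = 0; i < 8; i++)
        if (p[i] == c)
            return i;
    return -1;
}
```

A flagged word costs an extra exact pass, so the slowdown depends on how
often c ^ 0x80 occurs in the data for the particular c being searched.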
>
> The ASCII/non-ASCII ratio would help to estimate strcasecmp
> performance. Here the implementation already assumes that it's dealing
> with ASCII; when it needs to convert non-ASCII it will be slow no matter
> what you do.
>
> I have a generic C implementation of strlen using that trick; I will send
> it.
I'd be interested in seeing it.
Rich