This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Leonhard Holz <leonhard dot holz at web dot de>
- Cc: libc-alpha at sourceware dot org
- Date: Sun, 28 Jun 2015 10:57:53 +0200
- Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <558EA828 dot 3080106 at web dot de>
On Sat, Jun 27, 2015 at 03:42:00PM +0200, Leonhard Holz wrote:
> In BZ #18441, sorting a Thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that
>
> a) en_US.UTF-8 has no information for Thai characters and so always reports a zero
> sort weight, which causes the comparison to check the whole string instead of
> breaking off early, and
>
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters, as
> they tend to share the same starting byte (e.g. all Thai characters start with E0).
>
> The approach of the patch is to interpret TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into
> one byte, but that worsened the overall performance a lot. Enhancing the table to
> 2-byte keys works much better while needing a reasonable amount of extra memory.
>
> The patch vastly improves the performance of languages with multibyte characters
> (see zh_CN and hi_IN below). A side effect is that languages with one-byte
> characters get a bit slower because of the extra check for the first byte while
> finding the right sequence in the sequence list. This cannot be avoided, since the
> hash key is no longer equal to the first byte of the sequence. Tests are ok.
>
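The degenerate bucketing in (b) is easy to demonstrate: every character in the
Thai block U+0E01..U+0E5B encodes to a 3-byte UTF-8 sequence whose first byte is
0xE0, so partitioning sequences by their first byte puts the whole alphabet into
one bucket. A minimal sketch (the encoder below is mine, for illustration only,
and assumes code points in the 3-byte range U+0800..U+FFFF):

```c
#include <assert.h>

/* Encode a code point in the 3-byte UTF-8 range (U+0800..U+FFFF).
   Illustration only; no validation of the input range is done.  */
static void
encode_utf8_3byte (unsigned int cp, unsigned char out[3])
{
  out[0] = 0xE0 | (cp >> 12);          /* leading byte: 1110xxxx */
  out[1] = 0x80 | ((cp >> 6) & 0x3F);  /* continuation: 10xxxxxx */
  out[2] = 0x80 | (cp & 0x3F);         /* continuation: 10xxxxxx */
}

/* Check that the entire Thai block shares the leading byte 0xE0.  */
static int
thai_shares_first_byte (void)
{
  unsigned char buf[3];
  for (unsigned int cp = 0x0E01; cp <= 0x0E5B; cp++)
    {
      encode_utf8_3byte (cp, buf);
      if (buf[0] != 0xE0)
        return 0;
    }
  return 1;
}
```

For example, U+0E01 (Thai ko kai) encodes to E0 B8 81, and every other Thai
character lands in the same 0xE0 bucket.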
No, it can be avoided. With this patch you have only partially fixed the real
problem: the data structures used in findidx are terrible.
Just use a trie. You would avoid the slow lists by precomputing it, so you
could do the following:
struct trie
{
  int32_t index[256];
};

int32_t
findidx (const unsigned char *p, const struct trie *start)
{
  const struct trie *t = start;
  while (1)
    {
      unsigned char c = *p++;
      if (t->index[c] >= 0)
        /* Nonnegative entry: the final weight index.  */
        return t->index[c];
      else
        /* Negative entry: minus the offset of the child node.  */
        t = start - t->index[c];
    }
}
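To make the lookup concrete, here is a self-contained sketch with a tiny
hand-built two-node table. The names and table contents are mine, not glibc's;
a real table would also need an explicit encoding for absent sequences, which
this sketch glosses over by letting zero-filled entries mean index 0:

```c
#include <stdint.h>

/* Flat array of trie nodes: a nonnegative entry is a final weight
   index, a negative entry is minus the array offset of the child
   node for that byte.  Sketch only; entries not listed default to 0.  */
struct trie
{
  int32_t index[256];
};

static const struct trie demo[2] = {
  /* Root: 'a' resolves directly; byte 0xE0 continues in demo[1].  */
  [0] = { .index = { ['a'] = 42, [0xE0] = -1 } },
  /* Child node for sequences starting with 0xE0.  */
  [1] = { .index = { [0xB8] = 7 } },
};

static int32_t
findidx_trie (const unsigned char *p, const struct trie *start)
{
  const struct trie *t = start;
  while (1)
    {
      unsigned char c = *p++;
      if (t->index[c] >= 0)
        return t->index[c];      /* final weight index */
      t = start - t->index[c];   /* negative: jump to child node */
    }
}
```

With this table, "a" resolves in one step to 42, and the two-byte prefix
E0 B8 takes one extra hop through demo[1] to reach 7; each hop is a single
array access, so a k-byte sequence costs at most k indexed loads with no
list traversal.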
Also you don't need table access at all for UTF-8 unless somebody adds a
locale with exotic characters. Just use the UTF-8 code points directly. Here is
how to calculate them for 1-3 byte sequences, covering the first 65536 indices;
for 4-byte sequences use the trie.
if (p[0] < 0x80) {
  /* 1-byte sequence (ASCII).  */
  return p[0];
} else if (p[0] < 0xE0) {
  /* 2-byte sequence */
  /* No need to check size due to trailing 0.  */
  return (p[0] << 6) + p[1] - 0x3080;
} else if (p[0] < 0xF0) {
  /* 3-byte sequence */
  if (size < 3) goto error;
  return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
}
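A self-contained version of that computation, with explicit size checks and
leading-byte validation that the snippet above only hints at (the function name
is mine, not glibc's):

```c
#include <stddef.h>
#include <stdint.h>

/* Map a valid 1-3 byte UTF-8 sequence of at most `size` bytes to its
   code point; return -1 for 4-byte or malformed input (which would
   fall back to the trie).  Sketch only, not glibc's findidx.  */
static int32_t
utf8_to_index (const unsigned char *p, size_t size)
{
  if (size >= 1 && p[0] < 0x80)
    /* 1-byte sequence (ASCII).  */
    return p[0];
  else if (size >= 2 && p[0] >= 0xC2 && p[0] < 0xE0)
    /* 2-byte sequence: subtract the 0xC0/0x80 tag bits in one go.  */
    return (p[0] << 6) + p[1] - 0x3080;
  else if (size >= 3 && p[0] >= 0xE0 && p[0] < 0xF0)
    /* 3-byte sequence: subtract the 0xE0/0x80/0x80 tag bits.  */
    return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
  return -1;
}
```

The magic constants fold the UTF-8 tag bits into a single subtraction: for a
2-byte sequence, (0xC0 << 6) + 0x80 = 0x3080, and for a 3-byte sequence,
(0xE0 << 12) + (0x80 << 6) + 0x80 = 0xE2080.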