This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 30 Jun 2015 07:52:21 +0200
- Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <558EA828 dot 3080106 at web dot de> <20150628085753 dot GA4254 at domone>
> Just use a trie. You would avoid slow lists by precomputing it so you
> could do the following:
>
> struct trie
> {
> int32_t index[256];
> };
>
> int32_t
> findidx (unsigned char *p, struct trie *start)
> {
> struct trie *t = start;
> while (1)
> {
> unsigned char c = *p++;
> if (t->index[c] >= 0)
> return t->index[c];
> else
> t = start - t->index[c];
> }
> }
>
>
I agree, but that goes way beyond the scope of this patch. I suggest using the
given patch to fix the regression and refactoring the data structure as a
follow-up.
> Also you don't need table access at all for utf8 unless somebody adds a
> locale with exotic characters. Just use utf8 codepoints. Here is how to
> calculate them for 1-3 byte sequences for the first 65536 indices; for
> 4-byte sequences use the trie.
>
>
> if (p[0] < 0x80) {
> return p[0];
> } else if (p[0] < 0xE0) {
> /* 2-byte sequence */
> /* No need to check size due to the trailing 0. */
> return (p[0] << 6) + p[1] - 0x3080;
> } else if (p[0] < 0xF0) {
> /* 3-byte sequence */
> if (size < 3) goto error;
> return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> }
>
>
Actually the code does convert the UTF-8 sequences to codepoints and uses them
as a hash key, ignoring the upper bytes in the four-byte UTF-8 case:
+ if ((index & 0xE0) == 0xC0)
+ {
+ if (len < 2)
+ goto utf8_error;
+ uint16_t byte2 = (*cpp)[1];
+ index = ((index & 0x1F) << 6) | (byte2 & 0x3F);
+ }
+ else if ((index & 0xF0) == 0xE0)
+ {
+ if (len < 3)
+ goto utf8_error;
+ uint16_t byte2 = (*cpp)[1];
+ uint16_t byte3 = (*cpp)[2];
+ index = ((index & 0xF) << 12) | ((byte2 & 0x3F) << 6) |
+ (byte3 & 0x3F);
+ }
+ else if ((index & 0xF8) == 0xF0)
+ {
+ if (len < 4)
+ goto utf8_error;
+ uint16_t byte3 = (*cpp)[2];
+ uint16_t byte4 = (*cpp)[3];
+ index = ((index & 0xF) << 12) | ((byte3 & 0x3F) << 6) |
+ (byte4 & 0x3F);
+ }
+ else
+ {
+ utf8_error:
+ *cpp += 1;
+ return 0;
+ }
+ }
Maybe my conversion method is a bit "academic", but I think it is a reliable one.