This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Rich Felker <dalias at libc dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Leonhard Holz <leonhard dot holz at web dot de>, libc-alpha at sourceware dot org
- Date: Mon, 29 Jun 2015 11:58:08 -0400
- Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <558EA828 dot 3080106 at web dot de> <20150628085753 dot GA4254 at domone>
On Sun, Jun 28, 2015 at 10:57:53AM +0200, Ondřej Bílka wrote:
> Also you don't need table access at all for utf8 unless somebody adds
> a locale with exotic characters. Just use utf8 codepoints. Here is how
> to calculate them for 1-3 byte sequences, which cover the first 65536
> indices; for 4-byte sequences use a trie.
>
>
> if (p[0] < 0x80) {
>   return p[0];
> } else if (p[0] < 0xE0) {
>   /* 2-byte sequence */
>   /* No need to check size due to the trailing 0. */
>   return (p[0] << 6) + p[1] - 0x3080;
> } else if (p[0] < 0xF0) {
>   /* 3-byte sequence */
>   if (size < 3) goto error;
>   return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> }
Maybe in the special case where you intend to use this it doesn't
matter, but in general this is unsafe because it assumes only legal
sequences appear.
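
To illustrate the concern, here is a minimal sketch of the same 1-3 byte
decoding with the checks an untrusted input would need. The name
decode_utf8_bmp and the -1 error convention are assumptions made for this
example, not anything from the patch or from glibc. It rejects truncated
input, bad continuation bytes, overlong forms, and surrogates, and it
treats 4-byte lead bytes as errors because it does not handle them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: decode one 1- to 3-byte UTF-8
   sequence starting at p, with size bytes available.  Returns the
   codepoint, or -1 for anything this sketch does not accept.  */
static int32_t
decode_utf8_bmp (const unsigned char *p, size_t size)
{
  if (size < 1)
    return -1;
  if (p[0] < 0x80)
    return p[0];                /* ASCII */
  if (p[0] < 0xC2)
    return -1;                  /* stray continuation or overlong lead */
  if (p[0] < 0xE0)
    {
      /* 2-byte sequence: one continuation byte of the form 10xxxxxx.  */
      if (size < 2 || (p[1] & 0xC0) != 0x80)
        return -1;
      return (p[0] << 6) + p[1] - 0x3080;
    }
  if (p[0] < 0xF0)
    {
      /* 3-byte sequence: two continuation bytes.  */
      if (size < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
      int32_t c = (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
      if (c < 0x800 || (c >= 0xD800 && c <= 0xDFFF))
        return -1;              /* overlong form or surrogate */
      return c;
    }
  return -1;                    /* 4-byte lead or invalid byte */
}
```

Without the continuation-byte checks, a lead byte followed by arbitrary
data would still produce some number, which is the unsafe case described
above.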
Is there a reason you even need to get codepoints? One of the
properties of UTF-8 is that codepoints sort in the same order as code
units (bytes), so a plain byte-wise comparison already gives codepoint
order.
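
A small sketch of that property, assuming both inputs are well-formed,
NUL-terminated UTF-8. The wrapper name utf8_codepoint_cmp is hypothetical.
It relies only on strcmp, which the C standard defines as comparing bytes
as unsigned char. This gives raw codepoint order, not locale collation.

```c
#include <string.h>

/* Hypothetical helper: compare two valid, NUL-terminated UTF-8 strings
   in codepoint order.  No decoding is needed, because lead bytes grow
   with sequence length and continuation bytes carry the remaining bits
   in most-significant-first order.  */
static int
utf8_codepoint_cmp (const char *a, const char *b)
{
  return strcmp (a, b);
}
```

For instance, "z" (U+007A) compares below U+00E9 (bytes C3 A9), which
compares below U+20AC (bytes E2 82 AC), which in turn compares below
U+10000 (bytes F0 90 80 80).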
Rich