This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale

From: Rich Felker <dalias at libc dot org>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: Leonhard Holz <leonhard dot holz at web dot de>, libc-alpha at sourceware dot org
Date: Mon, 29 Jun 2015 11:58:08 -0400
Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
Authentication-results: sourceware.org; auth=none
References: <558EA828 dot 3080106 at web dot de> <20150628085753 dot GA4254 at domone>

On Sun, Jun 28, 2015 at 10:57:53AM +0200, Ondřej Bílka wrote:
> Also, you don't need table access at all for UTF-8 unless somebody adds
> a locale with exotic characters. Just use UTF-8 codepoints. Here is how
> to calculate them for 1-3 byte sequences, which cover the first 65536
> indices; for 4-byte sequences, use a trie.
> 
> 
>   if (p[0] < 0x80) {
>     return p[0];
>   } else if (p[0] < 0xE0) {
>     /* 2-byte sequence */
>     /* No need to check size due trailing 0.  */
>     return (p[0] << 6) + p[1] - 0x3080;
>   } else if (p[0] < 0xF0) {
>     /* 3-byte sequence */
>     if (size < 3) goto error;
>     return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
>   } 

Maybe in the special case where you intend to use this it doesn't
matter, but in general this is unsafe because it assumes only legal
sequences appear.
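
To illustrate what "legal sequences only" leaves out, here is a minimal
sketch of a checked variant of the quoted decoder. The function name
decode_utf8_checked and the -1 error convention are assumptions made for
this example, not part of the patch under discussion. It keeps the same
arithmetic but rejects truncated input, bad continuation bytes, overlong
forms, and surrogates; 4-byte sequences are simply reported as errors
here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence of at most 3 bytes
   starting at p, with size bytes available.  Returns the codepoint,
   or -1 for illegal, truncated, or 4-byte input.  */
static int32_t
decode_utf8_checked (const unsigned char *p, size_t size)
{
  if (size == 0)
    return -1;
  if (p[0] < 0x80)
    return p[0];
  if (p[0] < 0xC2)
    return -1;  /* stray continuation byte, or overlong lead C0/C1 */
  if (p[0] < 0xE0)
    {
      /* 2-byte sequence: check length and continuation byte.  */
      if (size < 2 || (p[1] & 0xC0) != 0x80)
        return -1;
      return (p[0] << 6) + p[1] - 0x3080;
    }
  if (p[0] < 0xF0)
    {
      /* 3-byte sequence.  */
      if (size < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
      int32_t cp = (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
      if (cp < 0x800)
        return -1;  /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;  /* UTF-16 surrogate, not a valid scalar value */
      return cp;
    }
  return -1;  /* 4-byte lead or invalid byte: not handled in this sketch */
}
```

The unchecked version would return a plausible-looking value for input
such as the bytes C3 41, where the second byte is not a continuation
byte; the checked version reports an error instead.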

Is there a reason you even need to get codepoints? One of the
properties of UTF-8 is that codepoints sort in the same order as code
units.
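
A small sketch of what that property buys you, assuming both inputs are
well-formed, NUL-terminated UTF-8. The helper name utf8_codepoint_cmp is
made up for this example. Since strcmp compares bytes as unsigned char,
its result already reflects codepoint order, so no decoding is needed.

```c
#include <string.h>

/* Hypothetical helper: three-way comparison of two well-formed,
   NUL-terminated UTF-8 strings in codepoint order.  Returns -1, 0,
   or 1.  No decoding is done; the byte order is the codepoint order.  */
static int
utf8_codepoint_cmp (const char *a, const char *b)
{
  int r = strcmp (a, b);
  return (r > 0) - (r < 0);
}
```

For example, "z" (U+007A, byte 7A) sorts before "\xC3\xA9" (U+00E9),
which sorts before "\xE2\x82\xAC" (U+20AC), which sorts before
"\xF0\x90\x8D\x88" (U+10348), exactly as the lead bytes 7A, C3, E2, F0
suggest. The guarantee does not extend to ill-formed input such as
overlong encodings.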

Rich

Follow-Ups:
- Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
  - From: Ondřej Bílka

References:
- [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
  - From: Leonhard Holz
- Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
  - From: Ondřej Bílka
