This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale


On Sun, Jun 28, 2015 at 10:57:53AM +0200, Ondřej Bílka wrote:
> Also you don't need table access at all for UTF-8 unless somebody adds a
> locale with exotic characters. Just use UTF-8 codepoints. Here is how to
> calculate them for 1-3 byte sequences, covering the first 65536 indices;
> for 4-byte sequences use a trie.
> 
> 
>   if (p[0] < 0x80) {
>     return p[0];
>   } else if (p[0] < 0xE0) {
>     /* 2-byte sequence */
>     /* No need to check size due to the trailing 0.  */
>     return (p[0] << 6) + p[1] - 0x3080;
>   } else if (p[0] < 0xF0) {
>     /* 3-byte sequence */
>     if (size < 3) goto error;
>     return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
>   } 

Maybe in the special case where you intend to use this it doesn't
matter, but in general this is unsafe because it assumes only legal
sequences appear.
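
To make the concern concrete, here is a rough sketch of what a checked
version of the quoted arithmetic might look like. The name
utf8_decode_checked and its interface are hypothetical (nothing like this
is claimed to exist in glibc); it keeps the same bias constants as the
quoted code but rejects truncated input, bad continuation bytes, overlong
forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not glibc code.  Decode one UTF-8 sequence from
   p[0..size-1].  On success return the code point and store the number
   of bytes consumed in *len; on any ill-formed input return -1.  */
static int32_t
utf8_decode_checked (const unsigned char *p, size_t size, size_t *len)
{
  int32_t c;

  if (size == 0)
    return -1;
  if (p[0] < 0x80)
    {
      *len = 1;
      return p[0];
    }
  if (p[0] < 0xC2)
    return -1;  /* Stray continuation byte or overlong lead C0/C1.  */
  if (p[0] < 0xE0)
    {
      /* 2-byte sequence */
      if (size < 2 || (p[1] & 0xC0) != 0x80)
        return -1;
      *len = 2;
      return ((int32_t) p[0] << 6) + p[1] - 0x3080;
    }
  if (p[0] < 0xF0)
    {
      /* 3-byte sequence */
      if (size < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
      c = ((int32_t) p[0] << 12) + ((int32_t) p[1] << 6) + p[2] - 0xE2080;
      if (c < 0x800 || (c >= 0xD800 && c <= 0xDFFF))
        return -1;  /* Overlong form or UTF-16 surrogate.  */
      *len = 3;
      return c;
    }
  if (p[0] < 0xF5)
    {
      /* 4-byte sequence */
      if (size < 4 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80
          || (p[3] & 0xC0) != 0x80)
        return -1;
      c = ((int32_t) p[0] << 18) + ((int32_t) p[1] << 12)
          + ((int32_t) p[2] << 6) + p[3] - 0x3C82080;
      if (c < 0x10000 || c > 0x10FFFF)
        return -1;  /* Overlong form or beyond the Unicode range.  */
      *len = 4;
      return c;
    }
  return -1;  /* Lead bytes F5..FF never appear in UTF-8.  */
}
```

Without checks of this kind, a lone continuation byte such as 0x80 falls
into the 2-byte branch of the quoted code and yields an arbitrary value.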

Is there a reason you even need to get codepoints? One of the
properties of UTF-8 is that codepoints sort in the same order as code
units.
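
As a small illustration of that property (a sketch only; the wrapper name
utf8_codepoint_cmp is made up for this example): for well-formed UTF-8,
a plain byte comparison already orders strings by code point, because
strcmp compares bytes as unsigned char. This gives code point order, not
locale collation order, and says nothing about ill-formed input.

```c
#include <string.h>

/* Hypothetical wrapper: compare two NUL-terminated, well-formed UTF-8
   strings in code point order.  No decoding is needed, since strcmp
   compares the bytes as unsigned char and UTF-8 preserves code point
   order under that comparison.  */
static int
utf8_codepoint_cmp (const char *a, const char *b)
{
  return strcmp (a, b);
}
```

For example, "z" (U+007A), "\xC3\xA9" (U+00E9), "\xE2\x82\xAC" (U+20AC)
and "\xF0\x9F\x98\x80" (U+1F600) compare in that order, matching their
code point values.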

Rich

