This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale


On Mon, Jun 29, 2015 at 11:58:08AM -0400, Rich Felker wrote:
> On Sun, Jun 28, 2015 at 10:57:53AM +0200, Ondřej Bílka wrote:
> > Also you don't need table access at all for utf8 unless somebody adds
> > a locale with exotic characters. Just use utf8 codepoints. Here is how
> > to calculate them for 1-3 byte sequences, which cover the first 65536
> > indices; for 4-byte sequences use a trie.
> > 
> > 
> >   if (p[0] < 0x80) {
> >     return p[0];
> >   } else if (p[0] < 0xE0) {
> >     /* 2-byte sequence */
> >     /* No need to check size due to the trailing 0.  */
> >     return (p[0] << 6) + p[1] - 0x3080;
> >   } else if (p[0] < 0xF0) {
> >     /* 3-byte sequence */
> >     if (size < 3) goto error;
> >     return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> >   } 
> 
> Maybe in the special case where you intend to use this it doesn't
> matter, but in general this is unsafe because it assumes only legal
> sequences appear.
>
And you couldn't spend five minutes to check that, and cried wolf?
This is elementary: findidx doesn't return an error, so the result on an
illegal sequence is undefined and it doesn't matter. It can't crash; it
just produces a different invalid output.

Also, the callers fnmatch, regexec and strcoll don't return an error on an
invalid sequence.
Anyway, it couldn't be reliably detected without a big slowdown in strcoll,
as strdiff skips initial, possibly invalid, bytes.
 
> Is there a reason you even need to get codepoints? One of the
> properties of UTF-8 is that codepoints sort in the same order as code
> units.
> 
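The ordering property mentioned in the quoted text can be illustrated with a small sketch. The helper name utf8_codepoint_order is hypothetical and is not part of glibc; the sketch assumes both inputs are well-formed, NUL-terminated UTF-8 and relies only on strcmp comparing bytes as unsigned char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not glibc code): order two well-formed,
   NUL-terminated UTF-8 strings by code point.  No decoding is needed,
   because for valid UTF-8 the byte-wise order and the code point order
   agree.  Returns -1, 0 or 1.  */
static int
utf8_codepoint_order (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  /* strcmp compares the bytes as unsigned char, which is what makes
     lead bytes >= 0x80 sort after ASCII.  */
  int r = strcmp (a, b);
  return (r > 0) - (r < 0);
}
```

For example, "\xC3\xA9" (U+00E9) orders before "\xE2\x82\xAC" (U+20AC), matching the numeric order of the two code points. This says nothing about locale collation order, only about raw code point order.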
No special need for codepoints, but they make a nice perfect hash function.
As you have 16777216 three-byte sequences but just 65536 codepoints, it's
really noticeable.
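
A minimal standalone sketch of the index computation discussed above, for illustration only. The name utf8_index, the explicit size parameter and the -1 return value are assumptions of this sketch, not the findidx interface. Like the snippet in the thread, it does not validate continuation bytes, so for malformed input the returned value is meaningless (it may even be negative) and a caller indexing a table would have to range-check it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not glibc code): map the 1-3 byte UTF-8 sequence
   starting at p, with size bytes available, to its code point in the
   range 0..0xFFFF.  Returns -1 if the input is truncated or starts a
   4-byte sequence, which this sketch leaves to a separate lookup.  */
static int32_t
utf8_index (const unsigned char *p, size_t size)
{
  assert (p != NULL);
  if (size < 1)
    return -1;
  if (p[0] < 0x80)
    return p[0];                    /* 1-byte sequence (ASCII) */
  if (p[0] < 0xE0)
    {
      /* 2-byte sequence: 0x3080 == (0xC0 << 6) + 0x80 removes the
         marker bits of the lead and continuation bytes.  */
      if (size < 2)
        return -1;
      return ((int32_t) p[0] << 6) + p[1] - 0x3080;
    }
  if (p[0] < 0xF0)
    {
      /* 3-byte sequence: 0xE2080 == (0xE0 << 12) + (0x80 << 6) + 0x80.  */
      if (size < 3)
        return -1;
      return ((int32_t) p[0] << 12) + ((int32_t) p[1] << 6) + p[2] - 0xE2080;
    }
  return -1;                        /* 4-byte lead byte: not handled here */
}
```

With well-formed input, the bytes C3 A9 map to 0xE9 and the bytes E2 82 AC map to 0x20AC, so every 1-3 byte sequence lands in a dense 65536-entry index space instead of the 16777216 possible three-byte patterns.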

