This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 30 Jun 2015 07:52:21 +0200
- Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <558EA828 dot 3080106 at web dot de> <20150628085753 dot GA4254 at domone>
> Just use a trie. You would avoid slow lists by precomputing it so you
> could do the following:
>
> struct trie
> {
> int32_t index[256];
> };
>
> int32_t
> findidx (unsigned char *p, struct trie *start)
> {
> struct trie *t = start;
> while (1)
> {
> unsigned char c = *p++;
> if (t->index[c] >= 0)
> return t->index[c];
> else
> t = start - t->index[c];
> }
> }
>
>
I agree, but that goes way beyond the scope of this patch. I suggest using the
given patch to fix the regression and refactoring the data structure as a
follow-up.
> Also you don't need table access at all for utf8 unless somebody adds a
> locale with exotic characters. Just use utf8 codepoints. Here is how to
> calculate them for 1-3 byte sequences for the first 65536 indices; for
> 4-byte sequences use the trie.
>
>
> if (p[0] < 0x80) {
> return p[0];
> } else if (p[0] < 0xE0) {
> /* 2-byte sequence */
> /* No need to check size due to the trailing 0. */
> return (p[0] << 6) + p[1] - 0x3080;
> } else if (p[0] < 0xF0) {
> /* 3-byte sequence */
> if (size < 3) goto error;
> return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> }
>
>
Actually the code does convert the UTF-8 sequences to codepoints and uses them
as a hash key, ignoring the upper bytes in the four-byte UTF-8 case:
+ if ((index & 0xE0) == 0xC0)
+ {
+ if (len < 2)
+ goto utf8_error;
+ uint16_t byte2 = (*cpp)[1];
+ index = ((index & 0x1F) << 6) | (byte2 & 0x3F);
+ }
+ else if ((index & 0xF0) == 0xE0)
+ {
+ if (len < 3)
+ goto utf8_error;
+ uint16_t byte2 = (*cpp)[1];
+ uint16_t byte3 = (*cpp)[2];
+ index = ((index & 0xF) << 12) | ((byte2 & 0x3F) << 6) |
+ (byte3 & 0x3F);
+ }
+ else if ((index & 0xF8) == 0xF0)
+ {
+ if (len < 4)
+ goto utf8_error;
+ uint16_t byte3 = (*cpp)[2];
+ uint16_t byte4 = (*cpp)[3];
+ index = ((index & 0xF) << 12) | ((byte3 & 0x3F) << 6) |
+ (byte4 & 0x3F);
+ }
+ else
+ {
+ utf8_error:
+ *cpp += 1;
+ return 0;
+ }
+ }
Maybe my conversion method is a bit "academic", but I think it is a reliable one.