This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale

From: Carlos O'Donell <carlos at redhat dot com>
To: Leonhard Holz <leonhard dot holz at web dot de>, GNU C Library <libc-alpha at sourceware dot org>
Date: Tue, 29 Mar 2016 12:12:01 -0400
Subject: Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
Authentication-results: sourceware.org; auth=none
References: <56D3F8F0 dot 8070401 at web dot de>
On 02/29/2016 02:53 AM, Leonhard Holz wrote:
> In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that

I think we need a V5 with a few more changes as noted below:
- More comments.
- Change signature of utf8index.
- Probably one more define for magic 255 constant.

> a) en_US.UTF-8 has no informations for thai chars and so always reports a zero
> sort weight which causes the comparison to check the whole string instead of
> breaking up early and

OK.
 
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
> they tend to have an equal starting byte (e.g. all thai chars start with E0).

Agreed.

> The approach of the patch is to interprete TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte but that worsened the overall performance a lot. Enhancing the table to 2
> byte keys works much better while needing a reasonable amount of extra memory.

How much memory on average?

Worst case is all possible 2-byte hash values?

e.g. 256*256*(sizeof(struct element_t)) == ~10MB?

Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?

> The patch vastly improves the performance of languages with multibyte chars (see
> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
> get a bit slower because of the extra check for the first byte while finding the right
> sequence in the sequence list . It cannot be avoided since the hash key is not
> longer equal to the first byte of the sequence. Tests are ok.

Can we use UTF-8-specific knowledge to accelerate the lookup?

For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.

Could you do two checks?

(a) Are we in a 1, 2, 3, 4, 5, or 6 bytes sequence?
(b) If in a 1 byte sequence use a one-byte table.
(c) If in a 2-6 byte sequence use the hash-table?


> filelist#C			  1.75%		23,396,200	23,805,700
> filelist#en_US.UTF-8		  1.42%		77,186,200	78,285,200
> lorem_ipsum#vi_VN.UTF-8		 -1.70%		1,680,740	1,652,110
> lorem_ipsum#ar_SA.UTF-8		 -7.71%		2,134,780	1,970,170
> lorem_ipsum#en_US.UTF-8	 	  2.61%		1,685,120	1,729,160
> lorem_ipsum#zh_CN.UTF-8		-88.66%		806,176		91,423
> lorem_ipsum#cs_CZ.UTF-8		 -4.89%		2,150,120	2,045,030
> lorem_ipsum#en_GB.UTF-8		 -1.47%		2,061,960	2,031,620
> lorem_ipsum#da_DK.UTF-8		  3.15%		1,703,710	1,757,390
> lorem_ipsum#pl_PL.UTF-8		  0.86%		1,634,890	1,648,870
> lorem_ipsum#fr_FR.UTF-8		 -2.06%		2,232,030	2,186,030
> lorem_ipsum#pt_PT.UTF-8		 -2.60%		2,238,410	2,180,210
> lorem_ipsum#el_GR.UTF-8		-34.52%		3,413,330	2,235,010
> lorem_ipsum#ru_RU.UTF-8		 -9.88%		2,403,370	2,165,950
> lorem_ipsum#iw_IL.UTF-8		 -9.56%		2,209,740	1,998,500
> lorem_ipsum#es_ES.UTF-8	 	  4.92%		1,983,470	2,081,050
> lorem_ipsum#hi_IN.UTF-8		-98.88%		220,453,000	2,458,620
> lorem_ipsum#sv_SE.UTF-8		  1.79%		1,645,370	1,674,760
> lorem_ipsum#hu_HU.UTF-8		  4.86%		3,179,620	3,334,290
> lorem_ipsum#tr_TR.UTF-8		-23.59%		2,473,330	1,889,870
> lorem_ipsum#is_IS.UTF-8		  2.49%		1,620,370	1,660,680
> lorem_ipsum#it_IT.UTF-8		 -2.67%		2,186,160	2,127,710
> lorem_ipsum#sr_RS.UTF-8		  2.70%		1,930,520	1,982,720
> lorem_ipsum#ja_JP.UTF-8		-97.43%		958,411		24,664
> wikipedia-th#en_US.UTF-8	-99.61%		10,511,700,000	40,577,100
> 
> The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and
> the included thai test. Performance degration for locales in the ASCII plane is still minor. It does increase the speed
> of strcoll for all languages that mostly use multiple byte UTF-8 encoding a lot. Note that it should affect the regex
> performance of these languages too, though there is no benchmark for that.


>> Will this always work? I'm just wondering about a user generated charmap that they
>> call 'utf8', which is the other common alias for instance where the dash is not valid
>> syntax. Probably not since the official name is UTF-8, and that's what you should use.
> 
> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".

Could you check to see what value 'code_set_name' uses internally?

Is it always 'UTF-8'? If it is, then the check you have is just fine.

Otherwise we should check for utf8 and UTF-8.

We don't know until you verify the values code_set_name can have.

> 
> 	* benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
> 	* benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
> 	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
> 	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
> 	* locale/localeinfo.h: Add enum collation_encoding_type.
> 	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
> 	* locale/programs/ld-collate.c (struct locale_collate_t):
> 	Expand mbheads array from 256 to 16384 entries.
> 	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
> 	(collate_output): Output larger table and sequences including first byte.
> 	(collate_output): Add encoding type info.
> 	* locale/weight.h (utf8index): New function to calculate 2 byte index.
> 	(findidx): Use 2-byte index for table if UTF-8 locale.
> 	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
>  	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
> 	* posix/regcomp.c (build_equiv_class): Likewise.
> 	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
> 	* posix/regexec.c (check_node_accept_bytes): Likewise.
> 	* string/strcoll_l.c (get_next_seq): Likewise.
> 	(STRCOLL): Call get_next_seq with encoding parameter.
> 	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
> 	(STRXFRM): Call find_idx with encoding parameter.
> 
> 
> diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c
> index 22ae87c..6ce5b2a 100644
> --- a/benchtests/bench-strcoll.c
> +++ b/benchtests/bench-strcoll.c
> @@ -53,7 +53,8 @@ static const char *const input_files[] = {
>    "lorem_ipsum#is_IS.UTF-8",
>    "lorem_ipsum#it_IT.UTF-8",
>    "lorem_ipsum#sr_RS.UTF-8",
> -  "lorem_ipsum#ja_JP.UTF-8"
> +  "lorem_ipsum#ja_JP.UTF-8",
> +  "wikipedia-th#en_US.UTF-8"

OK.

>  };
> 
>  #define TEXTFILE_DELIMITER " \n\r\t.,?!"
> diff --git a/locale/C-collate.c b/locale/C-collate.c
> index 8214ff5..5a9ed6a 100644
> --- a/locale/C-collate.c
> +++ b/locale/C-collate.c
> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>      /* _NL_COLLATE_COLLSEQWC */
>      { .string = (const char *) collseqwc },
>      /* _NL_COLLATE_CODESET */
> -    { .string = _nl_C_codeset }
> +    { .string = _nl_C_codeset },
> +    /* _NL_COLLATE_ENCODING_TYPE */
> +    { .word = __cet_8bit }

This makes locale-archive incompatible again right?

Users have to regenerate the locale-archive after the upgrade?

We need a release note for that under "Packaging Changes"
https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes

The release note should mention the binary locale-archive format
has changed and that locale-archive must be removed before upgrading
and then recompiled after upgrade.

>    }
>  };
> diff --git a/locale/categories.def b/locale/categories.def
> index d8a3ab8..cb57eae 100644
> --- a/locale/categories.def
> +++ b/locale/categories.def
> @@ -58,6 +58,7 @@ DEFINE_CATEGORY
>    DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB,        "collate-collseqmb",        std, wstring)
>    DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC,        "collate-collseqwc",        std, wstring)
>    DEFINE_ELEMENT (_NL_COLLATE_CODESET,		"collate-codeset",	    std, string)
> +  DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE,	"collate-encoding-type",    std, word)

OK.

>    ), NO_POSTLOAD)
> 
> 
> diff --git a/locale/langinfo.h b/locale/langinfo.h
> index 481e226..0906a6a 100644
> --- a/locale/langinfo.h
> +++ b/locale/langinfo.h
> @@ -255,6 +255,7 @@ enum
>    _NL_COLLATE_COLLSEQMB,
>    _NL_COLLATE_COLLSEQWC,
>    _NL_COLLATE_CODESET,
> +  _NL_COLLATE_ENCODING_TYPE,

OK.

>    _NL_NUM_LC_COLLATE,
> 
>    /* LC_CTYPE category: character classification.
> diff --git a/locale/localeinfo.h b/locale/localeinfo.h
> index 5c4e6ef..bd284df 100644
> --- a/locale/localeinfo.h
> +++ b/locale/localeinfo.h
> @@ -110,6 +110,14 @@ enum coll_sort_rule
>    sort_mask
>  };
> 
> +/* Collation encoding type.  */
> +enum collation_encoding_type
> +{
> +  __cet_other,
> +  __cet_8bit,
> +  __cet_utf8
> +};

OK.

> +
>  /* We can map the types of the entries into a few categories.  */
>  enum value_type
>  {
> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
> index 1e125f6..efaacf6 100644
> --- a/locale/programs/ld-collate.c
> +++ b/locale/programs/ld-collate.c
> @@ -32,6 +32,8 @@
>  #include "linereader.h"
>  #include "locfile.h"
>  #include "elem-hash.h"
> +#include "../localeinfo.h"
> +#include "../locale/weight.h"

OK.

> 
>  /* Uncomment the following line in the production version.  */
>  /* #define NDEBUG 1 */
> @@ -243,9 +245,10 @@ struct locale_collate_t
>       Therefore we keep all relevant input in a list.  */
>    struct locale_collate_t *next;
> 
> -  /* Arrays with heads of the list for each of the leading bytes in
> +  /* Arrays with heads of the list for the leading bytes in
>       the multibyte sequences.  */
> -  struct element_t *mbheads[256];

Needs a comment explaining why this is '256*256'.

> +  #define MBHEADS_SZ (256 * 256)
> +  struct element_t *mbheads[MBHEADS_SZ];

OK.

> 
>    /* Arrays with heads of the list for each of the leading bytes in
>       the multibyte sequences.  */
> @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>    struct section_list *sect;
>    int ruleidx;
>    int nr_wide_elems = 0;
> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

As discussed above you need to verify that code_set_name is always "UTF-8" if it's
a UTF-8 charmap. If it is always "UTF-8" then this check is just fine. I will note
that `locale -a -v` always prints 'UTF-8' for codeset and never 'utf8', but the
printed name is still 'C.utf8' or 'es_EC.utf8' etc because glibc normalizes the name
internally (downcase and strip dashes e.g. pt_PT.iso88591).

I suggest just putting a breakpoint there and verify that it's UTF-8 for a few
locale names.

You could then avoid strcmp:

bool is_utf8 == strlen (charmap->code_set_name) == 5 
		? (charmap->code_set_name[0] == 'U'
		   && charmap->code_set_name[1] == 'T'
		   && charmap->code_set_name[2] == 'F'
		   && charmap->code_set_name[3] == '-'
		   && charmap->code_set_name[4] == '8')
		: 0;

Since more than 5 chars means it's not 'UTF-8' e.g. BIG5-HKSCS, or ISO-8859-1.
The only locales with 5 or less are GBK and PT154. So you could skip the strcmp
in all cases, but still need a comparison to disambiguate PT154 vs. UTF-8.

> 
>    if (collate == NULL)
>      {
> @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>  	  struct element_t *lastp = NULL;
> 
>  	  /* Find the point where to insert in the list.  */
> -	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
> +	  uint16_t index = ((unsigned char *) runp->mbs)[0];
> +
> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.  */
> +	  if (is_utf8 && index > 0)
> +	    {
> +	      index = utf8index((unsigned char *) runp->mbs, runp->nmbs);
> +	      if (index == 0)
> +		{
> +		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
> +						  _("\
> +malformed UTF-8 character in `%s'"), runp->name););
> +		  goto dont_insert;
> +		}
> +	    }
> +
> +	  eptr = &collate->mbheads[index];

OK.

>  	  while (*eptr != NULL)
>  	    {
>  	      if ((*eptr)->nmbs < runp->nmbs)
> @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name);
> 
>    /* Find out whether any of the `mbheads' entries is unset.  In this
>       case we use the UNDEFINED entry.  */
> -  for (i = 1; i < 256; ++i)
> +  for (i = 1; i < MBHEADS_SZ; ++i)

OK.

>      if (collate->mbheads[i] == NULL)
>        {
>  	need_undefined = 1;
> @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
>    struct locale_file file;
>    size_t ch;
> -  int32_t tablemb[256];
> +  int32_t tablemb[MBHEADS_SZ];

OK.

>    struct obstack weightpool;
>    struct obstack extrapool;
>    struct obstack indirectpool;
> @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	  /* The words have to be handled specially.  */
>  	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
>  	    add_locale_uint32 (&file, 0);
> +	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
> +	    add_locale_uint32 (&file, __cet_other);

OK.

>  	  else
>  	    add_locale_empty (&file);
>  	}
> @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    if (collate->undefined.used_in_level != 0)
>      output_weight (&weightpool, collate, &collate->undefined);
> 
> -  for (ch = 1; ch < 256; ++ch)
> +  for (ch = 1; ch < MBHEADS_SZ; ++ch)

OK.

>      if (collate->mbheads[ch]->mbnext == NULL
>  	&& collate->mbheads[ch]->nmbs <= 1)
>        {
> @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	   and add only one index into the weight table.  We can find the
>  	   consecutive entries since they are also consecutive in the list.  */
>  	struct element_t *runp = collate->mbheads[ch];
> -	struct element_t *lastp;
> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* Compute how much space we will need.  */
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + 2 * (runp->nmbs - 1));
> +					  + 2 * runp->nmbs);
>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
> @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		/* Now walk backward from here to the beginning.  */
>  		curp = runp;
> 
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
> -		for (i = 1; i < curp->nmbs; ++i)
> +		assert (runp->nmbs <= 255);

Another magic constant. Where does 256 or 255 come from?

These are struct element_t structures with entries chained from there.

Why do we limit them to 255? When might they be larger?

We should use a define here and document why the limit exists.

> +		obstack_1grow_fast (&extrapool, curp->nmbs);
> +		for (i = 0; i < curp->nmbs; ++i)

OK.

>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
> 
>  		/* Now find the end of the consecutive sequence and
> @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* And add the end byte sequence.  Without length this
>  		   time.  */
> -		for (i = 1; i < curp->nmbs; ++i)
> +		for (i = 0; i < curp->nmbs; ++i)
>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
>  	      }
>  	    else
> @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		weightidx = output_weight (&weightpool, collate, runp);
> 
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + runp->nmbs - 1);
> +					  + runp->nmbs);
>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
>  		obstack_int32_grow_fast (&extrapool, weightidx);
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
> +		assert (runp->nmbs <= 255);
> +		obstack_1grow_fast (&extrapool, runp->nmbs);
> 
> -		for (i = 1; i < runp->nmbs; ++i)
> +		for (i = 0; i < runp->nmbs; ++i)
>  		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
>  	      }
> 
> @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	      obstack_1grow_fast (&extrapool, '\0');
> 
>  	    /* Next entry.  */
> -	    lastp = runp;
>  	    runp = runp->mbnext;
>  	  }
>  	while (runp != NULL);
> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> -	/* If the final entry in the list is not a single character we
> -	   add an UNDEFINED entry here.  */
> -	if (lastp->nmbs != 1)
> -	  {
> -	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> -	    obstack_make_room (&extrapool, added);
> +	/* Add an UNDEFINED entry at the end of the list.  */
> +	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> +	obstack_make_room (&extrapool, added);
> 
> -	    obstack_int32_grow_fast (&extrapool, 0);
> -	    /* XXX What rule? We just pick the first.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> -	    /* Length is zero.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> +	obstack_int32_grow_fast (&extrapool, 0);
> +	/* XXX What rule? We just pick the first.  */
> +	obstack_1grow_fast (&extrapool, 0);
> +	/* Length is zero.  */
> +	obstack_1grow_fast (&extrapool, 0);
> 
> -	    /* Add alignment bytes if necessary.  */
> -	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> -	      obstack_1grow_fast (&extrapool, '\0');
> -	  }
> +	/* Add alignment bytes if necessary.  */
> +	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> +	  obstack_1grow_fast (&extrapool, '\0');
>        }
> 
>    /* Add padding to the tables if necessary.  */
> @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>      obstack_1grow (&weightpool, 0);
> 
>    /* Now add the four tables.  */
> -  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
> +  add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ);
>    add_locale_raw_obstack (&file, &weightpool);
>    add_locale_raw_obstack (&file, &extrapool);
>    add_locale_raw_obstack (&file, &indirectpool);
> @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    add_locale_raw_data (&file, collate->mbseqorder, 256);
>    add_locale_collseq_table (&file, &collate->wcseqorder);
>    add_locale_string (&file, charmap->code_set_name);
> +  if (strcmp (charmap->code_set_name, "UTF-8") == 0)

Similar discussion as above regarding UTF-8 and optimziation.

> +    add_locale_uint32 (&file, __cet_utf8);
> +  else if (charmap->mb_cur_max == 1)
> +    add_locale_uint32 (&file, __cet_8bit);
> +  else
> +    add_locale_uint32 (&file, __cet_other);
>    write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);
> 
>    obstack_free (&weightpool, NULL);
> diff --git a/locale/weight.h b/locale/weight.h
> index c99730c..5b4103b 100644
> --- a/locale/weight.h
> +++ b/locale/weight.h
> @@ -19,26 +19,81 @@
>  #ifndef _WEIGHT_H_
>  #define _WEIGHT_H_	1
> 
> +/* Generate 2 byte code for the next UTF-8 encoded char.
> +   Returns zero on UTF-8 encoding errors.  */

Plase change the prototype to be:

int utf8index (const unsigned char *cp, size_t len, uint16_t *index);

Return 0 on success.
Return !0 on error.
Store index into *index.

This way we avoid conflating index value 0 and error.

> +static __always_inline uint16_t
> +utf8index (const unsigned char *cp, size_t len)
> +{
> +  uint16_t index = cp[0];
> +
> +  if (index >= 0x80)
> +    {
> +      if (index < 0xE0)
> +	{
> +	  if (len < 2)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  index = (index << 6) + byte2 - 0x3080;
> +	}
> +      else if (index < 0xF0)
> +	{
> +	  if (len < 3)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  uint16_t byte3 = cp[2];
> +	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
> +	}
> +      else if (index < 0xF8)
> +	{
> +	  if (len < 4)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  uint16_t byte3 = cp[2];
> +	  uint16_t byte4 = cp[3];
> +	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;

I believe this is technically inaccurate since it allows all 4-byte
sequences, when in reality the limit is at U+10FFFF?

You need not fix it, but we should add a comment saying that for the
sake of simpler code we're allowing those 4-byte sequences which are
not normally accepted.

> +	}
> +      else
> +	return 0;

OK.

> +    }
> +
> +  return index;
> +}
> +
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t locale_encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const unsigned char *extra,
>  	 const unsigned char **cpp, size_t len)
>  {
> -  int_fast32_t i = table[*(*cpp)++];
>    const unsigned char *cp;
>    const unsigned char *usrc;
> +  uint16_t index = (*cpp)[0];
> +
> +  /* Special handling of UTF-8: Generate a 2-byte index for table.  */
> +  if (index >= 0x80 && locale_encoding == __cet_utf8)
> +    {
> +      index = utf8index(*cpp, len);
> +      if (index == 0)
> +	{
> +	  *cpp += 1;
> +	  return 0;
> +	}
> +    }

OK.

> 
> +  int_fast32_t i = table[index];
>    if (i >= 0)
> -    /* This is an index into the weight table.  Cool.  */
> -    return i;
> +    {
> +      /* This is an index into the weight table.  Cool.  */
> +      *cpp += 1;
> +      return i;
> +    }
> 
>    /* Oh well, more than one sequence starting with this byte.
>       Search for the correct one.  */
>    cp = &extra[-i];
>    usrc = *cpp;
> -  --len;
>    while (1)
>      {
>        size_t nhere;
> @@ -57,8 +112,7 @@ findidx (const int32_t *table,
>  	  /* It is a single character.  If it matches we found our
>  	     index.  Note that at the end of each list there is an
>  	     entry of length zero which represents the single byte
> -	     sequence.  The first (and here only) byte was tested
> -	     already.  */
> +	     sequence.  */
>  	  size_t cnt;
> 
>  	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
> @@ -68,7 +122,7 @@ findidx (const int32_t *table,
>  	  if (cnt == nhere)
>  	    {
>  	      /* Found it.  */
> -	      *cpp += nhere;
> +	      *cpp += nhere > 0 ? nhere : 1;
>  	      return i;
>  	    }
> 
> @@ -127,7 +181,7 @@ findidx (const int32_t *table,
>  	      while (++cnt < nhere);
>  	    }
> 
> -	  *cpp += nhere;
> +	  *cpp += nhere > 0 ? nhere : 1;
>  	  return indirect[-i + offset];
>  	}
>      }

OK.

> diff --git a/locale/weightwc.h b/locale/weightwc.h
> index ab26482..4101dc8 100644
> --- a/locale/weightwc.h
> +++ b/locale/weightwc.h
> @@ -21,7 +21,8 @@
> 
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const wint_t *extra,
>  	 const wint_t **cpp, size_t len)
> diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
> index 229904e..07b60fb 100644
> --- a/posix/fnmatch_loop.c
> +++ b/posix/fnmatch_loop.c
> @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			const int32_t *indirect;
>  			int32_t idx;
>  			const UCHAR *cp = (const UCHAR *) &str;
> +			uint_fast32_t encoding =
> +			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
> 
>  # if WIDE_CHAR_VERSION
>  			table = (const int32_t *)
> @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
>  # endif
> 
> -			idx = FINDIDX (table, indirect, extra, &cp, 1);
> +			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
>  			if (idx != 0)
>  			  {
>  			    /* We found a table entry.  Now see whether the
> @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			    int32_t idx2;
>  			    const UCHAR *np = (const UCHAR *) n;
> 
> -			    idx2 = FINDIDX (table, indirect, extra,
> +			    idx2 = FINDIDX (encoding, table, indirect, extra,
>  					    &np, string_end - n);
>  			    if (idx2 != 0
>  				&& (idx >> 24) == (idx2 >> 24)

OK.

> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index b6126b7..011ef92 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>    uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>    if (nrules != 0)
>      {
> +      uint_fast32_t encoding;
>        const int32_t *table, *indirect;
>        const unsigned char *weights, *extra, *cp;
>        unsigned char char_buf[2];
> @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>        size_t len;
>        /* Calculate the index for equivalence class.  */
>        cp = name;
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
>  					       _NL_COLLATE_WEIGHTMB);
> @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  						   _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
> -      idx1 = findidx (table, indirect, extra, &cp, -1);
> +      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
>        if (BE (idx1 == 0 || *cp != '\0', 0))
>  	/* This isn't a valid character.  */
>  	return REG_ECOLLATE;
> @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  	{
>  	  char_buf[0] = ch;
>  	  cp = char_buf;
> -	  idx2 = findidx (table, indirect, extra, &cp, 1);
> +	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
>  /*
>  	  idx2 = table[ch];
>  */

OK

> diff --git a/posix/regex_internal.h b/posix/regex_internal.h
> index 02e040b..993c7c3 100644
> --- a/posix/regex_internal.h
> +++ b/posix/regex_internal.h
> @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx)
>  #  ifdef _LIBC
>    const unsigned char *p, *extra;
>    const int32_t *table, *indirect;
> +  uint_fast32_t encoding;
>    uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> 
>    if (nrules != 0)
>      {
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        extra = (const unsigned char *)
>  	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
>        p = pstr->mbs + idx;
> -      findidx (table, indirect, extra, &p, pstr->len - idx);
> +      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
>        return p - pstr->mbs - idx;
>      }
>    else

OK.

> diff --git a/posix/regexec.c b/posix/regexec.c
> index ec46c3a..3d3ad9a 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>        if (nrules != 0)
>  	{
>  	  unsigned int in_collseq = 0;
> +	  uint_fast32_t encoding;
>  	  const int32_t *table, *indirect;
>  	  const unsigned char *weights, *extra;
>  	  const char *collseqwc;
> @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  	  if (cset->nequiv_classes)
>  	    {
>  	      const unsigned char *cp = pin;
> +	      encoding =
> +		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>  	      table = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>  	      weights = (const unsigned char *)
> @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>  	      indirect = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
> -	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
> +	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
> +				     elem_len);
>  	      if (idx > 0)
>  		for (i = 0; i < cset->nequiv_classes; ++i)
>  		  {

OK.

> diff --git a/string/strcoll_l.c b/string/strcoll_l.c
> index 4d1e3ab..2c2cab0 100644
> --- a/string/strcoll_l.c
> +++ b/string/strcoll_l.c
> @@ -63,9 +63,9 @@ typedef struct
>  /* Get next sequence.  Traverse the string as required.  */
>  static __always_inline void
>  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> -	      const USTRING_TYPE *weights, const int32_t *table,
> -	      const USTRING_TYPE *extra, const int32_t *indirect,
> -	      int pass)
> +	      const USTRING_TYPE *weights, uint_fast32_t encoding,
> +	      const int32_t *table, const USTRING_TYPE *extra,
> +	      const int32_t *indirect, int pass)
>  {
>    size_t val = seq->val = 0;
>    int len = seq->len;
> @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
>  	      us = seq->back_us;
>  	      while (i < backw)
>  		{
> -		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +		  int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  		  idx = tmp & 0xffffff;
>  		  i++;
>  		}
> @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> 
>  	  while (*us != L('\0'))
>  	    {
> -	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +	      int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  	      unsigned char rule = tmp >> 24;
>  	      prev_idx = idx;
>  	      idx = tmp & 0xffffff;
> @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
>    const USTRING_TYPE *weights;
>    const USTRING_TYPE *extra;
>    const int32_t *indirect;
> +  uint_fast32_t encoding;
> 
>    if (nrules == 0)
>      return STRCMP (s1, s2);
> @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string;
>    indirect = (const int32_t *)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string;
> +  encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> +
> 
>    assert (((uintptr_t) table) % __alignof__ (table[0]) == 0);
>    assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0);
> @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
> 
>        while (1)
>  	{
> -	  get_next_seq (&seq1, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);
> -	  get_next_seq (&seq2, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);
>  	  /* See whether any or both strings are empty.  */
>  	  if (seq1.len == 0 || seq2.len == 0)

OK.

> diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
> index 22e24d3..5c89b15 100644
> --- a/string/strxfrm_l.c
> +++ b/string/strxfrm_l.c
> @@ -53,6 +53,7 @@ typedef struct
>    uint_fast32_t nrules;
>    unsigned char *rulesets;
>    USTRING_TYPE *weights;
> +  uint_fast32_t encoding;
>    int32_t *table;
>    USTRING_TYPE *extra;
>    int32_t *indirect;
> @@ -100,8 +101,8 @@ static __always_inline size_t
>  find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
>  	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
>  {
> -  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
> -			 -1);
> +  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
> +			 l_data->extra, us, -1);
>    *rule_idx = tmp >> 24;
>    int32_t idx = tmp & 0xffffff;
>    size_t len = l_data->weights[idx++];
> @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
>    /* Get the locale data.  */
>    l_data.rulesets = (unsigned char *)
>      current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
> +  l_data.encoding =
> +    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
>    l_data.table = (int32_t *)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
>    l_data.weights = (USTRING_TYPE *)
> @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
> 
>    do
>      {
> -      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
> -			     -1);
> +      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
> +			     l_data.extra, &cur, -1);
>        rulearr[idxmax] = tmp >> 24;
>        idxarr[idxmax] = tmp & 0xffffff;
> 

OK.

-- 
Cheers,
Carlos.
Follow-Ups:
- Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
  - From: Joseph Myers
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]