This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale


On Sat, Jun 27, 2015 at 03:42:00PM +0200, Leonhard Holz wrote:
> In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that
> 
> a) en_US.UTF-8 has no informations for thai chars and so always reports a zero
> sort weight which causes the comparison to check the whole string instead of
> breaking up early and
> 
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
> they tend to have an equal starting byte (e.g. all thai chars start with E0).
> 
> The approach of the patch is to interprete TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte but that worsened the overall performance a lot. Enhancing the table to 2
> byte keys works much better while needing a reasonable amount of extra memory.
> 
> The patch vastly improves the performance of languages with multibyte chars (see
> zh_CN and hi_IN below). A side effect is that languages with one-byte chars get a
> bit slower because of the extra check for the first byte while finding the right
> sequence in the sequence list . It cannot be avoided since the hash key is not
> longer equal to the first byte of the sequence. Tests are ok.
>
First could you check if utf overhead is noticable or not? Compile that
again with passing locale replaced by hardcoded utf and see if there is
noticable difference. It makes sense to force gcc create utf8 variant of
strcoll and other and do only one check at start of function. Hack above
would tell you impact.

> +	      else if ((index & 0xE0) == 0xC0)
> +		{
> +		  if (runp->nmbs < 2)
> +		    goto utf8_error;
> +		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
> +		  index = ((index & 0x1F) << 6) | (byte2 & 0x3F);
> +		}
Masking makes no sense for hash function as it doesn't matter if there
is 0 or zero. Here problably just use
*((uint16_t *) runp->mbs)

> +	      else if ((index & 0xF0) == 0xE0)
> +		{
> +		  if (runp->nmbs < 3)
> +		    goto utf8_error;
> +		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
> +		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
> +		  index = ((index & 0xF) << 12) | ((byte2 & 0x3F) << 6) |
> +			  (byte3 & 0x3F)
Again here ands are unnecessary, when you use xor instead or. 
(index << 12) ^ *((uint16_t *) (runp->mbs+1));
> +		}
> +	      else if ((index & 0xF8) == 0xF0)
> +		{
> +		  if (runp->nmbs < 4)
> +		    goto utf8_error;
> +		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
> +		  uint16_t byte4 = ((unsigned char *) runp->mbs)[3];
> +		  index = ((index & 0xF) << 12) | ((byte3 & 0x3F) << 6) |
> +			  (byte4 & 0x3F)
As you ignore second byte value you could use 
(index << 12) ^ *((uint16_t *) (runp->mbs+2));


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]