This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



[PING^2][PATCH V3][BZ #18441] fix sorting multibyte charsets with an improper locale


Ping!

On 13.07.2015 at 10:25, Leonhard Holz wrote:
> Ping!
> 
> On 06.07.2015 at 23:39, Leonhard Holz wrote:
>> Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
>> Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.
>>
>> In BZ #18441 sorting a Thai text with the en_US.UTF-8 locale causes a performance
>> regression. The cause of the problem is that
>>
>> a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
>> sort weight, which causes the comparison to check the whole string instead of
>> breaking off early, and
>>
>> b) the sequence-to-weight list is partitioned by the first byte of the first
>> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
>> they tend to have the same starting byte (e.g. all Thai chars start with E0).
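
Point b) can be checked with a small standalone sketch. It is not glibc code:
utf8_encode and count_first_bytes are hypothetical helper names, and the Thai
range U+0E00..U+0E7F is taken from the Unicode block layout. The sketch counts
how many distinct leading bytes a range of code points produces in UTF-8,
which is the number of buckets a table indexed by the first byte would spread
that range over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-8 into OUT and return the number of bytes
   written (1 to 4).  Hypothetical helper, not a glibc function.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  assert (cp <= 0x10FFFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Count the distinct first bytes among the UTF-8 encodings of the code
   points FIRST..LAST, i.e. the number of buckets used by a table that is
   indexed by the first byte only.  */
static unsigned int
count_first_bytes (uint32_t first, uint32_t last)
{
  unsigned char seen[256] = { 0 };
  unsigned int distinct = 0;

  for (uint32_t cp = first; cp <= last; ++cp)
    {
      unsigned char buf[4];
      utf8_encode (cp, buf);
      if (!seen[buf[0]])
        {
          seen[buf[0]] = 1;
          ++distinct;
        }
    }
  return distinct;
}
```

count_first_bytes (0x0E00, 0x0E7F) yields 1, because every code point in the
Thai block encodes with the leading byte E0, while the 26 letters
count_first_bytes (0x41, 0x5A) each get their own bucket.
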
>>
>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>> better hash key. My first try was to somehow "fold" a multibyte character into one
>> byte, but that worsened the overall performance a lot. Enhancing the table to 2-byte
>> keys works much better while needing a reasonable amount of extra memory.
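
One way such a wider key could be derived is sketched below. This is only an
illustration under assumptions: the message does not show the key computation
used by the patch, so utf8_decode_first, table_key and the modulo reduction
are hypothetical. Only the table size of 16384 entries comes from the
ChangeLog entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of table entries; 16384 is the enlarged mbheads size named in
   the ChangeLog.  Everything else here is an assumption.  */
#define TABLE_SIZE 16384

/* Decode the first UTF-8 character of S (LEN bytes available) to its code
   point.  Malformed or truncated input falls back to the first byte.
   Continuation bytes are not validated in this sketch.  */
static uint32_t
utf8_decode_first (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  if (len == 0)
    return 0;

  unsigned char b0 = s[0];
  if (b0 < 0x80)
    return b0;
  if ((b0 & 0xE0) == 0xC0 && len >= 2)
    return ((uint32_t) (b0 & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
  if ((b0 & 0xF0) == 0xE0 && len >= 3)
    return ((uint32_t) (b0 & 0x0F) << 12)
           | ((uint32_t) (s[1] & 0x3F) << 6)
           | (uint32_t) (s[2] & 0x3F);
  if ((b0 & 0xF8) == 0xF0 && len >= 4)
    return ((uint32_t) (b0 & 0x07) << 18)
           | ((uint32_t) (s[1] & 0x3F) << 12)
           | ((uint32_t) (s[2] & 0x3F) << 6)
           | (uint32_t) (s[3] & 0x3F);
  return b0;
}

/* Hypothetical hash key: the code point reduced to the table size.  */
static unsigned int
table_key (const unsigned char *s, size_t len)
{
  return utf8_decode_first (s, len) % TABLE_SIZE;
}
```

With this key the Thai letters U+0E01 (E0 B8 81) and U+0E02 (E0 B8 82) land
in different slots even though they share the first two bytes, and ASCII
characters keep their byte value as the key. Distinct characters can still
collide, for example when two code points differ by a multiple of 16384.
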
>>
>> The patch vastly improves the performance for languages with multibyte chars (see
>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
>> get a bit slower because of the extra check for the first byte while finding the right
>> sequence in the sequence list. It cannot be avoided since the hash key is no
>> longer equal to the first byte of the sequence. Tests are OK.
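
Why the first byte has to be compared again can be shown with a simplified
bucket lookup. The layout below is hypothetical and much simpler than the
real TABLEMB/extra format: struct seq_entry, find_in_bucket and demo_bucket
are made-up names, and the weight indices 1 and 2 are placeholders. The
sketch assumes a key that can map sequences with different first bytes to the
same slot, such as a code point reduced modulo the table size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stored sequence in a bucket (hypothetical layout).  BYTES holds the
   complete sequence, including its first byte.  */
struct seq_entry
{
  const unsigned char *bytes;
  size_t len;
  int weight_idx;
};

/* Example bucket: U+0E2D (E0 B8 AD) and U+4E2D (E4 B8 AD) differ by 0x4000
   and so share a slot under a modulo-16384 key.  */
static const struct seq_entry demo_bucket[] =
{
  { (const unsigned char *) "\xE0\xB8\xAD", 3, 1 },
  { (const unsigned char *) "\xE4\xB8\xAD", 3, 2 },
};

/* Return the weight index of the first entry in BUCKET that is a prefix of
   S, or -1 if none matches.  The comparison starts at byte 0 because the
   key no longer guarantees that all entries share the first byte.  */
static int
find_in_bucket (const struct seq_entry *bucket, size_t n,
                const unsigned char *s, size_t len)
{
  assert (bucket != NULL || n == 0);
  for (size_t i = 0; i < n; ++i)
    if (bucket[i].len <= len
        && memcmp (bucket[i].bytes, s, bucket[i].len) == 0)
      return bucket[i].weight_idx;
  return -1;
}
```

With a table keyed by the first byte alone, the comparison could start at the
second byte. Here, skipping byte 0 would make both demo entries match the
same input, which is the extra per-lookup cost described above for one-byte
character sets.
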
>>
>> filelist#C                  1.73%
>> filelist#en_US.UTF-8        0.54%
>> lorem_ipsum#vi_VN.UTF-8     1.90%
>> lorem_ipsum#ar_SA.UTF-8   -12.06%
>> lorem_ipsum#en_US.UTF-8     1.15%
>> lorem_ipsum#zh_CN.UTF-8   -86.32%
>> lorem_ipsum#cs_CZ.UTF-8   -11.42%
>> lorem_ipsum#en_GB.UTF-8    -3.09%
>> lorem_ipsum#da_DK.UTF-8     6.70%
>> lorem_ipsum#pl_PL.UTF-8    -1.04%
>> lorem_ipsum#fr_FR.UTF-8    -1.22%
>> lorem_ipsum#pt_PT.UTF-8     0.47%
>> lorem_ipsum#el_GR.UTF-8   -29.40%
>> lorem_ipsum#ru_RU.UTF-8   -11.79%
>> lorem_ipsum#iw_IL.UTF-8    -1.39%
>> lorem_ipsum#es_ES.UTF-8     3.91%
>> lorem_ipsum#hi_IN.UTF-8   -98.26%
>> lorem_ipsum#sv_SE.UTF-8     5.61%
>> lorem_ipsum#hu_HU.UTF-8    15.32%
>> lorem_ipsum#tr_TR.UTF-8    -3.51%
>> lorem_ipsum#is_IS.UTF-8     5.62%
>> lorem_ipsum#it_IT.UTF-8    -5.97%
>> lorem_ipsum#sr_RS.UTF-8    -1.19%
>> lorem_ipsum#ja_JP.UTF-8   -98.11%
>> wikipedia-th#en_US.UTF-8  -99.63%
>>
>>
>> 	* locale/programs/ld-collate.c (struct locale_collate_t):
>> 	Expand mbheads array from 256 to 16384 entries.
>> 	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
>> 	(collate_output): Output larger table and sequences including first byte.
>> 	* locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
>> 	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
>> 	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
>> 	* posix/regcomp.c (build_equiv_class): Likewise.
>> 	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
>> 	* posix/regexec.c (check_node_accept_bytes): Likewise.
>> 	* string/strcoll_l.c (get_next_seq): Likewise.
>> 	(STRCOLL): Call get_next_seq with encoding parameter.
>> 	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
>> 	(STRXFRM): Call find_idx with encoding parameter.
>>

