This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
[PING^2][PATCH V3][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: libc-alpha at sourceware dot org
- Date: Mon, 20 Jul 2015 23:11:08 +0200
- Subject: [PING^2][PATCH V3][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <559AF57C dot 8010608 at web dot de> <55A37617 dot 6020502 at web dot de>
Ping!
Am 13.07.2015 um 10:25 schrieb Leonhard Holz:
> Ping!
>
> Am 06.07.2015 um 23:39 schrieb Leonhard Holz:
>> Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
>> Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.
>>
>> In BZ #18441, sorting a Thai text with the en_US.UTF-8 locale causes a performance
>> regression. The cause of the problem is that
>>
>> a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
>> sort weight, which causes the comparison to check the whole string instead of
>> stopping early, and
>>
>> b) the sequence-to-weight list is partitioned by the first byte of the first
>> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
>> they tend to share a starting byte (e.g. all Thai chars start with E0).
>>
>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>> better hash key. My first try was to somehow "fold" a multibyte character into one
>> byte, but that worsened the overall performance a lot. Enlarging the table to 2-byte
>> keys works much better while needing a reasonable amount of extra memory.
>>
>> The patch vastly improves the performance for languages with multibyte chars (see
>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
>> get a bit slower because of the extra check for the first byte while finding the right
>> sequence in the sequence list. This cannot be avoided since the hash key is no
>> longer equal to the first byte of the sequence. Tests are ok.
>>
>> filelist#C                  1.73%
>> filelist#en_US.UTF-8        0.54%
>> lorem_ipsum#vi_VN.UTF-8     1.90%
>> lorem_ipsum#ar_SA.UTF-8   -12.06%
>> lorem_ipsum#en_US.UTF-8     1.15%
>> lorem_ipsum#zh_CN.UTF-8   -86.32%
>> lorem_ipsum#cs_CZ.UTF-8   -11.42%
>> lorem_ipsum#en_GB.UTF-8    -3.09%
>> lorem_ipsum#da_DK.UTF-8     6.70%
>> lorem_ipsum#pl_PL.UTF-8    -1.04%
>> lorem_ipsum#fr_FR.UTF-8    -1.22%
>> lorem_ipsum#pt_PT.UTF-8     0.47%
>> lorem_ipsum#el_GR.UTF-8   -29.40%
>> lorem_ipsum#ru_RU.UTF-8   -11.79%
>> lorem_ipsum#iw_IL.UTF-8    -1.39%
>> lorem_ipsum#es_ES.UTF-8     3.91%
>> lorem_ipsum#hi_IN.UTF-8   -98.26%
>> lorem_ipsum#sv_SE.UTF-8     5.61%
>> lorem_ipsum#hu_HU.UTF-8    15.32%
>> lorem_ipsum#tr_TR.UTF-8    -3.51%
>> lorem_ipsum#is_IS.UTF-8     5.62%
>> lorem_ipsum#it_IT.UTF-8    -5.97%
>> lorem_ipsum#sr_RS.UTF-8    -1.19%
>> lorem_ipsum#ja_JP.UTF-8   -98.11%
>> wikipedia-th#en_US.UTF-8  -99.63%
>>
>>
>> * locale/programs/ld-collate.c (struct locale_collate_t):
>> Expand mbheads array from 256 to 16384 entries.
>> (collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
>> (collate_output): Output larger table and sequences including first byte.
>> * locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
>> * locale/weightwc.h (findidx): Accept encoding parameter, not used.
>> * posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
>> * posix/regcomp.c (build_equiv_class): Likewise.
>> * posix/regex_internal.h (re_string_elem_size_at): Likewise.
>> * posix/regexec.c (check_node_accept_bytes): Likewise.
>> * string/strcoll_l.c (get_next_seq): Likewise.
>> (STRCOLL): Call get_next_seq with encoding parameter.
>> * string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
>> (STRXFRM): Call find_idx with encoding parameter.
>>
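
For readers following the thread, the 2-byte key idea from the quoted description can be sketched roughly as below. This is only an illustration and not the code from the patch: the helper names table_key and bucket_entry_matches are hypothetical, and the exact bit layout (lead byte in the upper 8 bits, low 6 bits of the second byte below it) is an assumption chosen so that every key fits the 16384-entry table named in the ChangeLog.

```c
#include <assert.h>
#include <stddef.h>

/* 2^14 entries, the mbheads size given in the ChangeLog.  */
#define TABLE_SIZE 16384

/* Hypothetical helper: derive a table key from the first one or two
   bytes of a sequence.  The bit layout is an assumption made for this
   sketch, not taken from the patch.  */
static unsigned int
table_key (const unsigned char *s, size_t len)
{
  assert (len > 0);

  /* One-byte (ASCII) characters, or a truncated sequence: the key is
     simply the first byte, as with the old 256-entry table.  */
  if (s[0] < 0x80 || len < 2)
    return s[0];

  /* Multibyte character: 8 bits from the lead byte plus the 6 payload
     bits of the second byte give a 14-bit key (maximum 16383).  */
  return ((unsigned int) s[0] << 6) | (s[1] & 0x3f);
}

/* Hypothetical helper: because the key is no longer identical to the
   first byte, a lookup has to confirm the first byte of the stored
   sequence before accepting a bucket entry.  */
static int
bucket_entry_matches (const unsigned char *stored, const unsigned char *s)
{
  return stored[0] == s[0];
}
```

Under this layout, Thai ก (bytes E0 B8 81) and Devanagari अ (bytes E0 A4 85) share the first byte E0, so they would sit in the same list when partitioning by first byte alone, but they get the distinct keys 14392 and 14372 here. The first-byte comparison in bucket_entry_matches corresponds to the extra check that the description says slows down some one-byte locales slightly.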