This is the mail archive of the
mailing list for the GNU libc locales project.
Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
- From: Luis Javier Merino <ninjalj at gmail dot com>
- To: Egmont Koblinger <egmont at gmail dot com>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, libc-locales <libc-locales at sourceware dot org>
- Date: Sun, 5 Feb 2017 13:16:40 +0100
- Subject: Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
- Authentication-results: sourceware.org; auth=none
- References: <CAGWcZkLbhdJWRZLDKHXrHf2875pKLushYJon7YusGu=zhpO7mQ@mail.gmail.com> <CAGWcZkLsGUcfmw6X4VT7sWZX5juh5WFkJe=ChV+K2myjDmbuEA@mail.gmail.com> <20160421061349.GM5369@vapier.lan> <CAGWcZkK=UXGDEG6moxcyG9PJz4D=V=kVR6G1u=uhSFqgu+m+oA@mail.gmail.com> <CAGWcZkLyq8XJ5utRbZ6A58BhpdZdhrAi7m-TGa_W367ymKofirstname.lastname@example.org> <email@example.com> <CAGWcZkJ_m26UxF=+P7U-Kdw+6msvTE_e=TQNOt-_F1zihjheAQ@mail.gmail.com> <CABjvSdgNJRDUNBOExm9=Sgyydre4jLyV9hdp+6Z-kom-y9jKOw@mail.gmail.com> <CAGWcZk+iQCPg0uVw40Mh8TPzE+yus6j3X3S=gQHCeTBN+p2bUA@mail.gmail.com> <firstname.lastname@example.org> <CAGWcZk+jjq2VXK00Fn=t-xtt=ek7yvxegKc3+_my=XZFdcuDXQ@mail.gmail.com>
I've had a further look at Egmont's patch. It does the following:
- It reverts b008d4c (the "fix" for BZ#13547, which broke collation in
other ways). Reverting this brings collation more in line with ICU.
- It defines DIACRIT_FORWARD. This brings collation more in line with ICU.
- It fixes BZ#18587, defining collating symbols <MIN-MIN> and
<CAP-CAP>. Before, collation went cs (<MIN>) < cS (<MIN-CAP>) < CS
(<CAP>) < Cs (<CAP-MIN>). After, it goes cs (<MIN-MIN>) < cS
(<MIN-CAP>) < Cs (<CAP-MIN>) < CS (<CAP-CAP>). This brings collation a
little more in line with ICU.
- It introduces <SINGLE_OR_COMPOUND> and <COMPOUND> collating symbols,
and assigns secondary weights to digraphs/trigraphs and contracted
digraphs/trigraphs using them. <SINGLE_OR_COMPOUND> is ordered before
<COMPOUND>, which makes short forms collate before long forms. b008d4c
already made short forms collate before long forms, by ordering
<c_or_cs> and the like before <cs> and the like. ICU doesn't collate
long forms before short forms until level 3. Perl collates them
stably, i.e. just as they appear in the input. In any case, ordering
<COMPOUND> before <SINGLE_OR_COMPOUND> would give ICU's ordering,
which I'm not at all sure is better. Applying Egmont's patch doesn't
diverge from ICU any further than b008d4c did, and fixes other things.
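
To make the BZ#18587 change concrete, here is a minimal Python sketch of the
case-level ordering before and after the patch. Only the symbol names and
their relative order come from the description above; the numeric weights
and the helper names (case_symbol, sort_digraphs) are hypothetical, and this
is not how glibc actually evaluates LC_COLLATE.

```python
# Hypothetical case weights; only their relative order matters.
OLD = {"MIN": 0, "MIN-CAP": 1, "CAP": 2, "CAP-MIN": 3}          # before the patch
NEW = {"MIN-MIN": 0, "MIN-CAP": 1, "CAP-MIN": 2, "CAP-CAP": 3}  # after the patch


def case_symbol(digraph, patched):
    """Name the case symbol for a two-letter digraph such as 'cS'."""
    kinds = ["CAP" if ch.isupper() else "MIN" for ch in digraph]
    if kinds[0] == kinds[1] and not patched:
        # Before the patch, same-case digraphs reused the plain <MIN>/<CAP>.
        return kinds[0]
    return "-".join(kinds)


def sort_digraphs(words, patched):
    """Sort digraph spellings by their case weight only."""
    table = NEW if patched else OLD
    return sorted(words, key=lambda w: table[case_symbol(w, patched)])


print(sort_digraphs(["Cs", "CS", "cS", "cs"], patched=False))
print(sort_digraphs(["Cs", "CS", "cS", "cs"], patched=True))
```

The first call gives cs, cS, CS, Cs and the second gives cs, cS, Cs, CS,
matching the two orderings listed above.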
I've noticed another difference with respect to ICU:
- When a word appears both with and without a hyphen (pingpong and
ping-pong), glibc and ICU order the two forms differently. This probably applies to all
glibc locales. ICU probably changes ordering when selecting a
different algorithm for variable weighting: Perl gives glibc ordering
(hyphenated word before non-hyphenated word) for "Shifted" and
"Non-Ignorable", the opposite ordering for "Shift-Trimmed" and
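
The hyphen behaviour can be reproduced with a toy model of three of the
variable weighting options described in UTS #10. This is only a sketch
under stated assumptions: the primary weights are hypothetical (the hyphen
just needs to sort below every letter), levels 2 and 3 are left out because
they are identical for these two words, and it is not the code path that
ICU or Perl's Unicode::Collate actually use.

```python
HYPHEN_PRIMARY = 1   # hypothetical: any value below every letter's weight
MAX_L4 = 0xFFFF      # fourth-level weight given to non-variable characters


def primary(ch):
    """Hypothetical primary weight: hyphen first, then letters by code point."""
    return HYPHEN_PRIMARY if ch == "-" else 0x100 + ord(ch)


def sort_key(word, option):
    """Return (level-1 weights, level-4 weights) for the given option."""
    if option == "non-ignorable":
        # The hyphen is an ordinary character with a low primary weight.
        return ([primary(ch) for ch in word], [])
    # "shifted" and "shift-trimmed": the hyphen is ignored at level 1 and
    # its old primary weight moves to level 4.
    level1 = [primary(ch) for ch in word if ch != "-"]
    level4 = [primary(ch) if ch == "-" else MAX_L4 for ch in word]
    if option == "shift-trimmed":
        # Same as "shifted", but trailing FFFF weights are dropped.
        while level4 and level4[-1] == MAX_L4:
            level4.pop()
    return (level1, level4)


for option in ("non-ignorable", "shifted", "shift-trimmed"):
    words = sorted(["pingpong", "ping-pong"], key=lambda w: sort_key(w, option))
    print(option, words)
```

In this model "non-ignorable" and "shifted" put ping-pong first, while
"shift-trimmed" puts pingpong first, because its trimmed level-4 key is
empty and an empty key sorts before a non-empty one.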
So, to recap the other differences to ICU:
- ICU sorts long forms before short forms at L3. Perl collates as per
the input ordering. This can be changed in Egmont's patch by
reordering <COMPOUND> before <SINGLE_OR_COMPOUND>, but I'm not sure
- ICU doesn't recognize some mixed case combinations as
digraphs/trigraphs, e.g. cS is treated as
<c><s>;<BAS><BAS>;<MIN><CAP>, not as <cs>;<BAS>;<MIN-CAP>. Perl and
glibc recognize them. Looking at some historical files in the CLDR repo,
AIX and MS behaved as ICU, Sun JDK and IBM JDK behaved as glibc. I
haven't looked at the full CLDR repo. The following may be
interesting: http://unicode.org/cldr/trac/ticket/889 and
, recognition of those digraphs is still marked as unconfirmed draft
in the latest version of the file.
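
As an illustration of the mixed-case point, the following Python sketch
splits a word into collation elements with and without cS in the set of
recognized spellings of the digraph cs. The function name and the two sets
are hypothetical. The text above only says that ICU does not recognize cS,
so treating Cs and CS as recognized in the second set is an assumption made
for the example.

```python
def split_elements(word, recognized):
    """Greedily split word, contracting any two-letter spelling in recognized."""
    elements, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in recognized:
            elements.append(pair)      # one element, e.g. <cs>
            i += 2
        else:
            elements.append(word[i])   # separate letters, e.g. <c><s>
            i += 1
    return elements


ALL_CASES = {"cs", "cS", "Cs", "CS"}   # glibc/Perl-like: every case mix
NO_MIXED = ALL_CASES - {"cS"}          # ICU-like (assumed set): cS not contracted

print(split_elements("cSak", ALL_CASES))
print(split_elements("cSak", NO_MIXED))
```

With ALL_CASES the word starts with a single cS element; with NO_MIXED it
starts with c followed by S, which corresponds to the
<c><s>;<BAS><BAS>;<MIN><CAP> reading described above.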