This is the mail archive of the libc-locales
mailing list for the GNU libc locales project.
Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
- From: Luis Javier Merino <ninjalj at gmail dot com>
- To: Egmont Koblinger <egmont at gmail dot com>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, libc-locales <libc-locales at sourceware dot org>
- Date: Wed, 1 Feb 2017 17:00:47 +0100
- Subject: Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
- Authentication-results: sourceware.org; auth=none
I did some investigation of Hungarian collation for a code golf at codegolf.se.
Hungarian has digraphs and trigraphs (cs, dz, dzs, gy, ly, ny, sz, ty,
zs). It also has geminated (long) consonants, which are represented by
writing the consonant twice. Geminated digraphs/trigraphs can be
written in a long form (duplicate the whole digraph/trigraph, e.g.
szsz) or in a short form (duplicate only the first consonant of the
digraph/trigraph, e.g. ssz).
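As a rough illustration of the long and short spellings described above, here is a minimal Python sketch that splits a lowercase word into letters and expands a short geminate into two copies of the digraph/trigraph. The names MULTIGRAPHS and tokenize are hypothetical, the matching is purely greedy, and this is not how glibc or ICU implement it.

```python
# Hungarian multi-letter consonants, longest first so "dzs" wins over "dz".
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]

def tokenize(word):
    """Greedily split a lowercase word into letters (hypothetical helper)."""
    tokens = []
    i = 0
    while i < len(word):
        # Short geminate: only the first consonant is doubled,
        # e.g. "ssz" stands for "sz" + "sz".
        for g in MULTIGRAPHS:
            if word[i] == g[0] and word.startswith(g, i + 1):
                tokens += [g, g]
                i += 1 + len(g)
                break
        else:
            # Otherwise try a plain digraph/trigraph, then a single letter.
            for g in MULTIGRAPHS:
                if word.startswith(g, i):
                    tokens.append(g)
                    i += len(g)
                    break
            else:
                tokens.append(word[i])
                i += 1
    return tokens

print(tokenize("ssz"))   # ['sz', 'sz']
print(tokenize("szsz"))  # ['sz', 'sz']
```

Both spellings yield the same letter sequence, so any difference in how they sort has to come from a lower-priority tie-break.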
Not every occurrence of the consonants of a digraph/trigraph represents
a digraph/trigraph, e.g. in házszám (ház + szám) the zs doesn't
represent a digraph, but the sz does. This means you need a dictionary
or similar to get a
(nearly) fully correct collation. IIRC, LibreOffice uses libhnj, which
uses rules derived from a dictionary.
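To make the házszám problem concrete, the sketch below shows a purely greedy left-to-right matcher picking zs, and a lookup table with marked morpheme boundaries fixing it. The COMPOUNDS table, its "|" boundary notation, and the function names are assumptions made for this illustration; they are not taken from libhnj, glibc or ICU.

```python
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]

# Hypothetical exception list: compounds with the morpheme boundary marked.
COMPOUNDS = {"házszám": "ház|szám"}

def greedy(part):
    """Split text into letters, always taking the first multigraph that matches."""
    tokens, i = [], 0
    while i < len(part):
        for g in MULTIGRAPHS:
            if part.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            tokens.append(part[i])
            i += 1
    return tokens

def segment(word):
    """Split at known morpheme boundaries first, then match greedily."""
    marked = COMPOUNDS.get(word, word)
    tokens = []
    for part in marked.split("|"):
        tokens += greedy(part)
    return tokens

print(greedy("házszám"))   # ['h', 'á', 'zs', 'z', 'á', 'm']  (wrong: zs)
print(segment("házszám"))  # ['h', 'á', 'z', 'sz', 'á', 'm']  (z + sz)
```

Without such a table the matcher has no way to tell that the z belongs to ház and the sz to szám.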
These are the differences I noticed between Egmont's testsuite and ICU:
- Egmont collates the short forms before the long forms (ssz < szsz,
..., zzs < zszs), whereas ICU collates the long forms before the short
forms, with the difference first appearing at L3, Case and Variants
(szsz <3 ssz, ..., zszs <3 zzs). I
don't think that is specified in the grammar rules, but I can't read
- ICU treats weirdly capitalized groups as
non-contractions/non-digraphs/non-trigraphs, e.g.: ccS <3 CcS <3 cCs <3
cCS <3 CCs <3 cS <3 cs <3 Cs <3 CS <3 ccs <3 Ccs <3 CCS
I don't know which behavior comes from the CLDR, and which is specific to ICU.
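The first difference in the list can be modelled as a tie-break direction. The toy Python sketch below expands short forms so that both spellings compare equal at the top level, then lets a flag decide which spelling wins the tie. The names expand and sort_key and the short_first flag are hypothetical. The model collapses all collation levels into a single tie-break and compares the expanded strings by code point rather than by the Hungarian alphabet, so it reproduces neither glibc nor ICU.

```python
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]

def expand(word):
    """Rewrite short geminates as long ones, e.g. 'ssz' -> 'szsz'."""
    for g in MULTIGRAPHS:
        word = word.replace(g[0] + g, g + g)
    return word

def sort_key(word, short_first):
    """Toy key: expanded spelling first, then a short/long tie-break."""
    expanded = expand(word)
    is_short = expanded != word
    # short_first=True puts the short spelling first (as in Egmont's
    # testsuite); short_first=False puts the long spelling first (as ICU does).
    tie = 0 if is_short == short_first else 1
    return (expanded, tie)

words = ["szsz", "ssz", "zszs", "zzs"]
print(sorted(words, key=lambda w: sort_key(w, short_first=True)))
# ['ssz', 'szsz', 'zzs', 'zszs']
print(sorted(words, key=lambda w: sort_key(w, short_first=False)))
# ['szsz', 'ssz', 'zszs', 'zzs']
```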
(where I talk about glibc in the post at codegolf.se, I actually talk
about glibc with Egmont's patch, which I assumed would be merged