This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.


I investigated Hungarian collation for a code golf challenge at
http://codegolf.stackexchange.com/a/75599/267

Hungarian has eight digraphs and one trigraph (cs, dz, dzs, gy, ly, ny,
sz, ty, zs). It also has geminated (long) consonants, which are written
by doubling the consonant. A geminated digraph/trigraph can be written
in a long form (the whole digraph/trigraph is doubled, e.g. szsz) or in
a short form (only the first letter of the digraph/trigraph is doubled,
e.g. ssz).
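
A minimal Python sketch of the two spellings described above. The names
MULTIGRAPHS, long_form, short_form and SHORT_TO_LONG are made up for
illustration; nothing here is taken from glibc, ICU or the CLDR.

```python
# The nine Hungarian multigraphs listed above (eight digraphs, one trigraph).
MULTIGRAPHS = ("cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs")


def long_form(multigraph):
    """Geminate by doubling the whole multigraph, e.g. sz -> szsz."""
    return multigraph * 2


def short_form(multigraph):
    """Geminate by doubling only the first letter, e.g. sz -> ssz."""
    return multigraph[0] + multigraph


# Lookup table from each short spelling to the equivalent long spelling.
SHORT_TO_LONG = {short_form(m): long_form(m) for m in MULTIGRAPHS}

for short, full in SHORT_TO_LONG.items():
    print(f"{short} -> {full}")
```

The table only records how the two spellings correspond. It does not
decide whether a given letter sequence in a real word is a geminated
multigraph at all.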

Not every occurrence of these letter sequences is a digraph/trigraph.
E.g., in házszám (ház + szám) the zs is not a digraph, but the sz is.
This means you need a dictionary or something similar to get a
(nearly) fully correct collation. IIRC, LibreOffice uses libhnj, which
uses rules derived from a dictionary.
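
To illustrate why a dictionary is needed, here is a rough Python sketch
of a greedy splitter that gets házszám wrong, next to one that consults
a word list. BOUNDARIES is a hypothetical one-entry stand-in for a real
dictionary, and the sketch assumes lowercase, precomposed input. It is
not how glibc, ICU or libhnj work.

```python
# "dzs" comes before "dz" so that the trigraph wins over the digraph.
MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

# Hypothetical dictionary: compound words with "|" at the morpheme boundary.
BOUNDARIES = {"házszám": "ház|szám"}


def greedy_tokens(word):
    """Split a word into letters, always taking a multigraph when one matches."""
    tokens = []
    i = 0
    while i < len(word):
        for m in MULTIGRAPHS:
            if word.startswith(m, i):
                tokens.append(m)
                i += len(m)
                break
        else:
            # No multigraph starts here, so take a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def dictionary_tokens(word):
    """Split at known morpheme boundaries first, then tokenize each part."""
    parts = BOUNDARIES.get(word, word).split("|")
    return [token for part in parts for token in greedy_tokens(part)]


print(greedy_tokens("házszám"))      # finds zs, which is wrong for this word
print(dictionary_tokens("házszám"))  # finds sz, as intended
```

A word missing from the dictionary falls back to the greedy split, which
is where the "(nearly)" comes from.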

These are the differences I noticed between Egmont's testsuite and ICU:

 - Egmont collates the short forms before the long forms (ssz < szsz,
..., zzs < zszs), whereas ICU collates the long forms before the short
forms, with the difference starting at L3, Case and Variants
(szsz <3 ssz, ..., zszs <3 zzs). I don't think this order is specified
in the grammar rules, but I can't read Hungarian.

 - ICU treats unusually capitalized groups as
non-contractions/non-digraphs/non-trigraphs, e.g.: ccS <3 CcS <3 cCs <3
cCS <3 CCs <3 cS <3 cs <3 Cs <3 CS <3 ccs <3 Ccs <3 CCS

I don't know which behavior comes from the CLDR, and which is specific to ICU.
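
The capitalization pattern in the second item above can be sketched in
Python as follows. The rule (a group counts as a contraction only when
it is all lowercase, all uppercase, or has just its first letter
capitalized) is my guess from those twelve strings alone, not something
taken from the CLDR or ICU sources, and has_contraction_casing is a
made-up name.

```python
def has_contraction_casing(group):
    """Guess whether ICU would treat this casing of a multigraph as a contraction.

    Assumption: only lowercase (cs), title case (Cs) and uppercase (CS)
    spellings contract; any other mix of cases is treated as separate letters.
    """
    return group.islower() or group.isupper() or group == group.capitalize()


observed = ["ccS", "CcS", "cCs", "cCS", "CCs", "cS",
            "cs", "Cs", "CS", "ccs", "Ccs", "CCS"]

for group in observed:
    kind = "contraction" if has_contraction_casing(group) else "separate letters"
    print(f"{group}: {kind}")
```

Under that guess, the first six strings of the ICU order are the
non-contractions and the last six are the contractions.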

(Where I talk about glibc in the post at codegolf.se, I actually mean
glibc with Egmont's patch, which I assumed would be merged soon.)

