This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/1430] New: regression: worse collation for hu_HU


Please revert libc/localedata/locales/hu_HU revision 1.18, "Better collation".
It is not better, it is worse.

According to the Hungarian rules, aacute, eacute, iacute, oacute and uacute
must be treated the same as their unaccented counterparts; likewise, vowels
with diaeresis should be treated the same as their counterparts with double
acutes. In other words:
a = á < e = é < i = í < o = ó < ö = ő < u = ú < ü = ű

For example, the following is a correct alphabetical order:
ablak
állat
apa
áru
az

Vowels within one equivalence class only make a difference if they are the
only letters in which two words differ, e.g.:
Eger
egér
éger
eget
éget
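
As a rough illustration of the rule described above, here is a minimal Python
sketch of a two-level sort key: words are first compared by equivalence class,
and the accents only break ties. The names CLASSES and hu_key are made up for
this example, and the sketch deliberately ignores case and the Hungarian
digraphs (cs, gy, sz, ...), so it is not a substitute for the real locale
definition.

```python
# Equivalence classes in primary order. Within a class the first letter
# sorts before the second when everything else is equal.
# Assumption: digraphs such as "cs" or "gy" are not handled here.
CLASSES = ["aá", "b", "c", "d", "eé", "f", "g", "h", "ií", "j", "k", "l",
           "m", "n", "oó", "öő", "p", "q", "r", "s", "t", "uú", "üű", "v",
           "w", "x", "y", "z"]

# Primary weight: index of the class. Secondary weight: position in the class.
PRIMARY = {ch: rank for rank, cls in enumerate(CLASSES) for ch in cls}
SECONDARY = {ch: pos for cls in CLASSES for pos, ch in enumerate(cls)}


def hu_key(word):
    """Return a (primary, secondary) key; case is ignored in this sketch."""
    letters = word.lower()
    # Unknown characters sort after all known letters.
    primary = tuple(PRIMARY.get(ch, len(CLASSES)) for ch in letters)
    secondary = tuple(SECONDARY.get(ch, 0) for ch in letters)
    return (primary, secondary)


if __name__ == "__main__":
    print(sorted(["az", "áru", "apa", "állat", "ablak"], key=hu_key))
    print(sorted(["éget", "eget", "éger", "egér", "Eger"], key=hu_key))
```

With this key, both example lists above come out in the order shown.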

This was correctly implemented in the previous version, and it is also
described in some comment lines within this file (the comment is still there,
although it no longer corresponds to what is implemented right now).

I don't know who suggested the modifications of 1.18 or why, but he was surely
wrong. If needed, I can scan some pages of dictionaries or phone books and
upload them to prove these sorting rules.

If someone just happens to prefer sorting this way, then he is of course
absolutely free to create his own locale, or set LC_COLLATE=C or something
similar, but there's hardly any place for that work in glibc. Glibc should
follow the national rules, and r1.18 was a move against them.


Ulrich, if I recall correctly, some years ago it was you to whom I sent the
hu_HU sorting rules which fixed some bugs. Then you asked me to manually
sort a lot of words you had previously received from some other Hungarian guy
and test whether glibc sorts them in the same order. Glibc with those
Hungarian collating rules passed that test, but the new rules would obviously
fail it. Do you happen to still have that file? (I don't think I have
it, but I'll take a look.)

I guess it would be a really wise move to put sorted files of this kind into
glibc's source and add a sorting test case for them.
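
Such a test could look roughly like the following Python sketch. It walks a
list that is already in the expected order and reports the first adjacent pair
that the collation under test puts the other way round. The function names are
invented for this example, the locale name "hu_HU.UTF-8" is an assumption about
what is installed on the system, and this is not how glibc's own test suite is
organized.

```python
import locale


def first_misordered(words, compare=locale.strcoll):
    """Return the first adjacent pair (a, b) with compare(a, b) > 0, else None."""
    for a, b in zip(words, words[1:]):
        if compare(a, b) > 0:
            return (a, b)
    return None


def check_with_locale(words, name="hu_HU.UTF-8"):
    """Check words against the named locale's collation.

    Returns "skipped" if the locale is not installed (an assumption of this
    sketch), otherwise the result of first_misordered().
    """
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return "skipped"
    return first_misordered(words)


if __name__ == "__main__":
    expected = ["Eger", "egér", "éger", "eget", "éget"]
    print(check_with_locale(expected))
```

A result of None means the locale agrees with the hand-sorted list; a pair
points at the first place where the two disagree.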


Ps1: a and á, as well as e and é, are different sounds, so it's often argued
whether it's logical to put them in the same group; this is a tradition rather
than a logical decision. On the other hand, i and í, o and ó, ö and ő, u and ú,
and finally ü and ű are the same sounds, with the latter ones pronounced longer.
Crosswords and similar stuff treat a and á, and e and é differently, while the
other pairs are interchangeable there. But alphabetical sorting uses different
rules.

Ps2: All the words above in the examples are real Hungarian words.

-- 
           Summary: regression: worse collation for hu_HU
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales at sources dot redhat dot com
        ReportedBy: egmont at uhulinux dot hu
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=1430

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

