This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [Patch 0/13] [BZ #14095] update collation data from Unicode / ISO 14651
Mike FABIAN <mfabian@redhat.com> wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
>
>> On Fri, 26 Jan 2018, Mike FABIAN wrote:
>>
>>> [BZ #14095] - Review / update collation data from Unicode / ISO 14651
>>>
>>> Updating this file alone is not enough: there are problems in the new
>>> file which need to be fixed and the collation rules for many locales
>>> need to be adapted. This is done by the following patches.
>>>
>>> This update also fixes the problem that many characters are treated as
>>> identical when sorting because they were not yet in the old
>>> iso14651_t1_common file, see:
>>
>> To be clear: do you mean it fixes it *for the characters in the Unicode
>> version supported by these updated collation data*? Or globally for all
>> characters including those not yet defined or too new for that collation
>> data?
>
> The former: it fixes it only for the characters which are in this
> updated collation data, i.e. for all characters up to Unicode 8.0.0.
> All characters added after Unicode 8.0.0 or still undefined will still
> have that problem.
>
>> In the various cases where collation data has been changed locally since
>> the previous import from ISO 14651, are those local changes all obsoleted
>> by subsequent changes to the ISO 14651 collation data?
>
> The improvements mentioned in this comment in the old iso14651_t1_common
> file are obsoleted by the new file:
>
> # IMPROVEMENTS:
> #
> # 1. converted to UTF-8 (for comments)
> # 2. added Armenian script block, with proper sorting
> # 3. added Tifinagh script block
> # 4. added a whole lot of Latin script letters, so they are "properly"
> # sorted (not at random positions before "0" or after "z", but, for
> # example, "e with dot below" sorted as "e", etc.
> # 5. added definitions of extra latin diacritics; otherwise it is not possible
> # to differentiate enough, for example "d caron" vs "d caron below"
> # those extra diacritics are:
> # <HOK> # hook above (vietnamese tone mark)
> # <DGR> # double grave
> # <IBR> # inverted breve
> # <BPT> # dot below (vietnamese tone mark)
> # <BRU> # diaeresis below
> # <BRN> # ring below
> # <BCI> # circumflex below
> # <BTI> # tilde below
> # <BBR> # breve below
> # <BMA> # macron below
> # <CRL> # curled letter/letter with hook
> # 6. when a character has two diacritics the second one is referenced too, eg:
> # <U1EA5> <a>;<CIR>;<MIN>;<ACA> # ấ (a with circumflex and acute)
> # <U1EA9> <a>;<CIR>;<MIN>;<HOK> # ẩ (a with circumflex and hook)
> # ^1st^ ^2nd^
> # that allows differenciating between those two, but also, they get
> # sorted along with "a circumflex", which is nicer.
> # 7. digraphs (as opposed to ligatures) are made synonyms of their
> # base letters (encoding digraphs is considered obsolete unicode behaviour
> # anyway); that is, the composing parts are "<BAS>" (or whatever diacritic
> # there may be) and not "<LIG>"; compare "ae", a ligature:
> # <U00E6> "<a><e>";"<LIG><LIG>";"<MIN><MIN>";IGNORE # 230 æ
> # with "ij", a digraph:
> # <U0133> "<i><j>";"<BAS><BAS>";"<MIN><MIN>";IGNORE # 329 <ij>
> # that means that "<a><e>" won't be seen as a synonym of "<ae>", but that
> # "<i><j>" will be a synonym of "<ij>"
> # 8. t/s with cedilla and t/s with comma below are made synonyms
> # 9. added various new cyrillic letters
> # 10. put <PCL> and <LIG> after all diacritics (as that often is used for
> # chars that change more)
> #
> # 2005-11-29, Pablo Saratxaga <pablo@mandriva.com>
>
> We had some modifications to the sorting rules in the individual locale
> files, for example a special handling of space for Polish:
>
> Bug 388 - localedata/locales/pl_PL has incorrect LC_COLLATE <space> handling
> https://sourceware.org/bugzilla/show_bug.cgi?id=388
>
> This is not in the CLDR collation rules for Polish. Whenever I found
> something special in our collation data which CLDR did not do,
> I kept it. Another example is the uppercase-first sorting for Estonian;
> after applying these patches we have in et_EE:
>
> % Uppercase first:
> % (This is not in the CLDR rules, but the old et_EE locale before I based
> % the collation on iso14651_t1 did uppercase first. I don’t know whether
> % there is a good reason for this, but let’s keep it for the moment.
> % This reimplementation of the Estonian sorting just reproduces the same
> % order as before (except fixing some bugs,
> % see: https://sourceware.org/bugzilla/show_bug.cgi?id=22517#c1)).
> reorder-after <RES-1>
> <CAP>
> <MIN>
>
> CLDR sorts upper case first *only* for these 3 locales:
>
> $ grep 'caseFirst upper' *
> cu.xml:[caseFirst upper]
> da.xml: [caseFirst upper]
> mt.xml:[caseFirst upper] # DMS MSA 200:2009
>
> So for Estonian, upper case should probably not be sorted first. But I
> kept the old behaviour of sorting upper case first for the moment
> because our Estonian locale always did this. After checking with
> Estonian native speakers and/or standards, we should either drop
> this and make it the same as CLDR or report a bug against CLDR.
>
> CLDR often sorts the native script for a locale first (i.e. before
> Latin). In CLDR this is indicated by something like:
>
> [reorder Cyrl]
>
> to put Cyrillic script first. Currently this is quite difficult to do in
> glibc, therefore I did this only for the locales which already did it,
> for example uk_UA now has:
>
> % Put Cyrillic before Latin because CLDR has:
> %
> % [reorder Cyrl]
> %
> % and because the old glibc collation for Ukrainian also did put
> % Cyrillic before Latin.
> %
> % I copied the whole Cyrillic block from iso14651_t1_common here.
> %
> % I cannot find any better way doing this.
> reorder-after <BEFORE-LATIN>
> <S0430> % CYRILLIC SMALL LETTER A
> <S04D9> % CYRILLIC SMALL LETTER SCHWA
> <S04D5> % CYRILLIC SMALL LIGATURE A IE
> ...
> list all cyrillic characters in the correct order for Ukrainian here
> ...
>
> glibc seems to have a feature to make this easier, but unfortunately
> it doesn’t seem to work at the moment. In ber_MA I have added these
> lines (commented out) as a reminder that there is such a feature which
> would be useful if it could be fixed:
>
> % “reorder-sections-after” unfortunately does not seem to work.
> % Moroccan sorting standard requires tifinagh to come
> % before latin script
> %reorder-sections-after <SPECIAL>
> %<TIFINAGH>
> %reorder-sections-end
>
>
> I added such section names to the updated iso14651_t1_common file
> already, so as soon as I can figure out why "reorder-sections-after"
> does not work, I can update the locales which use [reorder SomeScript]
> in CLDR to do the same in glibc.
A few locales already had a test file, like localedata/si_LK.UTF-8.in
(unfortunately, not many locales had one). For locales with existing
test files, I made sure the file still sorts the same after the update
of iso14651_t1_common. For locales like si_LK, this helped a lot.

For all locales where I had to touch the collation and no test file
existed, I added one.
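
The kind of check described above can be approximated outside the glibc
test suite. What follows is only a minimal sketch, not the actual glibc
test harness: it assumes the test file lists its lines in the expected
collation order, that the named locale is installed on the system, and
the helper names (first_unsorted_index, check_file) are hypothetical.

```python
import locale
from typing import List, Optional


def first_unsorted_index(lines: List[str]) -> Optional[int]:
    """Return the index of the first line that collates before its
    predecessor under the current LC_COLLATE locale, or None if the
    whole list is already in collation order."""
    for i in range(1, len(lines)):
        # strcoll() > 0 means the previous line sorts after this one.
        if locale.strcoll(lines[i - 1], lines[i]) > 0:
            return i
    return None


def check_file(path: str, locale_name: str) -> Optional[int]:
    """Hypothetical helper: switch LC_COLLATE to locale_name, read the
    UTF-8 file at path, and report the first out-of-order line index."""
    # Raises locale.Error if the locale is not installed.
    locale.setlocale(locale.LC_COLLATE, locale_name)
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    return first_unsorted_index(lines)
```

For example, check_file("localedata/si_LK.UTF-8.in", "si_LK.UTF-8")
would return None as long as the file is still in order under the
installed si_LK locale. Lines that compare equal are accepted, so this
sketch does not detect characters that are wrongly treated as identical.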
--
Mike FABIAN <mfabian@redhat.com>