This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [Patch 0/13] [BZ #14095] update collation data from Unicode / ISO 14651


Joseph Myers <joseph@codesourcery.com> wrote:

> On Fri, 26 Jan 2018, Mike FABIAN wrote:
>
>> [BZ #14095] - Review / update collation data from Unicode / ISO 14651
>> 
>> Updating this file alone is not enough, there are problems in the new
>> file which need to be fixed and the collation rules for many locales
>> need to be adapted. This is done by the following patches.
>> 
>> This update also fixes the problem that many characters are treated as
>> identical when sorting because they were not yet in the old
>> iso14651_t1_common file, see:
>
> To be clear: do you mean it fixes it *for the characters in the Unicode 
> version supported by these updated collation data*?  Or globally for all 
> characters including those not yet defined or too new for that collation 
> data?

Yes, it fixes it only for the characters which are in this updated
collation data, i.e. for all characters up to Unicode 8.0.0. All
characters added after Unicode 8.0.0 or still undefined will still have
that problem.
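
To illustrate the effect, here is a toy model in Python. It is not glibc
code: the weight table, the `sort_key` helper, and the rule that
characters missing from the table contribute nothing to the key are all
assumptions made for this sketch.

```python
# Toy model, not glibc code: each known character maps to one primary weight.
WEIGHTS = {"a": 1, "b": 2, "c": 3}


def sort_key(text, weights=WEIGHTS):
    """Build a sort key from the weights of known characters only.

    Characters missing from the table contribute nothing to the key.
    That is the assumption this sketch makes about undefined characters.
    """
    return tuple(weights[ch] for ch in text if ch in weights)


# "x" and "y" stand in for characters that are absent from the table,
# so the two strings get the same key and compare as identical.
print(sort_key("ax") == sort_key("ay"))  # True

# Once the table is extended, the two strings become distinguishable.
updated = dict(WEIGHTS, x=4, y=5)
print(sort_key("ax", updated) == sort_key("ay", updated))  # False
```

Any character that is still missing from the table after the update
keeps the old behaviour in this model, which mirrors the situation for
characters added after Unicode 8.0.0.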

> In the various cases where collation data has been changed locally since 
> the previous import from ISO 14651, are those local changes all obsoleted 
> by subsequent changes to the ISO 14651 collation data?

The improvements mentioned in this comment in the old iso14651_t1_common
file are obsoleted by the new file:

     # IMPROVEMENTS:
     #
     # 1. converted to UTF-8 (for comments)
     # 2. added Armenian script block, with proper sorting
     # 3. added Tifinagh script block
     # 4. added a whole lot of Latin script letters, so they are "properly"
     #    sorted (not at random positions before "0" or after "z", but, for
     #    example, "e with dot below" sorted as "e", etc.
     # 5. added definitions of extra latin diacritics; otherwise it is not possible
     #    to differentiate enough, for example "d caron" vs "d caron below"
     #    those extra diacritics are:
     #    <HOK> # hook above (vietnamese tone mark)
     #    <DGR> # double grave
     #    <IBR> # inverted breve
     #    <BPT> # dot below (vietnamese tone mark)
     #    <BRU> # diaeresis below
     #    <BRN> # ring below
     #    <BCI> # circumflex below
     #    <BTI> # tilde below
     #    <BBR> # breve below
     #    <BMA> # macron below
     #    <CRL> # curled letter/letter with hook
     # 6. when a character has two diacritics the second one is referenced too, eg:
     #    <U1EA5> <a>;<CIR>;<MIN>;<ACA> # ấ (a with circumflex and acute)
     #    <U1EA9> <a>;<CIR>;<MIN>;<HOK> # ẩ (a with circumflex and hook)
     #                ^1st^       ^2nd^
     #    that allows differenciating between those two, but also, they get
     #    sorted along with "a circumflex", which is nicer.
     # 7. digraphs (as opposed to ligatures) are made synonyms of their
     #    base letters (encoding digraphs is considered obsolete unicode behaviour
     #    anyway); that is, the composing parts are "<BAS>" (or whatever diacritic
     #    there may be) and not "<LIG>"; compare "ae", a ligature:
     #    <U00E6> "<a><e>";"<LIG><LIG>";"<MIN><MIN>";IGNORE # 230 æ
     #    with "ij", a digraph:
     #    <U0133> "<i><j>";"<BAS><BAS>";"<MIN><MIN>";IGNORE # 329 <ij>
     #    that means that "<a><e>" won't be seen as a synonym of "<ae>", but that
     #    "<i><j>" will be a synonym of "<ij>"
     # 8. t/s with cedilla and t/s with comma below are made synonyms
     # 9. added various new cyrillic letters
     # 10. put <PCL> and <LIG> after all diacritics (as that often is used for
     #     chars that change more)
     #
     # 2005-11-29, Pablo Saratxaga <pablo@mandriva.com>

We had some modifications to the sorting rules in the individual locale
files, for example a special handling of space for Polish:

Bug 388 - localedata/locales/pl_PL has incorrect LC_COLLATE <space> handling
https://sourceware.org/bugzilla/show_bug.cgi?id=388

This is not in the CLDR collation rules for Polish. Whenever I found
something special like this in our collation data which CLDR does not do,
I kept it. Another example is the uppercase-first sorting for Estonian.
After applying these patches we have in et_EE:

    % Uppercase first:
    % (This is not in the CLDR rules, but the old et_EE locale before I based
    % the collation on iso_41651_t1 did uppercase first. I don’t know whether
    % there is a good reason for this, but let’s keep it for the moment.
    % This reimplementation of the Estonian sorting just reproduces the same
    % order as before (except fixing some bugs,
    % see: https://sourceware.org/bugzilla/show_bug.cgi?id=22517#c1)).
    reorder-after <RES-1>
    <CAP>
    <MIN>

CLDR sorts upper case first *only* for these 3 locales:

    $ grep 'caseFirst upper' *
    cu.xml:[caseFirst upper]
    da.xml:                                 [caseFirst upper]
    mt.xml:[caseFirst upper]  # DMS MSA 200:2009

So for Estonian, upper case probably should not be sorted first. But I
kept the old behaviour of sorting upper case first for the moment
because our Estonian locale has always done this. After checking with
Estonian native speakers and/or standards, we should either drop
this and make it the same as CLDR, or report a bug against CLDR.
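
As a rough sketch of what the difference means in practice, the Python
below compares an uppercase-first and a lowercase-first tie-break. It is
a simplification and not how glibc evaluates <CAP> and <MIN>: the two
key functions are hypothetical, the primary level is just the lowercased
word, and the accent level is left out entirely.

```python
def key_upper_first(word):
    # Primary level: compare letters without regard to case.
    primary = word.lower()
    # Case level: 0 for uppercase, 1 otherwise, so uppercase wins ties.
    case = tuple(0 if ch.isupper() else 1 for ch in word)
    return (primary, case)


def key_lower_first(word):
    # Same primary level, but lowercase wins ties at the case level.
    primary = word.lower()
    case = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, case)


words = ["tee", "Tee", "tea", "Tea"]
print(sorted(words, key=key_upper_first))  # ['Tea', 'tea', 'Tee', 'tee']
print(sorted(words, key=key_lower_first))  # ['tea', 'Tea', 'tee', 'Tee']
```

Case only decides between words that are otherwise equal, so switching
Estonian to the CLDR behaviour would only swap such pairs.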

CLDR often sorts the native script for a locale first (i.e. before
Latin). In CLDR this is indicated by something like:

     [reorder Cyrl]

to put Cyrillic script first. Currently this is quite difficult to do in
glibc, so I did this only for the locales which already did it.
For example, uk_UA now has:

    % Put Cyrillic before Latin because CLDR has:
    %
    % [reorder Cyrl]
    %
    % and because the old glibc collation for Ukrainian also did put
    % Cyrillic before Latin.
    %
    % I copied the whole Cyrillic block from iso14651_t1_common here.
    %
    % I cannot find any better way doing this. 
    reorder-after <BEFORE-LATIN>
    <S0430> % CYRILLIC SMALL LETTER A
    <S04D9> % CYRILLIC SMALL LETTER SCHWA
    <S04D5> % CYRILLIC SMALL LIGATURE A IE
    ...
    list all cyrillic characters in the correct order for Ukrainian here
    ...
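
The intended effect of putting a script first can be sketched in Python
as follows. This is only an illustration: `script_rank` and
`reorder_key` are hypothetical helpers, the script test is a rough check
on the Unicode character name, and the order within each script is plain
code point order rather than the real Ukrainian order.

```python
import unicodedata


def script_rank(ch, first_script="CYRILLIC"):
    # Rough script test based on the Unicode character name; this is an
    # assumption of the sketch, not how glibc or CLDR identify scripts.
    name = unicodedata.name(ch, "")
    return 0 if name.startswith(first_script) else 1


def reorder_key(word, first_script="CYRILLIC"):
    # Characters of the preferred script sort before all others; inside
    # a script the order falls back to code points.
    return tuple((script_rank(ch, first_script), ch) for ch in word)


words = ["zebra", "яблуко", "apple", "вода"]
print(sorted(words))                   # ['apple', 'zebra', 'вода', 'яблуко']
print(sorted(words, key=reorder_key))  # ['вода', 'яблуко', 'apple', 'zebra']
```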

glibc seems to have a feature to make this easier, but unfortunately
it doesn’t seem to work at the moment. In ber_MA I have added these
lines (commented out) as a reminder that there is such a feature which
would be useful if it could be fixed:

    % “reorder-sections-after” unfortunately does not seem to work.
    % Moroccan sorting standard requires tifinagh to come
    % before latin script
    %reorder-sections-after <SPECIAL>
    %<TIFINAGH>
    %reorder-sections-end


I added such section names to the updated iso14651_t1_common file
already, so as soon as I can figure out why "reorder-sections-after"
does not work, I can update the locales which use [reorder SomeScript]
in CLDR to do the same in glibc.

-- 
Mike FABIAN <mfabian@redhat.com>

