This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [Patch 0/13] [BZ #14095] update collation data from Unicode / ISO 14651


Mike FABIAN <mfabian@redhat.com> wrote:

> Joseph Myers <joseph@codesourcery.com> wrote:
>
>> On Fri, 26 Jan 2018, Mike FABIAN wrote:
>>
>>> [BZ #14095] - Review / update collation data from Unicode / ISO 14651
>>> 
>>> Updating this file alone is not enough: there are problems in the new
>>> file which need to be fixed and the collation rules for many locales
>>> need to be adapted. This is done by the following patches.
>>> 
>>> This update also fixes the problem that many characters are treated as
>>> identical when sorting because they were not yet in the old
>>> iso14651_t1_common file, see:
>>
>> To be clear: do you mean it fixes it *for the characters in the Unicode 
>> version supported by these updated collation data*?  Or globally for all 
>> characters including those not yet defined or too new for that collation 
>> data?
>
> Yes, it fixes it only for the characters which are in this updated
> collation data, i.e. for all characters up to Unicode 8.0.0. All
> characters added after Unicode 8.0.0 or still undefined will still have
> that problem.
>
>> In the various cases where collation data has been changed locally since 
>> the previous import from ISO 14651, are those local changes all obsoleted 
>> by subsequent changes to the ISO 14651 collation data?
>
> The improvements mentioned in this comment in the old iso14651_t1_common
> file are obsoleted by the new file:
>
>      # IMPROVEMENTS:
>      #
>      # 1. converted to UTF-8 (for comments)
>      # 2. added Armenian script block, with proper sorting
>      # 3. added Tifinagh script block
>      # 4. added a whole lot of Latin script letters, so they are "properly"
>      #    sorted (not at random positions before "0" or after "z", but, for
>      #    example, "e with dot below" sorted as "e", etc.
>      # 5. added definitions of extra latin diacritics; otherwise it is not possible
>      #    to differentiate enough, for example "d caron" vs "d caron below"
>      #    those extra diacritics are:
>      #    <HOK> # hook above (vietnamese tone mark)
>      #    <DGR> # double grave
>      #    <IBR> # inverted breve
>      #    <BPT> # dot below (vietnamese tone mark)
>      #    <BRU> # diaeresis below
>      #    <BRN> # ring below
>      #    <BCI> # circumflex below
>      #    <BTI> # tilde below
>      #    <BBR> # breve below
>      #    <BMA> # macron below
>      #    <CRL> # curled letter/letter with hook
>      # 6. when a character has two diacritics the second one is referenced too, eg:
>      #    <U1EA5> <a>;<CIR>;<MIN>;<ACA> # ấ (a with circumflex and acute)
>      #    <U1EA9> <a>;<CIR>;<MIN>;<HOK> # ẩ (a with circumflex and hook)
>      #                ^1st^       ^2nd^
>      #    that allows differenciating between those two, but also, they get
>      #    sorted along with "a circumflex", which is nicer.
>      # 7. digraphs (as opposed to ligatures) are made synonyms of their
>      #    base letters (encoding digraphs is considered obsolete unicode behaviour
>      #    anyway); that is, the composing parts are "<BAS>" (or whatever diacritic
>      #    there may be) and not "<LIG>"; compare "ae", a ligature:
>      #    <U00E6> "<a><e>";"<LIG><LIG>";"<MIN><MIN>";IGNORE # 230 æ
>      #    with "ij", a digraph:
>      #    <U0133> "<i><j>";"<BAS><BAS>";"<MIN><MIN>";IGNORE # 329 <ij>
>      #    that means that "<a><e>" won't be seen as a synonym of "<ae>", but that
>      #    "<i><j>" will be a synonym of "<ij>"
>      # 8. t/s with cedilla and t/s with comma below are made synonyms
>      # 9. added various new cyrillic letters
>      # 10. put <PCL> and <LIG> after all diacritics (as that often is used for
>      #     chars that change more)
>      #
>      # 2005-11-29, Pablo Saratxaga <pablo@mandriva.com>
>
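To illustrate item 6 above, here is a minimal sketch of a multi-level
comparison in which the first diacritic is weighted at level 2 and the
second one at level 4. The numeric weights and the TABLE and sort_key names
are made up for this illustration; they are not taken from
iso14651_t1_common, and IGNORE is modelled as a weight of 0 rather than as
"no weight emitted".

```python
# Illustrative weights only; the numeric values are assumptions, not glibc's.
BAS, CIR = 1, 2              # level 2: base letter, circumflex
MIN = 1                      # level 3: lowercase
IGNORE, ACA, HOK = 0, 1, 2   # level 4: nothing, acute, hook above

# One row per character: (level 1, level 2, level 3, level 4)
TABLE = {
    "a": (1, BAS, MIN, IGNORE),
    "\u00e2": (1, CIR, MIN, IGNORE),  # a with circumflex
    "\u1ea5": (1, CIR, MIN, ACA),     # a with circumflex and acute
    "\u1ea9": (1, CIR, MIN, HOK),     # a with circumflex and hook above
    "b": (2, BAS, MIN, IGNORE),
}


def sort_key(text):
    """Build a key that compares all level-1 weights first, then level 2, etc."""
    rows = [TABLE[ch] for ch in text]
    return tuple(tuple(row[level] for row in rows) for level in range(4))


words = ["b", "\u1ea9", "a", "\u1ea5", "\u00e2"]
result = sorted(words, key=sort_key)
print(result)
```

With these weights the two doubly accented letters differ only at level 4,
so they stay next to "a with circumflex" and before "b".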
> We had some modifications to the sorting rules in the individual locale
> files, for example a special handling of space for Polish:
>
> Bug 388 - localedata/locales/pl_PL has incorrect LC_COLLATE <space> handling
> https://sourceware.org/bugzilla/show_bug.cgi?id=388
>
> This is not in the CLDR collation rules for Polish. Whenever I found
> something special in our collation data that CLDR did not do,
> I kept it. Another example is the uppercase-first sorting for Estonian.
> After applying these patches we have in et_EE:
>
>     % Uppercase first:
>     % (This is not in the CLDR rules, but the old et_EE locale before I based
>     % the collation on iso_41651_t1 did uppercase first. I don’t know whether
>     % there is a good reason for this, but let’s keep it for the moment.
>     % This reimplementation of the Estonian sorting just reproduces the same
>     % order as before (except fixing some bugs,
>     % see: https://sourceware.org/bugzilla/show_bug.cgi?id=22517#c1)).
>     reorder-after <RES-1>
>     <CAP>
>     <MIN>
>
> CLDR sorts upper case first *only* for these 3 locales:
>
>     $ grep 'caseFirst upper' *
>     cu.xml:[caseFirst upper]
>     da.xml:                                 [caseFirst upper]
>     mt.xml:[caseFirst upper]  # DMS MSA 200:2009
>
> So for Estonian probably upper case should not be sorted first.  But I
> kept the old behaviour of sorting upper case first for the moment
> because our Estonian locale always did this. After checking with
> Estonian native speakers and/or standards we should either drop
> this and make it the same as CLDR or report a bug against CLDR.
>
> CLDR often sorts the native script for a locale first (i.e. before
> Latin). In CLDR this is indicated by something like:
>
>      [reorder Cyrl]
>
> to put Cyrillic script first. Currently this is quite difficult to do in
> glibc, so I did this only for the locales which already did it.
> For example, uk_UA now has:
>
>     % Put Cyrillic before Latin because CLDR has:
>     %
>     % [reorder Cyrl]
>     %
>     % and because the old glibc collation for Ukrainian also did put
>     % Cyrillic before Latin.
>     %
>     % I copied the whole Cyrillic block from iso14651_t1_common here.
>     %
>     % I cannot find any better way doing this. 
>     reorder-after <BEFORE-LATIN>
>     <S0430> % CYRILLIC SMALL LETTER A
>     <S04D9> % CYRILLIC SMALL LETTER SCHWA
>     <S04D5> % CYRILLIC SMALL LIGATURE A IE
>     ...
>     list all cyrillic characters in the correct order for Ukrainian here
>     ...
>
> glibc seems to have a feature to make this easier, but unfortunately
> it doesn’t seem to work at the moment. In ber_MA I have added these
> lines (commented out) as a reminder that there is such a feature which
> would be useful if it could be fixed:
>
>     % “reorder-sections-after” unfortunately does not seem to work.
>     % Moroccan sorting standard requires tifinagh to come
>     % before latin script
>     %reorder-sections-after <SPECIAL>
>     %<TIFINAGH>
>     %reorder-sections-end
>
>
> I added such section names to the updated iso14651_t1_common file
> already, so as soon as I can figure out why "reorder-sections-after"
> does not work, I can update the locales which use [reorder SomeScript]
> in CLDR to do the same in glibc.
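
As a rough illustration of what "[reorder Cyrl]" is meant to achieve, the
sketch below sorts Cyrillic before Latin. It is not how glibc or CLDR
implement reordering: the script detection via Unicode character names and
the use of code point order inside each script are simplifying assumptions,
and script_rank and cyrillic_first_key are hypothetical names.

```python
import unicodedata


def script_rank(ch):
    """Rank Cyrillic characters before everything else.

    Script detection by Unicode character name is a crude assumption
    made for this sketch only.
    """
    return 0 if unicodedata.name(ch, "").startswith("CYRILLIC") else 1


def cyrillic_first_key(text):
    """Sort key: script rank first, then plain code point order."""
    return [(script_rank(ch), ord(ch)) for ch in text]


# Two Latin letters and two Cyrillic letters (U+0430, U+0431)
words = ["b", "\u0431", "a", "\u0430"]
result = sorted(words, key=cyrillic_first_key)
print(result)
```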

A few locales already had a test file such as localedata/si_LK.UTF-8.in
(unfortunately, not many). For those locales I made sure the test file
still sorts the same way after the update of iso14651_t1_common, which
helped a lot.

For all locales where I had to touch the collation and no test file
existed, I added one.
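
A check of this kind can be sketched as follows. This is not the glibc test
harness, only a minimal illustration: it assumes the test file lists one
string per line in the expected order, that the named locale is installed
on the system, and the function names are hypothetical.

```python
import locale


def first_unsorted_pair(lines, key):
    """Return the first adjacent pair that is out of order under key, or None."""
    for a, b in zip(lines, lines[1:]):
        if key(a) > key(b):
            return (a, b)
    return None


def check_file_sorted(path, locale_name):
    """Check that the lines of path are in order under locale_name's collation.

    Raises locale.Error if the locale is not installed.
    """
    locale.setlocale(locale.LC_COLLATE, locale_name)
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return first_unsorted_pair(lines, locale.strxfrm)
```

For example, check_file_sorted("localedata/si_LK.UTF-8.in", "si_LK.UTF-8")
would return None if every line collates at or after the previous one, and
the offending pair otherwise.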

-- 
Mike FABIAN <mfabian@redhat.com>

