This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/16061] Review / update transliteration data


https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Marko Myllynen from comment #0)
> The localedata/locales/translit_* files are probably, based on comments in
> them, at least partially generated from some version of UnicodeData.txt
> (based on 93a568 it looks like the last major update has been for Unicode
> 3.2 and 17b16e suggests them originally coming from an external
> contributor). However, there are some characters missing even from the
> Latin-1 Supplement block and in general it doesn't seem possible to update
> the files just by using UnicodeData.txt. Some of the rules live in
> locale/C-translit.h / locale/C-translit.h.in which also contain local
> changes (like 61d5a6 / 2a81ea).

C-translit.h.in seems to be manually edited and not generated from
Unicode data.

> It requires likely a lot of work to understand how the files have been
> generated in the first place, how to identify relevant local changes, and
> how to automate the process to update them in the future.

These files seem to be automatically generated with some manual additions:

    locales/translit_circle
    locales/translit_cjk_compat
    locales/translit_combining
    locales/translit_compat
    locales/translit_font  
    locales/translit_fraction

My patch updates them automatically from UnicodeData.txt, keeping
the manual additions wherever they seem to make sense.

    locales/translit_neutral

is apparently manually edited and not generated.

    locales/translit_cjk_variants

is not generated from Unicode data either but from a UniVariants.Z
file which can still be found here:

http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z

It is from 2002-08-15 and I have no idea how it was created,
so I did not touch translit_cjk_variants.

The following files

    locales/translit_hangul
    locales/translit_narrow
    locales/translit_small
    locales/translit_wide

are automatically generated, but regenerating them from Unicode 7.0.0
data would just reproduce the files as they are now; there have been
no updates. Therefore I didn't write generator scripts for these.
Would generator scripts nevertheless be useful, so that we would
notice if a change happens? I think a change in these files is very
unlikely, though.

> Some individual examples of currently missing characters are U+00D8 (Ø)

This is already here:

    translit_neutral:<U00D8> "<U004F><U0045>"

(manually edited). And my patch adds it to translit_combining as:

    +% LATIN CAPITAL LETTER O WITH STROKE
    +<U00D8> <U004F>

(But as a special hack, this does not come from UnicodeData.txt).

> and U+0110 (Đ)

Adding this seems to make sense as well. I added it to the “special
hack” section of my gen_translit_combining.py:

    special_decompose_dict = {
        (0x0110,): [0x0044], # Đ → D
        (0x0111,): [0x0064], # đ → d
    ...
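
The reason a special table is needed can be shown with a small sketch.
U+0110 and U+0111 have no decomposition mapping in the Unicode data, so
a generator that only reads decompositions never produces a rule for
them. The sketch below is not the code from gen_translit_combining.py:
the function name `decompose` is hypothetical, only the two dictionary
entries shown above are taken from the patch, and Python's `unicodedata`
module stands in for parsing UnicodeData.txt.

```python
import unicodedata

# Only these two entries are taken from the message above; the real
# dictionary in the patch contains more (elided there with "...").
special_decompose_dict = {
    (0x0110,): [0x0044],  # Đ → D
    (0x0111,): [0x0064],  # đ → d
}


def decompose(code_points):
    """Return a decomposition for a sequence of code points.

    Hypothetical helper: the special table wins, then the canonical
    decomposition from the Unicode data, otherwise the input is
    returned unchanged.
    """
    key = tuple(code_points)
    if key in special_decompose_dict:
        return special_decompose_dict[key]
    if len(key) == 1:
        fields = unicodedata.decomposition(chr(key[0])).split()
        # Compatibility decompositions start with a tag such as
        # "<compat>"; only canonical ones are handled in this sketch.
        if fields and not fields[0].startswith("<"):
            return [int(field, 16) for field in fields]
    return list(key)


if __name__ == "__main__":
    # U+0110 has no decomposition in Unicode, hence the special entry.
    print(repr(unicodedata.decomposition("\u0110")))
    print([hex(cp) for cp in decompose([0x0110])])
    print([hex(cp) for cp in decompose([0x00D6])])
```

Putting the table lookup first means a manual rule can also override a
decomposition that Unicode does define, which may or may not be what
the real script does.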

> whereas other characters like U+00C6 (Æ) and U+0141
> (Ł) from their blocks (Latin-1 Supplement and Latin Extended-A,
> respectively) are present. Some characters (like U+2033, ″) have
> decomposition defined as is in Unicode

Yes, this one is in translit_compat:

    $ grep -B1 U2033 translit_compat 
    % DOUBLE PRIME
    <U2033> "<U2032><U2032>"

(was already there, not added by my patch, it is generated from
UnicodeData.txt).
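
As a rough illustration of how such an entry can be derived, here is a
minimal sketch that turns a compatibility decomposition into a line in
the translit_compat format. It is not the generator from my patch: the
helper names `ucs_symbol` and `compat_translit_entry` are made up for
this example, and Python's `unicodedata` module is used in place of
reading UnicodeData.txt directly, so the result depends on the Unicode
version bundled with the Python interpreter.

```python
import unicodedata


def ucs_symbol(code_point):
    """Format a code point as used in the locale sources, e.g. <U2033>."""
    if code_point > 0xFFFF:
        return "<U{:08X}>".format(code_point)
    return "<U{:04X}>".format(code_point)


def compat_translit_entry(char):
    """Return a comment line plus a translit rule for a character with a
    <compat> decomposition, or None if it has no such decomposition."""
    fields = unicodedata.decomposition(char).split()
    if not fields or fields[0] != "<compat>":
        return None
    targets = [int(field, 16) for field in fields[1:]]
    replacement = "".join(ucs_symbol(target) for target in targets)
    # Multi-character replacements are written as a quoted string.
    if len(targets) > 1:
        replacement = '"' + replacement + '"'
    return "% {}\n{} {}".format(
        unicodedata.name(char), ucs_symbol(ord(char)), replacement)


if __name__ == "__main__":
    print(compat_translit_entry("\u2033"))
```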

> but some characters (like U+00D6, Ö) have decomposition defined in
> Unicode but not in glibc.

glibc had this already in translit_combining:

    $ grep -B1 U00D6 translit_combining
    % LATIN CAPITAL LETTER O WITH DIAERESIS
    <U00D6> <U004F>

(was already there, not added by my patch, it is generated from
UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
combining character U+0308).
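
The decompose-and-strip step can be sketched in a few lines. This is
only an illustration under assumptions, not the script from my patch:
`strip_combining` is a hypothetical name, and it uses NFD normalization
from Python's `unicodedata` module rather than the decomposition field
of UnicodeData.txt.

```python
import unicodedata


def strip_combining(char):
    """Canonically decompose a character and drop combining marks.

    A character counts as a combining mark here when its canonical
    combining class is non-zero (U+0308 has class 230).
    """
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    base = strip_combining("\u00D6")
    # Written in the same form as the translit_combining entry above.
    print("<U00D6> <U{:04X}>".format(ord(base)))
```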

Before commit 18a3a9a3 this was in locale/C-translit.h.in,
but it was apparently removed on purpose by that commit:

    -/* <U00D6> LATIN CAPITAL LETTER O WITH DIAERESIS.  */
    -/* XXX It is not clear whether this is the best transliteration for
    -   all locales.  If not, we probably have to take it out completely.  */
    -"\xd6"   "OE"

“Ö” is transliterated to “OE” in German, for example, but in English
it is usually transliterated as just “O”. Therefore, translit_combining
transliterates it to “O” by decomposing it and stripping the combining
character, and locales like de_DE add their own transliteration rules:

    $ grep -A20 translit_start de_DE
    translit_start

    include "translit_combining";""

    % German umlauts.
    % LATIN CAPITAL LETTER A WITH DIAERESIS.
    <U00C4> "<U0041><U0308>";"<U0041><U0045>"
    % LATIN CAPITAL LETTER O WITH DIAERESIS.
    <U00D6> "<U004F><U0308>";"<U004F><U0045>"
    % LATIN CAPITAL LETTER U WITH DIAERESIS.
    <U00DC> "<U0055><U0308>";"<U0055><U0045>"
    % LATIN SMALL LETTER A WITH DIAERESIS.
    <U00E4> "<U0061><U0308>";"<U0061><U0065>"
    % LATIN SMALL LETTER O WITH DIAERESIS.
    <U00F6> "<U006F><U0308>";"<U006F><U0065>"
    % LATIN SMALL LETTER U WITH DIAERESIS.
    <U00FC> "<U0075><U0308>";"<U0075><U0065>"

    % Danish.
    % LATIN CAPITAL LETTER A WITH RING ABOVE.
    <U00C5> "<U0041><U030A>";"<U0041><U0041>"
    ...
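
Each rule above lists several alternatives separated by ";". The sketch
below assumes that the alternatives are tried in order and that the
first one representable in the target character set is used; that is
my understanding of how these rules behave, and the code is not taken
from glibc. The function name `pick_translit` is hypothetical, and
Python codecs stand in for the target charset.

```python
def pick_translit(candidates, encoding):
    """Return the first candidate that can be encoded in `encoding`,
    or None if none of them can."""
    for candidate in candidates:
        try:
            candidate.encode(encoding)
        except UnicodeEncodeError:
            continue
        return candidate
    return None


if __name__ == "__main__":
    # Alternatives for <U00D6> as listed in de_DE above:
    # "<U004F><U0308>" first, then "<U004F><U0045>".
    alternatives = ["O\u0308", "OE"]
    print(pick_translit(alternatives, "ascii"))
```

With an ASCII target the first alternative cannot be represented because
of U+0308, so under this assumption the second one, "OE", is chosen.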

