This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/16061] Review / update transliteration data
- From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Wed, 29 Apr 2015 07:12:21 +0000
- Subject: [Bug localedata/16061] Review / update transliteration data
- Auto-submitted: auto-generated
- References: <bug-16061-131 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=16061
--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Marko Myllynen from comment #0)
> The localedata/locales/translit_* files are probably, based on comments in
> them, at least partially generated from some version of UnicodeData.txt
> (based on 93a568 it looks like the last major update has been for Unicode
> 3.2 and 17b16e suggests them originally coming from an external
> contributor). However, there are some characters missing even from the
> Latin-1 Supplement block and in general it doesn't seem possible to update
> the files just by using UnicodeData.txt. Some of the rules live in
> locale/C-translit.h / locale/C-translit.h.in which also contain local
> changes (like 61d5a6 / 2a81ea).
C-translit.h.in seems to be manually edited and not generated from
Unicode data.
> It requires likely a lot of work to understand how the files have been
> generated in the first place, how to identify relevant local changes, and
> how to automate the process to update them in the future.
These files seem to be automatically generated with some manual additions:
locales/translit_circle
locales/translit_cjk_compat
locales/translit_combining
locales/translit_compat
locales/translit_font
locales/translit_fraction
My patch updates them automatically from UnicodeData.txt, keeping
the manual additions wherever they seem to make sense.
locales/translit_neutral
is apparently manually edited and not generated.
locales/translit_cjk_variants
is not generated from Unicode data either but from a UniVariants.Z
file which can still be found here:
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
It is dated 2002-08-15 and I have no idea how it was created,
so I did not touch locales/translit_cjk_variants.
The following files
locales/translit_hangul
locales/translit_narrow
locales/translit_small
locales/translit_wide
are automatically generated, but regenerating them from
Unicode 7.0.0 data would just reproduce the files as they are now,
since there have been no updates. Therefore I didn't write generator
scripts for these. Would generator scripts nevertheless be useful,
so that we would notice if a change happens? I think a change
in these files is very unlikely, though.
> Some individual examples of currently missing characters are U+00D8 (Ø)
This is already here:
translit_neutral:<U00D8> "<U004F><U0045>"
(manually edited). And my patch adds it to translit_combining as:
+% LATIN CAPITAL LETTER O WITH STROKE
+<U00D8> <U004F>
(But as a special hack, this does not come from UnicodeData.txt).
> and U+0110 (Đ)
Adding this seems to make sense as well, so I added it to the “special
hack” section of my gen_translit_combining.py:
special_decompose_dict = {
    (0x0110,): [0x0044], # Đ → D
    (0x0111,): [0x0064], # đ → d
...
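
The reason a special case is needed here is that UnicodeData.txt carries
no decomposition mapping for these characters. A minimal sketch of that
check, using Python's standard unicodedata module (which reflects the
Unicode version bundled with the interpreter, not necessarily 7.0.0).
The helper name needs_special_case is hypothetical and is not taken from
gen_translit_combining.py; only the two dict entries quoted above are
from the patch:

```python
import unicodedata

# Only the two entries quoted in this message; the real table is longer.
special_decompose_dict = {
    (0x0110,): [0x0044],  # Đ → D
    (0x0111,): [0x0064],  # đ → d
}


def needs_special_case(code_point):
    """Return True if Unicode defines no decomposition for the character.

    Hypothetical helper for illustration only.
    """
    return unicodedata.decomposition(chr(code_point)) == ""


for (code_point,), target in special_decompose_dict.items():
    print("U+%04X" % code_point,
          "no Unicode decomposition:", needs_special_case(code_point),
          "manual target:", ["U+%04X" % t for t in target])
```

The same check is true for U+00D8, which is why that character also has
to be handled outside the data derived from UnicodeData.txt.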
> whereas other characters like U+00C6 (Æ) and U+0141
> (Ł) from their blocks (Latin-1 Supplement and Latin Extended-A,
> respectively) are present. Some characters (like U+2033, ″) have
> decomposition defined as is in Unicode
Yes, this one is in translit_compat:
$ grep -B1 U2033 translit_compat
% DOUBLE PRIME
<U2033> "<U2032><U2032>"
(was already there, not added by my patch, it is generated from
UnicodeData.txt).
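
As an illustration of how such an entry can be derived, here is a small
sketch that reads the compatibility decomposition from Python's
unicodedata module and formats it in the style of translit_compat. The
function name compat_rule is hypothetical, this is not the generator
from my patch, and it only handles code points that fit in four hex
digits:

```python
import unicodedata


def compat_rule(char):
    """Format a translit_compat-style rule from a <compat> decomposition.

    Hypothetical helper; returns None if the character has no
    <compat> decomposition. Handles BMP code points only.
    """
    fields = unicodedata.decomposition(char).split()
    if not fields or fields[0] != "<compat>":
        return None
    target = "".join("<U%04X>" % int(field, 16) for field in fields[1:])
    return '<U%04X> "%s"' % (ord(char), target)


print(compat_rule("\u2033"))
```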
> but some characters (like U+00D6, Ö) have decomposition defined in
> Unicode but not in glibc.
glibc had this already in translit_combining:
$ grep -B1 U00D6 translit_combining
% LATIN CAPITAL LETTER O WITH DIAERESIS
<U00D6> <U004F>
(was already there, not added by my patch, it is generated from
UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
combining character U+0308).
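
The decompose-and-strip step described above can be sketched like this.
It is only an approximation written against Python's unicodedata module,
not the code in gen_translit_combining.py, and the names strip_combining
and translit_line are hypothetical:

```python
import unicodedata


def strip_combining(char):
    """Canonically decompose char and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def translit_line(char):
    """Format the result in the style of a translit_combining rule."""
    target = "".join("<U%04X>" % ord(c) for c in strip_combining(char))
    return "<U%04X> %s" % (ord(char), target)


print(translit_line("\u00d6"))
```

A character without a canonical decomposition, such as U+00D8, passes
through strip_combining unchanged, so it would not get a useful rule
from this step alone.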
Before commit 18a3a9a3, this was in locale/C-translit.h.in,
but it was apparently removed on purpose by that commit:
-/* <U00D6> LATIN CAPITAL LETTER O WITH DIAERESIS. */
-/* XXX It is not clear whether this is the best transliteration for
- all locales. If not, we probably have to take it out completely. */
-"\xd6" "OE"
“Ö” is transliterated to “OE” in German, for example, but in English
one usually transliterates it as just “O”. Therefore translit_combining
transliterates it to “O” by decomposing and stripping the combining
character, and locales like de_DE add their own transliteration rules:
$ grep -A20 translit_start de_DE
translit_start
include "translit_combining";""
% German umlauts.
% LATIN CAPITAL LETTER A WITH DIAERESIS.
<U00C4> "<U0041><U0308>";"<U0041><U0045>"
% LATIN CAPITAL LETTER O WITH DIAERESIS.
<U00D6> "<U004F><U0308>";"<U004F><U0045>"
% LATIN CAPITAL LETTER U WITH DIAERESIS.
<U00DC> "<U0055><U0308>";"<U0055><U0045>"
% LATIN SMALL LETTER A WITH DIAERESIS.
<U00E4> "<U0061><U0308>";"<U0061><U0065>"
% LATIN SMALL LETTER O WITH DIAERESIS.
<U00F6> "<U006F><U0308>";"<U006F><U0065>"
% LATIN SMALL LETTER U WITH DIAERESIS.
<U00FC> "<U0075><U0308>";"<U0075><U0065>"
% Danish.
% LATIN CAPITAL LETTER A WITH RING ABOVE.
<U00C5> "<U0041><U030A>";"<U0041><U0041>"
...
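
To make the rule syntax above more concrete: each rule lists one or more
alternatives separated by “;”, and as far as I understand it they are
tried in order until one can be represented in the target character set.
The following sketch parses one rule line and picks the first
alternative that a given Python codec can encode. It is a simplified
model and not glibc's implementation; the names parse_rule and
first_representable are hypothetical:

```python
import re

CODE_POINT = re.compile(r"<U([0-9A-F]{4,8})>")


def decode(symbols):
    """Turn a run of <Uxxxx> symbols into a Python string."""
    return "".join(chr(int(h, 16)) for h in CODE_POINT.findall(symbols))


def parse_rule(line):
    """Split a rule into its source character and list of alternatives."""
    source, _, rest = line.partition(" ")
    alternatives = [decode(alt) for alt in rest.split(";")]
    return decode(source), alternatives


def first_representable(alternatives, encoding):
    """Return the first alternative the codec can encode, else None."""
    for alternative in alternatives:
        try:
            alternative.encode(encoding)
            return alternative
        except UnicodeEncodeError:
            continue
    return None


source, alternatives = parse_rule('<U00D6> "<U004F><U0308>";"<U004F><U0045>"')
print(first_representable(alternatives, "ascii"))
```

In this model an ASCII target ends up with the second alternative,
because the first one still contains the combining character U+0308.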