This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.


[Bug localedata/16061] Review / update transliteration data

From: "myllynen at redhat dot com" <sourceware-bugzilla at sourceware dot org>
To: libc-locales at sourceware dot org
Date: Mon, 04 May 2015 07:53:11 +0000
Subject: [Bug localedata/16061] Review / update transliteration data
Auto-submitted: auto-generated
References: <bug-16061-716 at http dot sourceware dot org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #4 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #2)
> (In reply to Marko Myllynen from comment #0)
> 
> C-translit.h.in seems to be manually edited and not generated from
> Unicode data.

Based on earlier changelog comments, it seems that C-translit.h.in was updated
manually for Unicode 3.2.0. Should it now be updated for Unicode 7.0.0 by some
means?

As discussed off-list, it seems that there are transliterations defined only in
C-translit.h.in (like U20B9, INDIAN RUPEE SIGN) which take effect only with the
C/POSIX locale but are not in any translit_* files. Should C-translit.h.in and
the translit_* files be synced for such cases? Or should C/POSIX perhaps be
"pure", with no rules other than those derived from Unicode, while the rest
could use locally added rules as well?

> These files seem to be automatically generated with some manual additions:
> 
>     locales/translit_circle
>     locales/translit_cjk_compat
>     locales/translit_combining
>     locales/translit_compat
>     locales/translit_font  
>     locales/translit_fraction
> 
> my patch updates them automatically from UnicodeData.txt keeping
> the manual additions wherever they seem to make sense.

Related to the above, I wonder whether we should make local changes more
obvious, for example by having translit_combining_unicode included from
translit_combining. It would make it much easier for others to see which
definitions come from Unicode and which are provided by glibc. Alternatively,
the generated rules could be grouped separately inside the translit_* files.
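
To illustrate the kind of separation meant here, the following is a minimal
sketch, assuming a translit_* file is reduced to simple "<UXXXX> target" lines
(comment lines starting with % and anything more complex are ignored). The
function names parse_rules and split_by_origin are hypothetical, and the sample
data is just the two rules mentioned in this thread, not the real file.

```python
import re

# Matches simple one-line rules such as "<U00D6> <U004F>".
RULE = re.compile(r"^(<U[0-9A-F]{4,8}>)\s+(.*\S)\s*$")


def parse_rules(text):
    """Collect {source symbol: target text} from simple rule lines."""
    rules = {}
    for line in text.splitlines():
        match = RULE.match(line)
        if match:  # comment lines starting with % never match
            rules[match.group(1)] = match.group(2)
    return rules


def split_by_origin(existing, generated):
    """Split existing rules into those reproduced by the generator and the rest."""
    from_unicode = {k: v for k, v in existing.items() if generated.get(k) == v}
    local = {k: v for k, v in existing.items() if k not in from_unicode}
    return from_unicode, local


# Hypothetical sample: one rule a generator would produce, one local addition.
sample_file = """\
% LATIN CAPITAL LETTER O WITH DIAERESIS
<U00D6> <U004F>
% LATIN CAPITAL LETTER O WITH STROKE
<U00D8> <U004F>
"""
sample_generated = {"<U00D6>": "<U004F>"}

from_unicode, local = split_by_origin(parse_rules(sample_file), sample_generated)
print("from Unicode:", sorted(from_unicode))
print("local additions:", sorted(local))
```

Whether the two groups then live in separate files or in separate sections of
one file is a layout decision; the split itself is the same either way.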

> is apparently manually edited and not generated.
> 
>     locales/translit_cjk_variants
> 
> is not generated from Unicode data either but from a UniVariants.Z
> file which can still be found here:
> 
> http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
> 
> It is from 2002-08-15 and I have no idea how it has been created.
> So I did not touch /translit_cjk_variants.

Perhaps we could add a note about its origins to the file.

> The following files
> 
>     locales/translit_hangul
>     locales/translit_narrow
>     locales/translit_small
>     locales/translit_wide
> 
> are automatically generated, but generating them automatically from
> Unicode 7.0.0 data would just reproduce the files as they are now,
> there have been no updates. Therefore I didn't write generator
> scripts for these. Would generator scripts nevertheless be useful,
> so that we would notice if a change happens? I think a change
> in these files is very unlikely though.

It indeed sounds unlikely but having a generator available might make things
easier 10 or 20 years from now if someone wants to verify the situation then.
But I think it's your call, I'm ok either way.
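
For what it's worth, such a generator could be quite small. The following is a
minimal sketch, not the script from the patch: it assumes the rules in
translit_wide correspond to the <wide> compatibility decompositions in
UnicodeData.txt, and it reads them from Python's bundled unicodedata module,
whose Unicode version depends on the Python release and so will not be exactly
7.0.0. The helper names ucs_symbol and tagged_rules are hypothetical.

```python
import sys
import unicodedata


def ucs_symbol(code_point):
    """Format a code point as <UXXXX>, or <UXXXXXXXX> outside the BMP."""
    width = 4 if code_point <= 0xFFFF else 8
    return f"<U{code_point:0{width}X}>"


def tagged_rules(tag):
    """Map every code point whose decomposition carries `tag` to its target text."""
    rules = {}
    for code_point in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(code_point)).split()
        if fields and fields[0] == tag:
            target = "".join(ucs_symbol(int(f, 16)) for f in fields[1:])
            # Assumption: multi-character targets are written as quoted strings.
            rules[code_point] = target if len(fields) == 2 else f'"{target}"'
    return rules


wide = tagged_rules("<wide>")
print(len(wide), "rules from Unicode", unicodedata.unidata_version)
print(ucs_symbol(0xFF21), wide[0xFF21])
```

Comparing such output against the checked-in file after each Unicode update
would show whether anything has changed.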

> > Some individual examples of currently missing characters are U+00D8 (Ø)
> 
> This is already here:
> 
>     translit_neutral:<U00D8> "<U004F><U0045>"

Yes, it was added a bit after this report, see bug 15593 and commit f20820.

> (manually edited). And my patch adds it to translit_combining as:
> 
>     +% LATIN CAPITAL LETTER O WITH STROKE
>     +<U00D8> <U004F>
> 
> (But as a special hack, this does not come from UnicodeData.txt).

Please see the above bug for more discussion on this; I'm not sure there is one
right answer as to which transliteration is the correct one to use here.

Also, shouldn't Ã and Ã be handled in the same way?

Looking at translit_neutral in more detail, I think it's actually the wrong place
for letters; it should contain non-letters only, and if specific rules are
needed for letters like Ã or Ã, those should be added directly in locale files
(so the patch discussed in bug 15593 should not have been applied to
translit_neutral after all). This would also mean that the special rules in the
generator for cases like EM DASH and EN DASH should probably end up in
translit_neutral, not translit_combining.

> > and U+0110 (Đ)
> 
> Adding this seems to make sense as well

Perhaps it might be best to start with a minimal set of special rules and commit
additional ones later (for example, I'd like to see 00D0, 00DE, and 014A with
their lowercase counterparts added)?

> > but some characters (like U+00D6, Ö) have decomposition defined in
> > Unicode but not in glibc.
> 
> glibc had this already in translit_combining:
> 
> (was already there, not added by my patch, it is generated from
> UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> combining character U+0308).

Yes, I think what I meant to say was that the decomposition to U+004F U+0308
was missing, but as you point out it is defined in some locales where it would
be needed. Btw, I wonder whether U+00D6 should actually decompose to
U+004F U+00A8 after U+004F U+0308 in those locales.
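
The decompose-and-strip step described in the quoted text can be sketched as
follows. This is only an illustration using Python's unicodedata module (which
may carry a newer Unicode version than 7.0.0), not the generator from the
patch; the function names strip_combining and rule are hypothetical. It also
shows why U+00D8 and U+0110 need special rules: UnicodeData.txt defines no
decomposition for them.

```python
import unicodedata


def ucs_symbol(char):
    """Format a character as <UXXXX>, or <UXXXXXXXX> outside the BMP."""
    code_point = ord(char)
    width = 4 if code_point <= 0xFFFF else 8
    return f"<U{code_point:0{width}X}>"


def strip_combining(char):
    """Return the canonical decomposition of `char` without combining marks,
    or None if Unicode defines no canonical decomposition for it."""
    fields = unicodedata.decomposition(char).split()
    if not fields or fields[0].startswith("<"):
        return None  # nothing defined, or only a compatibility mapping
    decomposed = unicodedata.normalize("NFD", char)
    base = "".join(c for c in decomposed if unicodedata.combining(c) == 0)
    return base or None


def rule(char):
    """Build a comment line plus rule line, or None if nothing can be derived."""
    base = strip_combining(char)
    if base is None:
        return None
    targets = "".join(ucs_symbol(c) for c in base)
    return f"% {unicodedata.name(char)}\n{ucs_symbol(char)} {targets}"


for char in "\u00d6\u00d8\u0110":
    print(ucs_symbol(char), "->", rule(char) or "no decomposition, needs a manual rule")
```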

Thanks.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
