This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
[Bug localedata/16061] Review / update transliteration data
- From: "myllynen at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: libc-locales at sourceware dot org
- Date: Mon, 04 May 2015 11:37:29 +0000
- Subject: [Bug localedata/16061] Review / update transliteration data
- Auto-submitted: auto-generated
- References: <bug-16061-716 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=16061
--- Comment #6 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #5)
> (In reply to Marko Myllynen from comment #4)
> > (In reply to Mike FABIAN from comment #2)
> > > (In reply to Marko Myllynen from comment #0)
> > >
> > > C-translit.h.in seems to be manually edited and not generated from
> > > Unicode data.
> >
> > Based on earlier changelog comments it seems that C-translit.h.in was
> > updated manually for Unicode 3.2.0. Should it now be updated for Unicode
> > 7.0.0 by some means?
>
> Probably, but how?
Good question. Do you think it would be feasible to use the generator to also
produce C-translit.h.in (sans the previous individual additions)?
> > Perhaps we could add a note about its origins to the file.
>
> There is already a note in the comment section of that file.
Ah, not sure how I missed that.
> > Also, shouldn't Ø and Æ be handled in the same way?
>
> What do you mean by “handled in the same way”?
After applying the patch we would have different kinds of rules for Ø (U+00D8)
and Æ (U+00C6):
locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
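
To make the difference concrete, here is a minimal sketch that compares the four rules quoted above. It only parses those four lines as pasted into this comment; it does not read the real locale files, and the helper names (`decode`, `rules_by_char`, `conflicts`) are made up for illustration.

```python
import re
from collections import defaultdict

# The four rules quoted above, in "file:<source> replacement" form.
GREP_OUTPUT = """\
locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
"""


def decode(symbols):
    """Turn a run of <UXXXX> symbols into the text it denotes."""
    hex_codes = re.findall(r"<U([0-9A-Fa-f]{4,8})>", symbols)
    return "".join(chr(int(code, 16)) for code in hex_codes)


def rules_by_char(grep_output):
    """Map each source character to {file name: replacement text}."""
    table = defaultdict(dict)
    for line in grep_output.splitlines():
        path, rule = line.split(":", 1)
        source, target = rule.split(None, 1)
        table[decode(source)][path.rsplit("/", 1)[-1]] = decode(target)
    return dict(table)


def conflicts(table):
    """Keep only the characters whose replacement differs between files."""
    return {ch: files for ch, files in table.items()
            if len(set(files.values())) > 1}


if __name__ == "__main__":
    for ch, files in conflicts(rules_by_char(GREP_OUTPUT)).items():
        print(f"U+{ord(ch):04X}", files)
```

With the rules as quoted, only U+00D8 is reported: it maps to "O" in translit_combining but to "OE" in translit_neutral, while U+00C6 maps to "AE" in both.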
> > Looking at translit_neutral in more detail, I think it's actually the wrong
> > place for letters; it should contain non-letters only, and if specific rules
> > are needed for letters like Ø or Æ, those should be added directly in locale
> > files (so the patch discussed in bug 15593 should not have been applied to
> > translit_neutral after all). This would also mean that the special rules in
> > the generator for cases like EM DASH and EN DASH should probably end up in
> > translit_neutral, not translit_combining.
>
> My guess is that the purpose of translit_neutral is to contain
> transliterations which are locale “neutral”, i.e. are the same for
> all locales. So I see no reason not to include letters.
Yeah, outright excluding *all* letters might be too harsh for cases where it's
clear what the result should be, but from the discussion in bug 15593 and the
above handling of Ø I got the impression that translit_neutral is probably not
the right place for it. If a letter is being added to translit_combining by the
generator, isn't it better to have it there than in the manually created
translit_neutral? I see that i18n includes translit_neutral; I'm not sure
whether that imposes any requirements.
> > > > but some characters (like U+00D6, Ö) have decomposition defined in
> > > > Unicode but not in glibc.
> > >
> > > glibc had this already in translit_combining:
> > >
> > > (was already there, not added by my patch, it is generated from
> > > UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> > > combining character U+0308).
> >
> > Yes, I think what I meant to say was that the decomposition to U+004F U+0308
> > was missing but as you point out it is defined in some locales where it
> > would be needed. Btw, I wonder whether U+00D6 should actually decompose to
> > U+004F U+00A8 after U+004F U+0308 in those locales?
>
> Ö -> O¨
>
> Why? Is that a reasonable transliteration? It throws away less
> information but I think it is common practice to transliterate Ö
> just as O in English for example.
I was merely speculating on this; perhaps we can forget this part.
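
For reference, the decompose-then-strip step described in the quoted text can be sketched with Python's standard unicodedata module. This is only an illustration, not the generator used for the patch: the function name `strip_combining` is made up, and Python ships its own copy of the Unicode database, which is not necessarily 7.0.0.

```python
import unicodedata


def strip_combining(ch):
    """Canonically decompose ch, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    for ch in ("\u00d6", "\u00d8", "\u00c6"):
        # decomposition() returns the raw UnicodeData.txt field, "" if none.
        print(f"U+{ord(ch):04X}",
              repr(unicodedata.decomposition(ch)),
              repr(strip_combining(ch)))
```

U+00D6 decomposes to U+004F U+0308 and ends up as a plain O, whereas U+00D8 and U+00C6 have no decomposition in UnicodeData.txt and pass through unchanged, so any rule for them has to come from somewhere other than this step.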
Thanks.