This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/16061] Review / update transliteration data


https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #6 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #5)
> (In reply to Marko Myllynen from comment #4)
> > (In reply to Mike FABIAN from comment #2)
> > > (In reply to Marko Myllynen from comment #0)
> > > 
> > > C-translit.h.in seems to be manually edited and not generated from
> > > Unicode data.
> > 
> > Based on earlier changelog comments it seems that C-translit.h.in was
> > updated manually for Unicode 3.2.0, should it now be updated for Unicode
> > 7.0.0 by some means?
> 
> Probably, but how?

Good question. Do you think it would be feasible to use the generator to also
produce C-translit.h.in (sans the previous individual additions)?

> > Perhaps we could add a note about its origins to the file.
> 
> There is already a note in the comment section of that file.

Ah, not sure how I missed that.

> > Also, shouldn't Ø and Æ be handled in the same way?
> 
> What do you mean by “handled in the same way”?

After applying the patch we would have different kinds of rules for Ø (U+00D8)
and Æ (U+00C6):

locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
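
To make the difference easier to see, here is a minimal Python sketch that decodes the four rules quoted above and groups them by source character. The helper names (`decode`, `rules_by_char`) are made up for illustration, and the parsing only handles the simple `file:<UXXXX> target` lines shown here, not the full locale file syntax.

```python
import re

# The four rules quoted above, copied as "file:source target" lines.
RULES = """\
locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
"""

def decode(text):
    """Replace each <UXXXX> token with its character and drop the quotes."""
    chars = re.sub(r"<U([0-9A-Fa-f]{4})>",
                   lambda m: chr(int(m.group(1), 16)), text)
    return chars.replace('"', "")

def rules_by_char(listing):
    """Map each source character to {file name: transliteration}."""
    table = {}
    for line in listing.splitlines():
        path, rule = line.split(":", 1)
        source, target = rule.split(None, 1)
        file_name = path.rsplit("/", 1)[-1]
        table.setdefault(decode(source), {})[file_name] = decode(target)
    return table

table = rules_by_char(RULES)
for char, per_file in table.items():
    same = len(set(per_file.values())) == 1
    print(f"U+{ord(char):04X}", per_file, "consistent" if same else "differs")
```

With these four lines, U+00D8 maps to O in translit_combining but to OE in translit_neutral, while U+00C6 maps to AE in both files.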

> > Looking at translit_neutral in more detail, I think it's actually the wrong
> > place for letters; it should contain non-letters only, and if specific rules
> > are needed for letters like Ø or Æ, those should be added directly in locale
> > files (so the patch discussed in bug 15593 should not have been applied to
> > translit_neutral after all). This would also mean that the special rules in
> > the generator for cases like EM DASH and EN DASH should probably end up in
> > translit_neutral, not translit_combining.
> 
> My guess is that the purpose of translit_neutral is to contain
> transliterations which are locale “neutral”, i.e. are the same for
> all locales. So I see no reason not to include letters.

Yeah, outright excluding *all* letters might be too harsh for cases where it's
clear what the result should be, but from the discussion in bug 15593 and the
above handling of Ø I got the impression that translit_neutral is probably not
the right place for it. If a letter is being added to translit_combining by the
generator, isn't it better to have it there than in the manually created
translit_neutral? I see that i18n includes translit_neutral; I'm not sure
whether that imposes any requirements.

> > > > but some characters (like U+00D6, Ö) have decomposition defined in
> > > > Unicode but not in glibc.
> > > 
> > > glibc had this already in translit_combining:
> > > 
> > > (was already there, not added by my patch, it is generated from
> > > UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> > > combining character U+0308).
> > 
> > Yes, I think what I meant to say was that the decomposition to U+004F U+0308
> > was missing, but as you point out it is defined in some locales where it
> > would be needed. Btw, I wonder whether U+00D6 should actually decompose to
> > U+004F U+00A8 after U+004F U+0308 in those locales?
> 
> Ö -> O¨
> 
> Why? Is that a reasonable transliteration? It throws away less
> information but I think it is common practice to transliterate Ö
> just as O in English for example.

I was merely speculating on this, perhaps we can forget this part.
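
For reference, the decompose-then-strip step described in the quoted text (U+00D6 decomposing to U+004F U+0308, then dropping the combining character) can be sketched with Python's standard unicodedata module. This is only an illustration of the idea and not the glibc generator itself; the function name `strip_combining` is made up here, and full NFD normalization is assumed as a stand-in for whatever decomposition the generator performs.

```python
import unicodedata

def strip_combining(char):
    """Canonically decompose char, then drop all combining characters."""
    decomposed = unicodedata.normalize("NFD", char)
    # combining() returns 0 for base characters, non-zero for combining marks.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

# Decomposition field for U+00D6 as stored in the Unicode character database.
print(unicodedata.decomposition("\u00d6"))   # 004F 0308
# Base letter left after removing U+0308 COMBINING DIAERESIS.
print(strip_combining("\u00d6"))             # O
```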

Thanks.

