This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Improved check-localedef script


On Fri, Aug 4, 2017 at 5:32 AM, Mike FABIAN <mfabian@redhat.com> wrote:
>> These are the collating tables.  Necessary for UTF-8 but I'm not sure
>> what to do with them in 8-bit charset.
>
> The cs_CZ collation tables contain many characters not from the Czech
> language.
>
> Line 477 has Æ U+00C6:
>
> <U00C6> "<U0041><U0045>";"<U00C6><U00C6>";"<CAPITAL><CAPITAL>";"<U0041><U0045>"
>
> I am surprised that the script doesn’t print even more warnings;
> it doesn’t print warnings for these:
>
> % katakana/hiragana sorting
> % base is katakana, as this is present in most charsets
> % normal before voiced before semi-voiced
> % small vocals before normal vocals
> % katakana before hiragana
>
> <U30A1> <U30A1>;<U30A1>;IGNORE;<U30A1>
> ...

The script currently has no understanding of the large-scale structure
of a locale definition.  It attempts to convert _all_ double-quoted
strings, and _only_ double-quoted strings, into the encoding for the
locale, regardless of context.
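
To make that behaviour concrete, here is a minimal sketch of the approach,
not the actual check-localedef code.  The names `decode_symbols` and
`unrepresentable` are hypothetical, and the sketch assumes that quoted
strings contain no escaped quotes and that <UXXXX> symbols name valid
code points.

```python
import re

# Double-quoted strings; escaped quotes are not handled (simplifying assumption).
QUOTED = re.compile(r'"([^"]*)"')
# <UXXXX> or <UXXXXXXXX> code point symbols.
UCS_SYMBOL = re.compile(r'<U([0-9A-Fa-f]{4,8})>')


def decode_symbols(text):
    """Replace each <UXXXX> symbol with the character it names.

    Other symbolic names such as <CAPITAL> are left untouched.
    """
    return UCS_SYMBOL.sub(lambda m: chr(int(m.group(1), 16)), text)


def unrepresentable(line, charset):
    """Return the quoted strings on `line` that cannot be encoded in `charset`.

    Anything outside double quotes is ignored, whatever its context.
    """
    failures = []
    for match in QUOTED.finditer(line):
        decoded = decode_symbols(match.group(1))
        try:
            decoded.encode(charset)
        except UnicodeEncodeError:
            failures.append(match.group(1))
    return failures


# The two lines quoted earlier in this message.
ae_line = ('<U00C6> "<U0041><U0045>";"<U00C6><U00C6>";'
           '"<CAPITAL><CAPITAL>";"<U0041><U0045>"')
kana_line = '<U30A1> <U30A1>;<U30A1>;IGNORE;<U30A1>'

print(unrepresentable(ae_line, "iso-8859-2"))
print(unrepresentable(kana_line, "iso-8859-2"))
```

Under those assumptions the U+00C6 line is flagged because one of its
quoted strings cannot be encoded in ISO-8859-2, while the katakana line
passes silently because none of its weights are quoted.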

I can _implement_ a more sophisticated parser, but someone is going to
have to tell me what it should do.  Everything I know about these
files comes from the POSIX spec for localedef, and that is written so
generically that it's no help with questions like "do collation table
entries have to be fully representable in the encoding passed as the
-f option to localedef?"

zw

