This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Improved check-localedef script
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Zack Weinberg <zackw at panix dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, Rafal Luzynski <digitalfreak at lingonborough dot com>
- Date: Tue, 08 Aug 2017 09:00:26 +0200
- Subject: Re: Improved check-localedef script
- References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com> <CAKCAbMhVb3+CzRcSGTHVuahuwHryhtZTEYq=XiSyERtjPwbmXw@mail.gmail.com>
Zack Weinberg <zackw@panix.com> wrote:
> On Thu, Aug 3, 2017 at 5:17 PM, Zack Weinberg <zackw@panix.com> wrote:
>> Here is an improved version of the check-localedef script I posted the
>> other week.
>
> Here is another revision which uses the SUPPORTED file to learn the
> legacy encodings for each locale, rather than looking at %Charset:
> annotations in the source files. You run it like this now (from the
> top level of the source tree):
>
> $ ./scripts/check-localedef.py -p localedata/locales -f \
>     localedata/SUPPORTED localedata/locales/*
>
> The final "localedata/locales/*" part is not _required_; it only
> enables the script to tell you about any locales that are missing from
> the SUPPORTED file.
>
> (Also, still more bugs have been fixed; in particular the
> "inappropriate character" errors have been restored. Doh.)
>
> It's possible that Python isn't going to work out as the
> implementation language for this script. I used it because its
> standard library provides Unicode normalization and many codecs for
> legacy encodings, but it doesn't know all of the encodings mentioned
> in localedata/SUPPORTED (ARMSCII-8, GEORGIAN-PS, and EUC-TW are
> missing) and I don't think it knows how to do transliteration, either.
> And it's still a solid order of magnitude slower than it should be.
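
The point about missing codecs can be checked directly against whatever Python is installed. The following is only a sketch, not part of check-localedef.py; the helper name missing_codecs is made up for illustration, and the list of names is just the three encodings mentioned above plus two common ones for comparison.

```python
import codecs


def missing_codecs(names):
    """Return the encoding names that this Python has no codec for."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # codecs.lookup raises LookupError for unknown encodings.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Encoding names as they appear in the message above, plus two
    # widely supported ones for comparison.
    candidates = ["ISO-8859-1", "KOI8-R", "ARMSCII-8", "GEORGIAN-PS", "EUC-TW"]
    for name in missing_codecs(candidates):
        print("no codec for", name)
```

Which names get printed depends on the Python version in use, so the result should be compared against the list given in the message rather than assumed.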
localedata/locales/uz_UZ:212: string not representable in iso8859-1:
0073 006F 02BB 006D
That is “soʻm”, where the third character is U+02BB MODIFIER LETTER TURNED COMMA.
In the Latin1 version of the uz_UZ locale this gets transliterated
into U+0027 APOSTROPHE:
$ LC_ALL=uz_UZ.ISO-8859-1 locale -k currency_symbol
currency_symbol="so'm"
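It looks like most of the “string not representable” warnings are false
positives: the script reports strings that cannot be encoded directly,
even though the installed locale transliterates them as shown above.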
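
One way a checker could avoid this kind of false positive is to retry the encoding after applying a transliteration table before reporting anything. The following is a minimal sketch under assumptions: the function name encode_with_translit and the one-entry TRANSLIT table are hypothetical, and glibc takes its actual transliteration rules from the locale sources, not from a table like this.

```python
# Hypothetical one-entry fallback table, for illustration only.
TRANSLIT = {"\u02bb": "'"}  # MODIFIER LETTER TURNED COMMA -> APOSTROPHE


def encode_with_translit(text, encoding, translit=TRANSLIT):
    """Encode text, retrying with transliteration if the direct attempt fails.

    Returns the encoded bytes, or None if the string is still not
    representable after transliteration.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        pass
    # Replace each character that has a fallback, then try again.
    fallback = text.translate(str.maketrans(translit))
    try:
        return fallback.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # The currency symbol from uz_UZ: 0073 006F 02BB 006D
    print(encode_with_translit("so\u02bbm", "iso8859-1"))
```

With a table like this, the checker would warn only when the function returns None, that is, when no fallback makes the string encodable.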
--
Mike FABIAN <mfabian@redhat.com>