This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Improved check-localedef script


On Fri, Aug 4, 2017 at 5:14 AM, Rafal Luzynski
<digitalfreak@lingonborough.com> wrote:
> 3.08.2017 23:17 Zack Weinberg <zackw@panix.com> wrote:
>> localedata/locales/br_FR... (charset: iso8859-1)
>>   localedata/locales/br_FR:122: string not representable in iso8859-1:
>>       006D 0065 0072 0063 02BC 0068 0065 0072
>> [...]
>
> Most probably this is because of <U02BC> which is a Unicode apostrophe.
> In order to be representable in iso8859-1 it needs to be converted
> to an ASCII apostrophe <U0027>.  Can we please have this in the conversion
> script?  This is really necessary as br_FR must be converted to both
> UTF-8 and ISO 8859-1.

Just to clarify, what I have written is not intended to be any sort of
_conversion_ script.  It is intended to be a _test_ script, which
will, once we get all the errors ironed out, run as part of "make
check" to ensure that new encoding-related mistakes do not appear in
the locales.

Now, what I think you're trying to say here is that it is okay to use
<U02BC> in br_FR because, when localedef generates the legacy
iso-8859-1 version of the locale, it will transliterate that to the
ASCII apostrophe.  Unfortunately, the codecs in Python (as of 3.6)
have no equivalent of the //translit mechanism in glibc's iconv, so I
don't (right now, anyway) see any way the script could know that.  I'm
open to suggestions.
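
One possible direction, purely as a sketch: the script could carry a
small table of known transliterations and apply it before trying to
encode.  The names TRANSLIT and representable below are hypothetical,
and the table holds only the single mapping discussed above; the
actual rules localedef applies would have to be mirrored or parsed
from the locale sources, which this does not attempt.

```python
# Hypothetical transliteration table: only the one mapping from this thread.
# U+02BC MODIFIER LETTER APOSTROPHE -> U+0027 APOSTROPHE
TRANSLIT = str.maketrans({"\u02BC": "\u0027"})


def representable(s, charset):
    """Return True if s encodes in charset after applying TRANSLIT."""
    try:
        s.translate(TRANSLIT).encode(charset)
        return True
    except UnicodeEncodeError:
        return False
```

With this, the br_FR string from line 122 (006D 0065 0072 0063 02BC
0068 0065 0072) would pass the iso8859-1 check, at the cost of
maintaining the table by hand.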

>> localedata/locales/da_DK... (charset: iso8859-1)
>>   localedata/locales/da_DK:145: string not representable in iso8859-1:
>>       0041 0308
>
> This is a false positive: 0308 is a combining diaeresis character so
> 0041 0308 produces A with diaeresis (Ä) which is representable in
> iso8859-1 as C4.  Even diaeresis standalone is representable as A8.

This is a similar issue.  Python's codecs will not attempt to
renormalize a character sequence before encoding it.

>>> import unicodedata
>>> "\u00C4".encode("iso-8859-1")
b'\xc4'
>>> unicodedata.normalize("NFC", "\u0041\u0308").encode("iso-8859-1")
b'\xc4'
>>> "\u0041\u0308".encode("iso-8859-1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0308' in position 1: ordinal not in range(256)

Perhaps I should go back to throwing errors on all non-NFC strings?  I
changed the script to allow NFD as well because it seemed like at
least some instances of NFD were intentional, but...
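
A middle ground, sketched here under the assumption that the legacy
charmap only needs the composed form: accept NFD input but normalize
to NFC before the representability check.  The function name
encodable_after_nfc is hypothetical and is not what the script
currently does.

```python
import unicodedata


def encodable_after_nfc(s, charset):
    """Return True if the NFC form of s can be encoded in charset."""
    try:
        unicodedata.normalize("NFC", s).encode(charset)
        return True
    except UnicodeEncodeError:
        return False
```

Under this check the da_DK sequence 0041 0308 would be accepted for
iso8859-1, while a decomposed sequence whose composed form is outside
the charset would still be reported.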

zw

