This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Improved check-localedef script


Zack Weinberg <zackw@panix.com> wrote:

> Here is an improved version of the check-localedef script I posted the
> other week.  It now takes only about 1.5 seconds to process all the
> files in localedata/locales/ (instead of seven seconds with the old
> parser), which is fast enough that I think it would be reasonable to
> run it during 'make check'.  Also, many bugs have been fixed.
> Especially, the "can we encode this string in the charset that the
> file is annotated with" test now actually _runs_...

Great!

> ... and finds dozens and dozens of errors. The full list is attached,
> but here's a small sample:
>
> localedata/locales/ur_PK... (charset: cp1256)
>   localedata/locales/ur_PK:114: string not representable in cp1256:
>       062C 0646 0648 0631 06CC
>   localedata/locales/ur_PK:115: string not representable in cp1256:
>       0641 0631 0648 0631 06CC
>   localedata/locales/ur_PK:117: string not representable in cp1256:
>       0627 067E 0631 06CC 0644
>
> These are the abmon strings, so I think it really would be a problem...

This is the first abmon string:

    abmon	"جنوری";/

The last letter in this string, ی (U+06CC ARABIC LETTER FARSI YEH),
cannot be encoded in CP1256.
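
This is easy to reproduce outside the script. Below is a minimal
sketch, assuming Python's built-in "cp1256" codec matches the charset
the locale file is annotated with; the helper name unencodable() is
made up for illustration and is not taken from check-localedef. The
three strings are the code point sequences from the report quoted
above.

```python
def unencodable(text, charset):
    """Return the characters of text that charset cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# abmon strings reported for ur_PK lines 114, 115 and 117,
# written as escapes so the code points are unambiguous.
abmon = [
    "\u062c\u0646\u0648\u0631\u06cc",
    "\u0641\u0631\u0648\u0631\u06cc",
    "\u0627\u067e\u0631\u06cc\u0644",
]

for s in abmon:
    # Print the offending code points for each string.
    print(["U+%04X" % ord(c) for c in unencodable(s, "cp1256")])
```

For each of the three strings the only character reported should be
U+06CC; the same call with "utf-8" returns an empty list.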

But this letter really does seem to be used in written Urdu; see:

    https://en.wikipedia.org/wiki/Urdu_alphabet
    https://en.wikipedia.org/wiki/Urdu_alphabet#Ye

So I think CP1256 is not a suitable charset to use for Urdu.

    https://en.wikipedia.org/wiki/Windows-1256

says:

Wikipedia> Windows-1256 is a code page used to write Arabic (and possibly some

Note the “possibly”.

Wikipedia> other languages that use Arabic script, like Persian and Urdu) under
Wikipedia> Microsoft Windows.
Wikipedia> [...]
Wikipedia> Unicode and UTF-8 are preferred to Windows 1256 in modern
Wikipedia> applications. 0.1% of all web pages use Windows-1256 in June 2016.

So CP1256 doesn’t seem to be used much anymore.

And we don’t have an Urdu locale in that encoding either; our Urdu
locale uses only UTF-8 encoding.

So I think we should replace

    % Charset: CP1256

with 

    % Charset: UTF-8

in ur_PK.

-- 
Mike FABIAN <mfabian@redhat.com>

