This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
Re: locale encodings
- From: Troy Korjuslommi <tjk at tksoft dot com>
- To: Steven Abner <pheonix at zoomtown dot com>
- Cc: libc-locales at sourceware dot org, Carlos O'Donell <carlos at redhat dot com>, Keld Simonsen <keld at keldix dot com>
- Date: Thu, 14 Nov 2013 09:50:05 +0200
- Subject: Re: locale encodings
By the way, I ran some tests on the fi_FI locale for glibc-2.18, and it
seems to contain out-of-date information with regard to collation. The
correct collation order and data are specified in Finnish standard SFS-EN
13710, published in 2011 (the Finnish standard based on EN 13710, roughly
ISO/IEC 14651), and in CLDR, and are implemented in ICU. A quick look at
the fi_FI file tells me that at least the dates are off, which would imply
that the data is off as well. The collation errors seem to be diacritic
related, so I would have to go through the actual data to determine whether
the error is in strcoll's handling of UTF-8 or in the collation data. The
collation data seems to be the most likely suspect. Keld, your name is
listed as the contact, so it may be best that you check this out, in case
only the comments are off. Also, the charset is wrong. It is listed as
iso-8859-1 for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset
for Finnish is UTF-8. Only UTF-8 includes all the characters included in
the current standards.
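
For anyone who wants to repeat this kind of check, here is a minimal sketch
of a probe that sorts a word list through the C library's strcoll for a named
locale. It assumes the locale (for example fi_FI.UTF-8) is installed on the
test machine; the helper name sort_with_locale and the sample words are made
up for illustration, and no particular output is claimed since that depends
on the installed locale data.

```python
import locale
from functools import cmp_to_key


def sort_with_locale(words, locale_name):
    """Sort words with the C library's collation for locale_name.

    Returns None if the locale is not installed. The helper name and
    behaviour are illustrative assumptions, not part of glibc.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strcoll wraps the C library's collation comparison.
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        # Restore whatever collation locale was active before the call.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # Hypothetical sample words with diacritics; replace with real test data.
    sample = ["vaaka", "väärä", "wagner", "sakki", "šakki", "zeniitti", "žargon"]
    for name in ("fi_FI.UTF-8", "C"):
        print(name, sort_with_locale(sample, name))
```

Comparing the fi_FI.UTF-8 line against the order required by the standard
shows which pairs are affected, but it does not by itself say whether strcoll
or the collation data is at fault.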
Since EN 13710 specifies a European collation order, it should also be
used in other European locales as the default sorting order.
I've tried to push for more cooperation with CLDR in the past too, and
this is a good case in point for why it would actually be a good idea to
keep an eye on CLDR. There is no need to automate the process (the
difficulty of which seems to be the reason for resisting CLDR), only to
get the relevant data. Running comparison tests between CLDR and libc
would also be a good idea. ICU is pretty up-to-date in terms of CLDR and
other Unicode.org data, so that would be an easy way to implement the
tests.
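
A rough sketch of what such a comparison test could look like: take two
comparison functions and list every pair of words whose relative order
differs between them. The function names are hypothetical. In a real run one
comparator would be locale.strcoll under the libc locale being tested and the
other would come from an ICU binding, which is an assumption here; the demo
uses two stand-in comparators so that it runs without ICU.

```python
from itertools import combinations


def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def collation_disagreements(words, cmp_a, cmp_b):
    """Return the pairs (x, y) that cmp_a and cmp_b order differently."""
    return [
        (x, y)
        for x, y in combinations(words, 2)
        if sign(cmp_a(x, y)) != sign(cmp_b(x, y))
    ]


# Stand-in comparators for the demo only. In a real test, cmp_a would be
# locale.strcoll and cmp_b an ICU-based comparison (assumed, not shown).
def codepoint_cmp(x, y):
    return (x > y) - (x < y)


def casefold_cmp(x, y):
    return codepoint_cmp(x.casefold(), y.casefold())


if __name__ == "__main__":
    print(collation_disagreements(["a", "B", "c"], codepoint_cmp, casefold_cmp))
```

Checking pairs rather than whole sorted lists makes it easier to see which
characters are responsible for a difference.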
Troy
On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
>
> > all data that is important, save one, is in POSIX's 7-bit ASCII
>
> I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
> As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> That same designer could use UTF8, and default character information from glibc locales to
> create an embedded version of a code set for microwaves in China.
> Not saying this is standard, but my point was, I guess, is default character set for the locale could
> or should go into the ASCII section of "LC" data. Comments in any encoding get gobbled, quoted
> strings either in default character set or UTF8.
> I am no expert, just food for thought.
> Steve