This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Locale/charset combinations


Hi,

Why are so few locale/character set combinations currently supported?
There must be hundreds, if not thousands, of valid combinations for which
no entry exists in localedata/SUPPORTED. For certain character sets that
are defined, there is no valid locale/charset combination at all. This
makes it impossible to use tools like sed and grep with the [:alpha:]
character class in regular expressions on files encoded in such character
sets (unless I am overlooking a way to override the character set used
for a given locale).
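
To give a rough idea of the gap, here is a minimal sketch that lists the
character maps that no entry in localedata/SUPPORTED refers to. It assumes
a glibc source tree in which the charmap files live in localedata/charmaps
and are named after the character set; the function name unused_charmaps
is made up for illustration.

```shell
# unused_charmaps SUPPORTED_FILE CHARMAP_DIR   (hypothetical helper)
# Prints the names in CHARMAP_DIR that never appear as the charmap
# part of a "locale/CHARMAP \" entry in SUPPORTED_FILE.
unused_charmaps() {
    used=$(mktemp)
    # Keep only entry lines, then cut out the part after the slash
    # and drop the trailing " \".
    grep / "$1" | grep -v '^#' | cut -d / -f 2 | cut -d ' ' -f 1 \
        | LC_ALL=C sort -u > "$used"
    # comm -23 prints the lines that are only in the first input.
    ls "$2" | LC_ALL=C sort -u | LC_ALL=C comm -23 - "$used"
    rm -f "$used"
}

# Usage, from the top of a glibc source tree:
#   unused_charmaps localedata/SUPPORTED localedata/charmaps
```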

Of course it is possible to first convert such files using iconv, but it
would be much more convenient to be able to use grep and sed on the raw
data.
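
For reference, the iconv workaround looks roughly like this. It is only a
sketch: the helper name grep_cp1252 is made up, and it assumes that a
UTF-8 locale called C.UTF-8 is available (substitute any installed UTF-8
locale, e.g. en_US.UTF-8).

```shell
# grep_cp1252 PATTERN FILE   (hypothetical helper)
# Converts FILE from CP1252 to UTF-8 on the fly and matches PATTERN
# under a UTF-8 locale, so that classes like [[:alpha:]] can cover
# accented letters. C.UTF-8 is an assumption; use a locale you have.
grep_cp1252() {
    iconv -f CP1252 -t UTF-8 "$2" | LC_ALL=C.UTF-8 grep "$1"
}

# Usage (dump.sql is a placeholder file name):
#   grep_cp1252 '[[:alpha:]]' dump.sql
```

Note that the output is UTF-8, not CP1252, so it has to be converted back
with iconv if the result must stay in the original encoding.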

Because the definition of the character codes in the character maps
doesn't have any impact on the other locale settings, the introduction
of many more locale/charset combinations shouldn't cause any breakage
afaict (even for incorrectly chosen character sets).

To me the lack of a locale supporting the character set is most notable
for CP1252 (aka "MS-ANSI"). Much of the data I have to work with uses
this encoding, mostly in the form of MySQL table dumps using the latin1
charset - yes, MySQL's latin1 is CP1252, not ISO-8859-1, see
http://dev.mysql.com/doc/refman/5.0/en/charset-mysql.html.

Since CP1252, ISO-8859-1 and ISO-8859-15 are very similar, I suppose any
locale currently supporting ISO-8859-1 or ISO-8859-15 should be able to
use CP1252.

It thus shouldn't be very intrusive to add the output of

$ for locale in $(grep "ISO-8859-1\(5\)\?\ " localedata/SUPPORTED |
      cut -f 1 -d / | cut -f 1 -d \. | sort -u); do
      echo "${locale}.CP1252/CP1252 \\"
  done

to localedata/SUPPORTED.
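
As for how similar these character sets are: a quick way to compare single
bytes is to decode them with iconv. The sketch below uses a made-up helper,
show_byte, and assumes an iconv that knows CP1252 and ISO-8859-15. The
euro sign serves as the example: it is 0x80 in CP1252 and 0xA4 in
ISO-8859-15, while 0xA4 in CP1252 (as in ISO-8859-1) is the generic
currency sign.

```shell
# show_byte HEX CHARSET   (hypothetical helper)
# Decodes the single byte 0xHEX as CHARSET and prints it as UTF-8.
show_byte() {
    # Turn the hex value into a three-digit octal escape for printf.
    printf "\\$(printf '%03o' "$((0x$1))")" | iconv -f "$2" -t UTF-8
}

show_byte 80 CP1252;      echo   # euro sign
show_byte A4 ISO-8859-15; echo   # euro sign
show_byte A4 CP1252;      echo   # currency sign
```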

The same is of course true for, e.g., the IBM and EBCDIC code pages and
their corresponding locales. For example:

en_US.IBM437/IBM437 \
nl_NL.IBM437/IBM437 \
nl_NL.IBM850/IBM850 \

and

de_AT.EBCDIC-AT-DE/EBCDIC-AT-DE \
de_AT.EBCDIC-AT-DE-A/EBCDIC-AT-DE-A \
fr_CA.EBCDIC-CA-FR/EBCDIC-CA-FR \

should be valid additions to localedata/SUPPORTED.

There are a couple of locale/charset combinations currently unsupported
for which a character set is mentioned in the locale file:

$ grep -i charset az_AZ bg_BG mk_MK POSIX ur_PK wal_ET
az_AZ:% Charset: ISO-8859-9E
bg_BG:% this: bg_BG.CP1251 (CP1251 is for coresponding charset), bg_BG.KOI8R,
mk_MK:% Charsets: UTF-8, ISO-8859-5, CP1251
POSIX:# Charset: ISO646:1993
ur_PK:% Charset: CP1256
wal_ET:% Charset: UTF-8

Thus the following should be valid additions to localedata/SUPPORTED:

az_AZ.ISO-8859-9/ISO-8859-9 \
bg_BG.KOI8R/KOI8R \
bg_BG.ISO-8859-5/ISO-8859-5 \
mk_MK.CP1251/CP1251 \
ur_PK.CP1256/CP1256 \
wal_ET/UTF-8 \

Note that I've left the defaults as they are, so I add
ur_PK.CP1256/CP1256 \
instead of
-ur_PK/UTF-8 \
+ur_PK/CP1256 \
+ur_PK.UTF-8/UTF-8 \

I'm not sure about these, as a POSIX locale file is present but has no
entry in localedata/SUPPORTED:

POSIX.ISO_646.BASIC \
POSIX.UTF-8/UTF-8 \

Leonard.


