Locales in GLIBC

One large and relatively independent part of glibc is the locale API and definitions of concrete locales; related to this is the subsystem dealing with various charsets and converting between them.

Locale data

The locale definitions are in a specific file format; some notes on it can be found in the locale(5) manpage, but they are sketchy at best. POSIX also describes the format and some fields, but not all of those commonly used in glibc (e.g. the week start definitions).

All strings in the file are written as Unicode code point symbols (e.g. <U0041>) instead of plain characters. To quickly inspect a file, build the helper with gcc -o show-ucs-data localedata/show-ucs-data.c (no build preparation is needed for this, not even ./configure) and then run ./show-ucs-data localedata/locales/en_US.
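
To illustrate what the notation encodes, here is a minimal Python sketch that turns <UXXXX> symbols back into plain characters. It is not a port of show-ucs-data.c; the function name decode_ucs_symbols and the sample abday line are made up for this example, and it assumes every symbol names a valid code point.

```python
import re

# Matches one <UXXXX> symbol (4 to 8 hex digits), as written in locale sources.
_UCS_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode_ucs_symbols(text: str) -> str:
    """Replace every <UXXXX> symbol in text with the character it names.

    Hypothetical helper for illustration only; text outside the symbols
    is passed through unchanged.
    """
    return _UCS_SYMBOL.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    # Illustrative input line, not copied from a real locale file.
    print(decode_ucs_symbols('abday "<U0053><U0075><U006E>;<U004D><U006F><U006E>"'))
```

Run on the sample line, this prints abday "Sun;Mon".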

Externally available locale data, such as ICU (International Components for Unicode), can be useful for cross-referencing.

Contributing

Submit locale updates as bug reports in the glibc bugzilla. Before doing so, always try to get in touch with the locale maintainer; if this is unsuccessful, describe the changes you have made and (this is important) provide evidence that they reflect common usage, e.g. local government or major newspaper sites, references to language norms, etc.

Week start

More and more applications rely on locale data when displaying calendar views and the like: in some locales the first column should be Monday, in others Sunday. Unfortunately, this data is currently not in very good shape, partly because of the confusing way the week start is declared.

There are three keywords for the LC_TIME section related to this: week, first_weekday and first_workday.

Furthermore, there is the question of the day keyword and which day of the week its list should start with. The specifications say Sunday, but they do not mention any of the week start specifiers. Applications that are aware of these tend to interpret the day list in a more complicated way.

The tricky thing is how to reconcile the information from WEEKSTARTDATE and first_weekday. PetrBaudis wrote some lengthy treatises about this on libc-locales; we present the outcome and thus our de facto current interpretation:

Thus, e.g., the en_GB definition (English locale with the week starting on Monday) is:

 week          7;19971130;4
 first_weekday 2
 first_workday 2
 day           "Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday"
 abday         "Sun;Mon;Tue;Wed;Thu;Fri;Sat"

Once your locale is compiled, you can use a simple first_weekday test tool to check that the day definitions are correct.
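
The arithmetic behind the en_GB values can be sketched in Python under one reading, which is an assumption of this example and not something this page spells out: the second field of week names a reference date whose weekday counts as day 1, and first_weekday is a 1-based offset from that day. The function week_start and the DAY_NAMES list are hypothetical names used only here.

```python
from datetime import date

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def week_start(reference_date: int, first_weekday: int) -> str:
    """Return the weekday name on which the calendar week starts.

    reference_date is the second field of the week keyword (YYYYMMDD);
    first_weekday is the 1-based value of the first_weekday keyword.
    Assumed interpretation: day 1 is the weekday of reference_date.
    """
    year, rest = divmod(reference_date, 10000)
    month, day = divmod(rest, 100)
    # date.weekday() counts Monday as 0; shift so that Sunday is 0.
    reference_index = (date(year, month, day).weekday() + 1) % 7
    return DAY_NAMES[(reference_index + first_weekday - 1) % 7]


if __name__ == "__main__":
    # Values taken from the en_GB definition above.
    print(week_start(19971130, 2))
```

With the en_GB values this prints Monday, because 1997-11-30 was a Sunday and first_weekday 2 selects the day after it. The same function returns Sunday for week_start(19971130, 1).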

Charsets

The iconv conversion always works internally by converting from the source charset to UCS-4 and then from UCS-4 to the target charset. This implies that the charset modules need to implement only the mapping to and from Unicode, and that characters not in Unicode are not convertible (luckily, this currently seems to be the case only for a few obscure ancient kanji characters).
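
The two-step pivot can be imitated with Python's own codecs, as a rough sketch only: it does not call glibc's iconv, and convert_via_ucs4 is a hypothetical name. The intermediate buffer is big-endian UTF-32, which stands in for UCS-4 here.

```python
def convert_via_ucs4(data: bytes, source: str, target: str) -> bytes:
    """Convert data from the source charset to the target charset via UCS-4.

    Illustration of the pivot idea using Python codecs, not glibc's iconv.
    """
    # Step 1: source charset -> UCS-4 (big-endian UTF-32, no byte order mark).
    ucs4 = data.decode(source).encode("utf-32-be")
    # Step 2: UCS-4 -> target charset.
    return ucs4.decode("utf-32-be").encode(target)


if __name__ == "__main__":
    # ISO-8859-1 "café" converted to UTF-8.
    print(convert_via_ucs4(b"caf\xe9", "iso8859-1", "utf-8"))
```

Each charset only needs to know how to reach the pivot and how to come back from it, so adding one charset does not require a converter for every other charset. A character that the target cannot represent makes step 2 fail (here with UnicodeEncodeError).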

Most of the charsets are simple (single-byte with a direct 1-1 Unicode mapping). The .c files for these are trivial; they depend on data in .h files that iconvdata/gen-8bit.sh generates from the localedata/charmaps/ files at build time.
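
As a rough picture of the data such a single-byte module needs, the following Python sketch builds both lookup tables from charmap-style lines. It does not reproduce what gen-8bit.sh emits; the sample lines are written in the style of a charmap file for this example, and build_tables is a hypothetical name.

```python
import re

# Illustrative lines in the style of a localedata/charmaps/ file
# (symbol, byte value, character name); not copied from a real file.
SAMPLE_CHARMAP = """\
<U0041>     /x41         LATIN CAPITAL LETTER A
<U00E9>     /xe9         LATIN SMALL LETTER E WITH ACUTE
<U20AC>     /xa4         EURO SIGN
"""

# One single-byte entry: <UXXXX> followed by /xHH.
_ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+/x([0-9A-Fa-f]{2})\b")


def build_tables(charmap_text: str):
    """Return (to_unicode, from_unicode) for a single-byte charset.

    to_unicode is a 256-entry list indexed by byte value (None where the
    byte is unmapped); from_unicode maps code points back to byte values.
    """
    to_unicode = [None] * 256
    from_unicode = {}
    for line in charmap_text.splitlines():
        match = _ENTRY.match(line)
        if match is None:
            continue  # skip anything that is not a single-byte entry
        code_point = int(match.group(1), 16)
        byte = int(match.group(2), 16)
        to_unicode[byte] = code_point
        from_unicode[code_point] = byte
    return to_unicode, from_unicode


if __name__ == "__main__":
    to_unicode, from_unicode = build_tables(SAMPLE_CHARMAP)
    print(hex(to_unicode[0xA4]), hex(from_unicode[0x20AC]))
```

Converting a byte is then a single index into to_unicode, and the reverse direction is a single lookup in from_unicode.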

None: Locales (last edited 2012-05-09 21:39:12 by CarlosODonell)