This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Handling of locale character sets


Hi,

I'm looking for some pointers on how character sets are handled
internally by glibc.  While I've read the relevant standards
(C, POSIX/SUS) and also had a peek at the source, there are some
details that I would like to get clarified if possible.

1) Points of character set conversion
-------------------------------------

As far as I'm aware, there are a few points at which the
character set of a string literal may be translated on its way
from the original source code to being displayed to the user:

- source encoding to execution charset encoding (gcc)
- execution charset encoding to output character set
  (locale codeset) (glibc)
- terminal display (terminal software)

While a few test programs do show that UTF-8-encoded
string literals are correctly transformed to the locale codeset,
I'm unsure exactly *how* glibc knows the encoding of the string
in the ELF object.  For wide strings, AFAICT these are always
UTF-32, but for narrow strings I don't get how this is done.
As an example, UTF-8 literals are correctly converted to ISO-8859-1
in a Latin-1 locale.

I assume that some conversion is done when using formatted
print functions such as printf when data is put into the
stream buffer, which has an associated locale(?)  Is the
conversion done on the format string, or also strings
referenced by the format string?  Is the charset for each
separate string known?
[I'm thinking here of object files compiled with different
-fexec-charset charsets]
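
For anyone wanting to reproduce the test, here is a minimal sketch of
helpers that make the conversion points visible.  It assumes the
caller has already run setlocale(LC_ALL, "").  The helper names
(current_codeset, wide_to_locale, dump_bytes) are made up for
illustration; the only library calls used are the standard
nl_langinfo(CODESET) and wcstombs().

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Name of the codeset of the current LC_CTYPE locale. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Convert a wide string to the multibyte encoding of the current
   locale.  Returns the number of bytes written (excluding the
   terminator), or (size_t)-1 if the buffer is empty or a character
   cannot be represented in the locale codeset. */
size_t wide_to_locale(const wchar_t *ws, char *out, size_t outsize)
{
    assert(ws != NULL && out != NULL);
    if (outsize == 0)
        return (size_t)-1;
    size_t n = wcstombs(out, ws, outsize - 1);
    if (n == (size_t)-1)
        return n;
    out[n] = '\0';  /* wcstombs does not terminate a completely filled buffer */
    return n;
}

/* Print the bytes of a narrow string in hex, so that terminal
   rendering cannot hide what was actually produced. */
void dump_bytes(const char *label, const char *s)
{
    printf("%s:", label);
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        printf(" %02x", *p);
    putchar('\n');
}
```

Calling dump_bytes() on a narrow literal and on the result of
wide_to_locale() for the corresponding wide literal, under both a
UTF-8 and a Latin-1 locale, should show which of the two paths
actually changes the bytes.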


2) Internals of locale charset handling
---------------------------------------

Looking at the sources in locale/, I can see the default
C locale definitions which are hardcoded into glibc.  Is there
any documentation about the locale data structures internal to
glibc?

#define STRUCT_CTYPE_CLASS(p, q) \
  struct                                                                      \
    {                                                                         \
      uint32_t isctype_data[8];                                               \
      uint32_t header[5];                                                     \
      uint32_t level1[1];                                                     \
      uint32_t level2[1 << q];                                                \
      uint32_t level3[1 << p];                                                \
    }

How do the "levels" work here and what is the format for the header
and isctype_data?
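
To make the question concrete, this is how a generic three-level
bitmap table is typically indexed.  It is a sketch of the general
technique only, not glibc's actual layout: the pointer-based
structure, the names, and the chosen widths are all hypothetical.
P_BITS and Q_BITS play roles analogous to the p and q parameters of
the macro above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum {
    WORD_BITS = 5,  /* each uint32_t holds 32 one-bit flags */
    P_BITS = 4,     /* log2 of words per level-3 block (hypothetical) */
    Q_BITS = 7,     /* log2 of entries per level-2 block (hypothetical) */
    L1_SIZE = 0x110000 >> (WORD_BITS + P_BITS + Q_BITS)  /* U+0000..U+10FFFF */
};

typedef struct { uint32_t bits[1 << P_BITS]; } leaf_t;      /* level 3 */
typedef struct { leaf_t *leaf[1 << Q_BITS]; } mid_t;        /* level 2 */
typedef struct { mid_t *mid[L1_SIZE]; } class_table_t;      /* level 1 */

/* Mark code point wc as a member of the class.  Blocks are allocated
   lazily, so unused ranges cost nothing.  Returns 0 on success, -1 if
   wc is out of range or allocation fails. */
int class_table_set(class_table_t *t, uint32_t wc)
{
    assert(t != NULL);
    uint32_t i1 = wc >> (WORD_BITS + P_BITS + Q_BITS);
    uint32_t i2 = (wc >> (WORD_BITS + P_BITS)) & ((1u << Q_BITS) - 1);
    uint32_t i3 = (wc >> WORD_BITS) & ((1u << P_BITS) - 1);

    if (i1 >= L1_SIZE)
        return -1;
    if (!t->mid[i1] && !(t->mid[i1] = calloc(1, sizeof(mid_t))))
        return -1;
    mid_t *m = t->mid[i1];
    if (!m->leaf[i2] && !(m->leaf[i2] = calloc(1, sizeof(leaf_t))))
        return -1;
    m->leaf[i2]->bits[i3] |= UINT32_C(1) << (wc & 0x1f);
    return 0;
}

/* Return 1 if wc is in the class, 0 otherwise.  A missing block at
   any level means "no member in this whole range". */
int class_table_test(const class_table_t *t, uint32_t wc)
{
    assert(t != NULL);
    uint32_t i1 = wc >> (WORD_BITS + P_BITS + Q_BITS);
    if (i1 >= L1_SIZE || !t->mid[i1])
        return 0;
    const leaf_t *leaf =
        t->mid[i1]->leaf[(wc >> (WORD_BITS + P_BITS)) & ((1u << Q_BITS) - 1)];
    if (!leaf)
        return 0;
    return (leaf->bits[(wc >> WORD_BITS) & ((1u << P_BITS) - 1)]
            >> (wc & 0x1f)) & 1;
}
```

The point of the extra levels is that large unassigned ranges share
a single "empty" marker instead of occupying bitmap space.  Whether
glibc stores offsets rather than pointers, and what the five header
words hold, is exactly what I would like confirmed.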

Is the C locale data in exactly the same format as what gets
mapped in for all other locales?

I'm interested in altering the C locale on my system to use
a UTF-8 codeset in place of ASCII, and I just want to get a
handle on how it all works, and exactly what I need to do to
tackle this.

For POSIX/SUS compliance this will necessarily have the same
restrictions on collation that the existing ASCII C locale has
for some types, such as digits.

[The reason for doing this is to have UTF-8 as default for everything,
with UTF-8 as the lowest-common-denominator charset on the system,
with everything handling UTF-8 input and output by default; currently
some things break with UTF-8 input in the C (and only the C) locale,
so I'd like to see if this is a possible solution.  And I also thought
it would be fun!]


Many thanks,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.

