This is the mail archive of the libc-alpha@sourceware.cygnus.com mailing list for the glibc project.



using iconv for conversion from/to Unicode



Hi Ulrich,

Would it be possible to add to glibc two encodings:

  (a) UCS-2 with the endianness and alignment restrictions of the running CPU,
      without byte order mark, i.e. arrays of uint16_t,

  (b) UCS-4 with the endianness and alignment restrictions of the running CPU,
      without byte order mark, i.e. arrays of uint32_t,

Proposed names: "uint16_t" and "uint32_t", or alternatively "UCS-2-INTERNAL"
and "UCS-4-INTERNAL".

This would be a great help for programs that use Unicode as their internal
representation. Some such programs use UTF-8 as their internal string
representation, but some others use uint16_t[] or uint32_t[]. Currently
such programs, in order to avoid endianness and BOM issues, have to
convert in two different steps: from the locale dependent encoding to
UTF-8 via iconv(), then from UTF-8 to uint16_t[] or uint32_t[] via a
self-written recoding loop. This wastes programmers' efforts and CPU cycles.
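
For illustration, the two-step workaround described above might look roughly
like the following. This is only a sketch: the function names (to_utf8,
utf8_to_ucs4, locale_to_ucs4) and the fixed 1024-byte scratch buffer are
made up for this example, and error handling is reduced to returning
(size_t)-1.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: source encoding -> UTF-8 via iconv().
   Returns the number of UTF-8 bytes written, or (size_t)-1 on failure. */
size_t to_utf8(const char *fromcode, const char *src, size_t srclen,
               char *dst, size_t dstcap)
{
    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    char *in = (char *)src;          /* iconv() takes a non-const pointer */
    char *out = dst;
    size_t inleft = srclen, outleft = dstcap;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : dstcap - outleft;
}

/* Step 2: the self-written loop, UTF-8 -> uint32_t[] in host byte order.
   Returns the number of code points written, or (size_t)-1 on malformed
   input or when dst is too small. */
size_t utf8_to_ucs4(const unsigned char *src, size_t srclen,
                    uint32_t *dst, size_t dstcap)
{
    size_t i = 0, n = 0;
    while (i < srclen) {
        unsigned char c = src[i];
        uint32_t cp;
        size_t extra;                /* number of continuation bytes */
        if (c < 0x80)                   { cp = c;        extra = 0; }
        else if (c >= 0xC2 && c < 0xE0) { cp = c & 0x1F; extra = 1; }
        else if (c >= 0xE0 && c < 0xF0) { cp = c & 0x0F; extra = 2; }
        else if (c >= 0xF0 && c < 0xF5) { cp = c & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid lead */
        if (extra > srclen - i - 1 || n >= dstcap)
            return (size_t)-1;       /* truncated input or no room left */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = src[i + k];
            if ((t & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (t & 0x3F);
        }
        /* Reject overlong forms, surrogates and values above U+10FFFF. */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}

/* Both steps chained; the intermediate UTF-8 copy is the wasted work. */
size_t locale_to_ucs4(const char *fromcode, const char *src, size_t srclen,
                      uint32_t *dst, size_t dstcap)
{
    char tmp[1024];                  /* arbitrary size, for the sketch only */
    size_t n = to_utf8(fromcode, src, srclen, tmp, sizeof tmp);
    if (n == (size_t)-1)
        return (size_t)-1;
    return utf8_to_ucs4((const unsigned char *)tmp, n, dst, dstcap);
}
```

Every caller has to carry a decoder like utf8_to_ucs4 and pay for the
intermediate buffer, which is the duplication the proposal aims to remove.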

"UCS-2-INTERNAL" would not be hard to implement: it is just an #ifdef
choice between "UNICODEBIG" and "UNICODELITTLE", both of which are already
implemented in glibc.
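
The #ifdef choice could be sketched as below. The macro name
UCS2_INTERNAL_NAME and the helper ucs2_internal_name_runtime are
hypothetical, and the compile-time test assumes a compiler that predefines
__BYTE_ORDER__ (GCC and Clang do); inside glibc itself the test would more
likely use the macros from <endian.h>.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time selection of the existing encoding name that matches the
   host byte order. Assumes the GCC/Clang predefined byte-order macros. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define UCS2_INTERNAL_NAME "UNICODEBIG"
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define UCS2_INTERNAL_NAME "UNICODELITTLE"
#else
# error "unknown byte order: cannot select a UCS-2 variant"
#endif

/* Run-time cross-check: inspect the first byte of a known 16-bit value. */
const char *ucs2_internal_name_runtime(void)
{
    const uint16_t probe = 0x0102;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01 ? "UNICODEBIG" : "UNICODELITTLE";
}
```

A caller could then pass UCS2_INTERNAL_NAME to iconv_open() and treat the
output buffer as an array of uint16_t.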

"UCS-4-INTERNAL" would not be hard to add either: It's already glibc's
internal encoding, but unfortunately you can't convert from/to it using
iconv().
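
If such a name were accepted, a caller could convert in a single step. The
sketch below uses "UCS-4-INTERNAL", which is only the name proposed in this
message; an iconv implementation that does not know it will make
iconv_open() fail, and the hypothetical helper to_ucs4_internal then returns
(size_t)-1.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* One-step conversion from fromcode straight into uint32_t[] in host byte
   order. Returns the number of code points written, or (size_t)-1 if the
   proposed encoding name is not supported or the conversion fails. */
size_t to_ucs4_internal(const char *fromcode, const char *src, size_t srclen,
                        uint32_t *dst, size_t dstcap)
{
    iconv_t cd = iconv_open("UCS-4-INTERNAL", fromcode);  /* proposed name */
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    char *in = (char *)src;          /* iconv() takes a non-const pointer */
    char *out = (char *)dst;         /* output bytes land in the uint32_t array */
    size_t inleft = srclen;
    size_t outleft = dstcap * sizeof *dst;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return dstcap - outleft / sizeof *dst;
}
```

No intermediate UTF-8 buffer and no hand-written decoder are needed in this
variant.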

Whatever new names you choose in glibc, they will be supported by the next
versions of 'libiconv' and 'recode'. Therefore don't worry about portability.
(glibc is the only system libc with a usable iconv() anyway...)

Bruno
