This is the mail archive of the guile@cygnus.com mailing list for the guile project.



i18n; wide characters; Guile



The Guile team is beginning to work on cleaning up Guile's support for
wide character sets, as part of a general push for i18n.  I would like
to hear your thoughts on which directions we should take.

At the moment, Guile Scheme has complex and insufficient support for
wide character sets, the details of which I won't go into.  We are
considering redesigning the string representation and the I/O ports.

If one uses variable-width characters, one has a host of problems: if
the interpreter attempts to conceal the fact that characters are
variable-width, it is difficult to make string-length, string-ref, and
string-set! work in constant time; string-set! might change the length
of the string; and so on.  If the interpreter exposes the
variable-width representation to the programmer, this just passes the
buck, making the programmer responsible for implementing the encoding.
Neither of these tactics is attractive, so variable-width characters
seem problematic.
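
To make the cost concrete, here is a rough sketch in Python (chosen
only for illustration; none of this is Guile code).  It assumes UTF-8
as the variable-width encoding, and the names vstring_ref and
vstring_set are hypothetical stand-ins for string-ref and string-set!.

```python
# Illustrative only: UTF-8 stands in for "a variable-width encoding".
# All helper names here are hypothetical, not Guile procedures.

def char_width(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    return 4  # assumes well-formed input


def char_offset(buf, k):
    """Byte offset of the k-th character: a linear scan, not O(1)."""
    pos = 0
    for _ in range(k):
        pos += char_width(buf[pos])
    return pos


def vstring_ref(buf, k):
    """Return the k-th character of a UTF-8 buffer."""
    start = char_offset(buf, k)
    return buf[start:start + char_width(buf[start])].decode("utf-8")


def vstring_set(buf, k, ch):
    """Replace the k-th character; the buffer may grow or shrink."""
    start = char_offset(buf, k)
    end = start + char_width(buf[start])
    buf[start:end] = ch.encode("utf-8")


s = bytearray("na\u00efve".encode("utf-8"))  # 5 characters, 6 bytes
before = len(s)
vstring_set(s, 2, "i")                       # 2-byte character -> 1-byte
after = len(s)
```

Replacing one character changed the storage size from `before` to
`after` bytes, and every reference had to walk the buffer from the
start.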

If one tries to use multiple character encodings in memory, then one
must provide transparent conversion whenever strings are compared,
combined, hashed, etc.  This sounds like a bad idea, too.
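
A small sketch of what "transparent conversion" would involve, again
in Python purely for illustration.  TaggedString is a hypothetical
type, and the choice of UTF-8 for concatenation results is an
arbitrary assumption.

```python
class TaggedString:
    """Hypothetical string that remembers which encoding its bytes use."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding

    def _canonical(self):
        # Every comparison or hash pays for a decode to a common form.
        return self.data.decode(self.encoding)

    def __eq__(self, other):
        if not isinstance(other, TaggedString):
            return NotImplemented
        return self._canonical() == other._canonical()

    def __hash__(self):
        # Must agree with __eq__, so raw bytes cannot be hashed directly.
        return hash(self._canonical())

    def __add__(self, other):
        # Combining two encodings forces a choice of result encoding;
        # UTF-8 is picked here arbitrarily.
        joined = self._canonical() + other._canonical()
        return TaggedString(joined.encode("utf-8"), "utf-8")


a = TaggedString("caf\u00e9".encode("latin-1"), "latin-1")
b = TaggedString("caf\u00e9".encode("utf-8"), "utf-8")
same_text = (a == b)             # equal as text
same_bytes = (a.data == b.data)  # but stored differently
```

Every operation that touches two strings has to go through the
conversion step, which is the overhead I would like to avoid.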

The MULE character representation seems like a bad idea to me, because
it has all the problems of both of the above techniques; its only
advantage is that it saves space if one uses only 8-bit characters.

Thus, my current inclinations:
- Use 16-bit characters in strings throughout.
- Prescribe the use of Unicode throughout.
- Provide functions to convert between Unicode character strings and
  all other widely used formats: UTF-8, UTF-7, Latin-1, and the JIS
  variants, as well as anything else people would like to contribute.
- Provide a separate "byte array" type, for applications which
  genuinely want this.
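
As a rough illustration of the list above (in Python, not Scheme, and
not a proposed API): strings are held as fixed-width 16-bit units,
conversion happens only at the boundary, and raw bytes are a separate
type.  The names from_external and to_external are hypothetical, and
the sketch simply rejects characters that do not fit in 16 bits.

```python
from array import array


def from_external(data, encoding):
    """Decode externally encoded bytes into an array of 16-bit units."""
    units = array("H")  # unsigned 16-bit on mainstream platforms
    for ch in data.decode(encoding):
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("outside the 16-bit range assumed here")
        units.append(code)
    return units


def to_external(units, encoding):
    """Encode an array of 16-bit units into the named external format."""
    return "".join(chr(u) for u in units).encode(encoding)


wide = from_external(b"gr\xfc\xdf", "latin-1")  # 4 characters
third = wide[2]                                 # constant-time reference
as_utf8 = to_external(wide, "utf-8")            # 6 bytes externally

# Uninterpreted bytes live in a distinct type and are never indexed
# as characters.
raw = bytearray(as_utf8)

# A JIS variant goes through the same pair of functions.
nihon = from_external("\u65e5\u672c".encode("shift_jis"), "shift_jis")
```

With this layout the character count is just the array length, and
the byte count of any external form is a separate question.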

We may implement the 16-bit character strings in odd ways that save
space when the upper bytes of all the characters are zero, but that's
a separate issue.
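
One possible shape for such a space-saving trick, sketched in Python
only to show the idea (CompactString is hypothetical, and this is one
of several ways it could be done): store one byte per character while
every code is below 256, and widen the whole string the first time a
larger character is stored.

```python
from array import array


class CompactString:
    """Hypothetical string with a narrow and a wide storage layout."""

    def __init__(self, text):
        codes = [ord(c) for c in text]
        if any(c > 0xFFFF for c in codes):
            raise ValueError("outside the 16-bit range assumed here")
        narrow = all(c < 0x100 for c in codes)
        self.units = array("B" if narrow else "H", codes)

    def __len__(self):
        return len(self.units)

    def ref(self, k):
        return chr(self.units[k])  # O(1) in either layout

    def set(self, k, ch):
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("outside the 16-bit range assumed here")
        if code > 0xFF and self.units.typecode == "B":
            # One-time widening copy; the length in characters is unchanged.
            self.units = array("H", self.units.tolist())
        self.units[k] = code


s = CompactString("abc")
layout_before = s.units.typecode  # narrow layout
s.set(1, "\u03bb")                # store a character above 255
layout_after = s.units.typecode   # wide layout
```

The programmer still sees fixed-width characters; only the storage
behind them changes.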

What I'm most interested in is your advice regarding character sets
and (externally visible) text representations.  How would you
recommend we go about supporting wide character sets?  What do you
think of Unicode?