This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Re: i18n; wide characters; Guile


Jim Blandy <jimb@red-bean.com> writes:

> Can you be more specific about why 32 bits are needed?  Which
> character sets does Unicode not accommodate?  Or is that the wrong
> question for me to ask?

Unicode is suitable for all currently supported languages that use
alphabets or syllabaries.  Besides the Latin, Cyrillic, Greek, and
Hebrew based alphabets, this includes

	Arabic
	African languages (those which are standardized)
	Hangul
	Hiragana
	Katakana
	Tamil
	Bengali
	...

Unicode will not be able to represent all of Hanzi/Kanji/Hanja (the
ideographs).  The news coming from the Taiwanese committees suggests
they have another 30,000 ideographs, with more to come.

Besides this, the Unicode Consortium will now start allocating the
next page of the charset, i.e., code points >= 0x10000, and this means
that even for representing the complete Unicode repertoire in 16-bit
units you will need the escape notation.
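
To make the cost of that escape notation concrete, here is a minimal
sketch, assuming it refers to the surrogate mechanism (a pair of
16-bit units for one code point >= 0x10000).  The function name
encode_utf16 is made up for illustration and is not part of any
existing library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store one code point as 16-bit units.
   Returns the number of units written to out (1 or 2). */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);              /* upper limit of the surrogate scheme */
    assert(cp < 0xD800 || cp > 0xDFFF);  /* surrogate values are not characters */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* fits in a single unit */
        return 1;
    }

    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

Once such pairs can occur, the N-th character is no longer at the
N-th 16-bit unit, so the representation is variable-length again.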

> An Emacs buffer must hold large amounts of text, and must also serve
> as the operand to editing and searching commands.  It is terribly
> clumsy to use a variable-length encoding in buffers.  Since the
> buffer representation must be the foundation of all other i18n
> support, it's important to get it right.  Doubling the text storage
> required isn't so unreasonable; quadrupling it is.

I understand the problem, and we went all through this two years ago.
I don't know exactly who participated, but at least Per Bothner,
François Pinard, RMS and I did.  We didn't come to a conclusion, since
RMS decided to go with the MULE encoding for Emacs 20, but in principle
the direction was (if I remember correctly; this might be colored by my
own views :-) that UCS4 support is necessary, and that if space matters
a lot while handling large texts, different encodings should be
supported at the same time.

E.g., take an editor buffer.  The text is normally split into chunks,
and each of these little buffers could have either byte, UCS2, or UCS4
encoding.  It might even be possible to let the user choose the
encoding for a given part of the buffer (in Emacs).
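
A rough sketch of such a chunk, in C.  All names here (chunk_width,
struct chunk, narrowest_width, chunk_ref) are hypothetical, and I
assume the byte encoding holds code points up to 0xFF directly; this
is not how Emacs or Guile actually store text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk storage widths, in bytes per character. */
enum chunk_width { WIDTH_BYTE = 1, WIDTH_UCS2 = 2, WIDTH_UCS4 = 4 };

struct chunk {
    enum chunk_width width;  /* fixed width used inside this chunk */
    size_t length;           /* number of characters */
    const void *data;        /* length * width bytes of storage */
};

/* Pick the narrowest fixed width able to hold every character in text. */
static enum chunk_width narrowest_width(const uint32_t *text, size_t n)
{
    enum chunk_width w = WIDTH_BYTE;
    for (size_t i = 0; i < n; i++) {
        if (text[i] > 0xFFFF)
            return WIDTH_UCS4;       /* nothing wider exists, stop early */
        if (text[i] > 0xFF)
            w = WIDTH_UCS2;
    }
    return w;
}

/* Fetch character i of a chunk; constant time whatever the width. */
static uint32_t chunk_ref(const struct chunk *c, size_t i)
{
    switch (c->width) {
    case WIDTH_BYTE: return ((const uint8_t *)c->data)[i];
    case WIDTH_UCS2: return ((const uint16_t *)c->data)[i];
    default:         return ((const uint32_t *)c->data)[i];
    }
}
```

Within a chunk the encoding stays fixed width, so indexing is cheap,
and only the chunks that really contain characters >= 0x10000 pay
four bytes per character.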


I know this is also a philosophical question.  Shall we support the
languages that are (for us) not important, and spend a lot of work and
waste space to do so?  My position is that everybody should remember
ASCII (or ISO 8859-?, KOI, ...) and learn from those mistakes.  If you
want a fixed width character set _AND_ support for all languages (even
those which will be standardized in the future) you have to support
UCS4.  Everything else is an optimization (OK, "UCS3", if this existed,
_could_ be enough, but this isn't useful, is it?).

-- Uli
---------------.      drepper at gnu.org  ,-.   Rubensstrasse 5
Ulrich Drepper  \    ,-------------------'   \  76149 Karlsruhe/Germany
Cygnus Solutions `--' drepper at cygnus.com   `------------------------