This is the mail archive of the guile@cygnus.com mailing list for the guile project.
Jim Blandy <jimb@red-bean.com> writes:

> Can you be more specific about why 32 bits are needed?  Which
> character sets does Unicode not accommodate?  Or is that the wrong
> question for me to ask?

Unicode is suitable for all the currently supported languages using
alphabets or syllable representations.  Besides the Latin, Cyrillic,
Greek, and Hebrew based alphabets, this includes

	arabic
	african languages (those which are standardized)
	hangul
	hiragana
	katakana
	tamil
	bengali
	...

Unicode will not be able to represent all of Hanzi/Kanji/Hanja (the
ideographs).  News from the Taiwanese committees suggests they have
another 30,000 ideographs, with more to come.

Besides this, the Unicode Consortium will now start allocating the next
page of the charset, i.e., code points >= 0x10000, and this means that
even to represent the complete Unicode repertoire in 16-bit units you
will need the escape notation.

> An Emacs buffer must hold large amounts of text, and must also serve
> as the operand to editing and searching commands.  It is terribly
> clumsy to use a variable-length encoding in buffers.  Since the
> buffer representation must be the foundation of all other i18n
> support, it's important to get it right.  Doubling the text storage
> required isn't so unreasonable; quadrupling it is.

I understand the problem, and we were all through this two years ago.
I don't remember exactly who participated, but at least Per Bothner,
François Pinard, RMS, and I did.  We didn't come to a conclusion, since
RMS decided to go with the MULE encoding for Emacs 20, but in principle
the direction was (if I remember correctly; this might be colored by my
own views :-) that UCS4 support is necessary and that, if space matters
a lot while handling large texts, different encodings should be
supported at the same time.

E.g., take an editor buffer.  The text is normally separated into
chunks, and each of these little buffers could have byte, UCS2, or
UCS4 encoding.
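
The chunk idea can be sketched roughly as follows.  This is only an
illustration in Python, not Emacs or Guile code: the names
narrowest_width, encode_chunk, encode_buffer, and char_at are made up
here, and the fixed chunk size is an assumption (a real editor would
split text at gaps or line boundaries).  Each chunk is stored at the
narrowest fixed width (1, 2, or 4 bytes per character) that fits its
largest code point, so indexing inside a chunk stays a simple
multiplication.

```python
def narrowest_width(text):
    """Bytes per character (1, 2, or 4) needed to store text at a fixed width."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1      # byte encoding
    if top < 0x10000:
        return 2      # UCS2
    return 4          # UCS4, needed for code points >= 0x10000


def encode_chunk(text):
    """Encode one chunk as (width, bytes) using big-endian fixed-width units."""
    width = narrowest_width(text)
    data = b"".join(ord(ch).to_bytes(width, "big") for ch in text)
    return width, data


def char_at(chunk, index):
    """Fetch the character at index; constant time because the width is fixed."""
    width, data = chunk
    start = index * width
    return chr(int.from_bytes(data[start:start + width], "big"))


def encode_buffer(text, chunk_size=4):
    """Split text into chunks of chunk_size characters (hypothetical policy)."""
    return [encode_chunk(text[i:i + chunk_size])
            for i in range(0, len(text), chunk_size)]


# Four ASCII letters, four Cyrillic letters, one code point beyond 0xFFFF.
sample = "abcd\u0436\u0437\u0438\u0439\U00020000"
chunks = encode_buffer(sample)
widths = [width for width, _ in chunks]
total_bytes = sum(len(data) for _, data in chunks)
```

For this sample the three chunks end up with different widths, and the
total storage is smaller than storing all nine characters as UCS4,
while every chunk still offers fixed-width access.
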
It might even be possible to let the user choose the encoding for a
given part of the buffer (in Emacs).

I know this is also a philosophical question: shall we support
languages that are not important to us, and spend a lot of work and
waste space to do so?  My position is that everybody should remember
ASCII (or ISO 8859-?, KOI, ...) and learn from those mistakes.  If you
want a fixed-width character set _AND_ support for all languages (even
those which will be standardized in the future), you have to support
UCS4.  Everything else is an optimization (OK, "UCS3", if it existed,
_could_ be enough, but this isn't useful, is it?).

-- Uli
---------------.      drepper at gnu.org  ,-.   Rubensstrasse 5
Ulrich Drepper  \    ,-------------------'   \  76149 Karlsruhe/Germany
Cygnus Solutions `--' drepper at cygnus.com   `------------------------