Re: Bad Continuation of Multi-Byte UTF-8 Sequence
First of all, apologies for my misunderstanding. But in a way I'm glad,
because it led you to expand on your ideas and state:
> [...] I can see the day when the encoding will need to change
> within a single file.
I have such a file. Its name is mbox. It's not XML, but the biggest problem
I can see with having a file in multiple encodings is not being able to
grep and/or edit it easily. If such a day comes as you suggest, tools will
have to be revised to deal with it better. (Yes, mail clients do deal with
this particular issue very well.)
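They manage it because each message carries its own charset label. Two
adjacent messages in one mbox might declare (headers made up for
illustration):

    From alice Mon Jan 15 09:12:44 2001
    Content-Type: text/plain; charset=ISO-2022-JP

    From bob Mon Jan 15 10:03:10 2001
    Content-Type: text/plain; charset=ISO-8859-1

The client decodes each body by its own header; that per-fragment labeling
is exactly what grep and plain editors lack.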
I've considered using DocBook for multilingual documents, where one document
contains several languages. I can easily see this as a case where multiple
encodings would seem necessary. (No, I haven't gotten it to work, as finding
the right combination of lang="xx" tags that fits what the DTD allows for
children isn't easy. SmartDoc was designed to handle this case better.)
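For what it's worth, the structure I was attempting looked roughly like
this (a sketch only, not something I got to validate):

    <article>
      <sect1 lang="en">
        <title>Introduction</title>
        <para>English text.</para>
      </sect1>
      <sect1 lang="ja">
        <title>はじめに</title>
        <para>日本語のテキスト。</para>
      </sect1>
    </article>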
Nonetheless, isn't this where Unicode comes in to save the day? (I know
about the faults in Unicode, as some friends have to use more common glyphs
for their names when registering with Unicode-based software.) If one has
the glyphs, typing in multiple languages (each normally with multiple
encodings) becomes possible in a single file. I'm curious as to why you
would prefer multiple encodings in a single file over UTF-8. Or am I
misinterpreting your statement again?
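In other words, one UTF-8 declaration covers every script at once (a
minimal sketch):

    <?xml version="1.0" encoding="UTF-8"?>
    <para>English, 日本語, and français in one paragraph, one file,
    one encoding.</para>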
By the way, before I started to "get" Unicode, I also wanted to be able to
specify multiple encodings in a given file. I don't like that some friends
have glyphless names, but when everything is converted to and processed in
Unicode anyway, why fight it on the input file side?
>> [Case of Shift_JIS encoded XML with EUC-JP encoded XSL(T) snipped]
>
> Fair judgement, with the case you state. I'm presuming that multiple
> encoding fragments will become a norm rather than an exception. I guess
> processors will gradually align as code becomes more available.
Actually, while it is reasonable to have, for example, a Japanese-based XSL
set for dealing with the DocBook DTD in one of the major encodings, it makes
more sense to decide on one encoding for a given project and use it
throughout. And I think that for projects developed in environments with a
single language and multiple possible encodings, settling on a single
encoding is the norm.
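The XSLT layer even lets a project pin its encoding on output, whatever the
source files were saved in. For example, a stylesheet saved as EUC-JP can
still emit Shift_JIS (a minimal identity transform as a sketch):

    <?xml version="1.0" encoding="EUC-JP"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- every result document comes out in the project encoding -->
      <xsl:output method="xml" encoding="Shift_JIS"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>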
(That reminds me, I need to file a bug report asking for a font
specification for the bullets. TM, Circle-R, Circle-C, and a few others
cause errors when using a Japanese font family with FOP [the glyphs don't
exist in those fonts]. I'll try to do that today.)
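(One workaround might be to wrap the offending characters in a Latin font
at the FO level; a sketch, assuming that font actually carries the glyphs
in a given FOP setup:

    <fo:inline font-family="Times">&#x2122; &#xAE; &#xA9;</fo:inline>

That at least keeps the Japanese font family for the surrounding text.)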
Where the fragmentation is more likely to take place is in database storage.
Accessing multiple data sources may very well produce XML trees in different
encodings. But there, too, one encoding (UTF-8?) will most likely become the
standard. (Gee, I say "most likely" a lot. Am I that unsure? ;-)
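As a sketch of how that convergence already works in XSLT: document() reads
each external file according to its own declaration, and xsl:output
serializes the merged tree in one encoding (file names hypothetical):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" encoding="UTF-8"/>
      <xsl:template match="/">
        <merged>
          <!-- a.xml may be EUC-JP, b.xml Shift_JIS; each is decoded
               per its own <?xml ... encoding="..."?> line -->
          <xsl:copy-of select="document('a.xml')"/>
          <xsl:copy-of select="document('b.xml')"/>
        </merged>
      </xsl:template>
    </xsl:stylesheet>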
Thank you for the interesting ideas for a Monday morning to get the grey
cells working.
--
Michael Westbay
Work: Beacon-IT http://www.beacon-it.co.jp/
Home: http://www.seaple.icc.ne.jp/~westbay
Commentary: http://www.japanesebaseball.com/