This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.
Re: Choosing a characterset for DocBook
- From: Jens Stavnstrup <js at ddre dot dk>
- To: "Christopher R. Maden" <crism at maden dot org>
- Cc: docbook-apps at lists dot oasis-open dot org
- Date: Fri, 15 Mar 2002 12:56:29 +0100 (CET)
- Subject: Re: DOCBOOK-APPS: Choosing a characterset for DocBook
On Fri, 15 Mar 2002, Christopher R. Maden wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> At 02:58 AM 3/15/02, Jens Stavnstrup wrote:
> >On Fri, 15 Mar 2002, Christopher R. Maden wrote:
> > > 1) Do all of your entities (i.e., files) have encoding declarations? What
> > > are they? Remember that UTF-8 is the default unless you explicitly specify
> > > a different encoding (or use a byte-order mark, in which case UTF-16 is the
> > > default).
> >
> >The encoding chosen is, as stated above, ISO-8859-1, and yes, that is
> >specified in the XML declaration.
>
> OK - then somehow SAXON isn't honoring that.
>
> > > 2) How are you invoking the parser? From within SAXON, obviously - is
> > > SAXON being called from the command line, or within another program? What
> > > exactly are the parameters it's being passed?
> >
> >From Ant, with no specific parameters specified. (What are you
> >referring to, BTW?)
> >
> >I am still using Saxon 6.4.4, and checking the Change history in 6.5.1, I
> >do not see any specific problem with using ISO-8859-1.
>
> SAXON definitely does not have a problem with ISO 8859-1. So somehow it's
> being told to expect UTF-8. Exactly what are you using in Ant to call
> SAXON? I haven't done a lot of work with Ant - is SAXON being instructed
> to read the documents from the filesystem, or are they being passed as a
> stream of some sort to SAXON?
Yes, Saxon reads from the filesystem. The "exact" Ant command is
<javac saxon.class ...
<arg line="file.xml file.xsl saxon.extensions=1"/>
<classpath="saxon.classpath"/>
So, as you can see, nothing special. You are right that Saxon does not have
any problem with ISO-8859-1.
>
> >My problem is not so much which encoding I choose (if there are any bugs,
> >e.g. characters the parser can't accept, I can fix them), but rather
> >trying to keep my colleagues from running into these issues.
>
> Once you can get SAXON to correctly read in ISO 8859-1 data, you shouldn't
> have any problems; nearly every Windows and UNIX tool in a western European
> environment can edit this encoding. The biggest problem you'll run into is
> Windows users using the 128-159 range for things like curly quotes and
> ellipses; these characters are control characters in ISO 8859-1, and while
> not illegal, will not mean what the Windows user thinks they mean.
This is exactly the issue. When Word users cut and paste from a Word
document into an XML document that is also edited in Word, Word sometimes
adds extra characters in the 128-159 range. These are invisible in the Word
document, but Saxon considers the input UTF-8 and therefore comes to an
abrupt halt.
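
A minimal sketch, assuming Python is at hand, of how bytes in the 128-159
range could be located in a file before it is handed to Saxon. The function
name find_c1_bytes and the sample document are made up for illustration; the
Windows-1252 guess is only an assumption about what the Word user meant.

```python
def find_c1_bytes(data: bytes):
    """Return (offset, byte value, likely Windows-1252 character) for
    every byte in the 0x80-0x9F (128-159) range."""
    hits = []
    for offset, value in enumerate(data):
        if 0x80 <= value <= 0x9F:
            # A few bytes in this range are undefined in Windows-1252,
            # so replace them instead of raising an error.
            guess = bytes([value]).decode("cp1252", errors="replace")
            hits.append((offset, value, guess))
    return hits


# Hypothetical sample: declared as ISO-8859-1, but containing Windows
# curly quotes (0x93, 0x94) and an ellipsis (0x85).
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<p>\x93quoted\x94 text\x85</p>"
)

for offset, value, guess in find_c1_bytes(sample):
    print(f"offset {offset}: byte 0x{value:02X}, probably meant {guess!r}")

# The same bytes are not valid UTF-8, so a reader that assumes UTF-8
# stops with an error at the first one.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("UTF-8 decoding fails:", err.reason)
```

Decoded as ISO-8859-1, these bytes come out as control characters rather
than as quotes, so a check like this only reports the positions and leaves
the fix to a person.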
Jens
>
> ~Chris
> - --
> Christopher R. Maden, Principal Consultant, crism consulting
> DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
> <URL: http://crism.maden.org/consulting/ >
> PGP Fingerprint: BBA6 4085 DED0 E176 D6D4 5DFC AC52 F825 AFEC 58DA
> -----BEGIN PGP SIGNATURE-----
> Version: PGP Personal Privacy 6.5.8
>
> iQA/AwUBPJHVv6xS+CWv7FjaEQKwugCffMf14Ez0TdWE3EuyrGhaZnJGQHUAn3jn
> mFt26glbd7bgFtn2+LqSkP7n
> =qMy1
> -----END PGP SIGNATURE-----
>
--
------------------------------------------------------------------------
Jens Stavnstrup Phone :
Danish Defence Research Establishment Voice : + 45 - 39 15 17 97
Ryvangs Alle 1 - P.O. Box 2715 Fax : + 45 - 39 29 15 33
DK - 2100 Copenhagen O. E-Mail (Internet) :
Denmark js@ddre.dk
------------------------------------------------------------------------