This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.



Re: Choosing a characterset for DocBook


On Fri, 15 Mar 2002, Christopher R. Maden wrote:

> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> At 02:58 AM 3/15/02, Jens Stavnstrup wrote:
> >On Fri, 15 Mar 2002, Christopher R. Maden wrote:
> > > 1) Do all of your entities (i.e., files) have encoding declarations?  What
> > > are they?  Remember that UTF-8 is the default unless you explicitly
> > > specify a different encoding (or use a byte-order mark, in which case
> > > UTF-16 is the default).
> >
> >The encoding chosen is, as stated above, ISO-8859-1, and yes, that is
> >specified in the XML declaration.
> 
> OK - then somehow SAXON isn't honoring that.
> 
> > > 2) How are you invoking the parser?  From within SAXON, obviously - is
> > > SAXON being called from the command line, or within another program?  What
> > > exactly are the parameters it's being passed?
> >
> >From Ant, with no specific parameters specified. (What are you
> >referring to, BTW?)
> >
> >I am still using Saxon 6.4.4, and checking the Change history in 6.5.1, I
> >do not see any specific problem with using ISO-8859-1.
> 
> SAXON definitely does not have a problem with ISO 8859-1.  So somehow it's 
> being told to expect UTF-8.  Exactly what are you using in Ant to call 
> SAXON?  I haven't done a lot of work with Ant - is SAXON being instructed 
> to read the documents from the filesystem, or are they being passed as a 
> stream of some sort to SAXON?


Yes, Saxon reads from the filesystem. The "exact" Ant command is

  <java classname="${saxon.class}" ...>
      <arg line="file.xml file.xsl saxon.extensions=1"/>
      <classpath refid="saxon.classpath"/>
  </java>

So as you see, nothing special. You are right that Saxon does not have any
problem with ISO-8859-1.
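
To make the suspected failure concrete, here is a minimal Python sketch. It is
not part of our build and not how Saxon or its parser works internally. The
helper names declared_encoding and decodes_as are made up for illustration,
and the declaration sniffing is deliberately simplified. It only shows that a
file declared and stored as ISO-8859-1 decodes fine as ISO-8859-1 but is
rejected by a strict UTF-8 decoder as soon as it contains an accented
character.

```python
import re

def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, else the XML default.

    Simplified illustration: a real parser also handles a UTF-8 BOM,
    UTF-16 without a BOM, and external encoding information.
    """
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"  # byte-order mark present
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', data
    )
    return match.group(1).decode("ascii") if match else "UTF-8"

def decodes_as(data: bytes, encoding: str) -> bool:
    """Report whether the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# A small document stored as ISO-8859-1; \u00e9 (e-acute) becomes the byte 0xE9.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<para>Ryvangs All\u00e9</para>\n"
).encode("iso-8859-1")

print(declared_encoding(sample))
print(decodes_as(sample, "iso-8859-1"))
print(decodes_as(sample, "utf-8"))
```

If the declaration is honored, the document reads cleanly; if something
upstream forces UTF-8, the same bytes are malformed input.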



> 
> >My problem is not so much which encoding I choose (if there are any bugs,
> >e.g. characters the parser can't accept, I can fix them), but rather
> >trying to keep my colleagues from running into these issues.
> 
> Once you can get SAXON to correctly read in ISO 8859-1 data, you shouldn't 
> have any problems; nearly every Windows and UNIX tool in a western European 
> environment can edit this encoding.  The biggest problem you'll run into is 
> Windows users using the 128-159 range for things like curly quotes and 
> ellipses; these characters are control characters in ISO 8859-1, and while 
> not illegal, will not mean what the Windows user thinks they mean.


This is exactly the issue. When Word users cut and paste from a Word
document into an XML document that is also edited in Word, Word sometimes
adds extra characters in the 128-159 range. These are invisible in the Word
document, but Saxon treats the input as UTF-8 and therefore comes to an
abrupt halt.
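
One way to catch these characters before they reach Saxon would be to scan
the file for bytes in the 128-159 (0x80-0x9F) range. The following Python
sketch is only an illustration under that assumption; find_c1_bytes is a
made-up name, and the Windows-1252 reading is a guess at what the Word user
meant, not something the file itself states.

```python
def find_c1_bytes(data: bytes):
    """Yield (offset, byte, Windows-1252 reading) for each byte in 0x80-0x9F.

    In ISO-8859-1 these bytes are control characters; in Windows-1252 most
    of them are punctuation such as curly quotes and the ellipsis.
    """
    for offset, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # A few of these bytes are unassigned in Windows-1252;
            # errors="replace" turns those into U+FFFD instead of raising.
            reading = bytes([byte]).decode("cp1252", errors="replace")
            yield offset, byte, reading

# Hypothetical pasted text: curly quotes (0x93, 0x94) and an ellipsis (0x85).
pasted = b"He said \x93hi\x94\x85"

for offset, byte, reading in find_c1_bytes(pasted):
    print(f"offset {offset}: byte 0x{byte:02X}, probably U+{ord(reading):04X}")
```

A check like this could run as a step before the transformation, so that a
colleague gets a list of suspicious offsets instead of a parser error.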


Jens


> 
> ~Chris
> - -- 
> Christopher R. Maden, Principal Consultant, crism consulting
> DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
> <URL: http://crism.maden.org/consulting/ >
> PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA
> -----BEGIN PGP SIGNATURE-----
> Version: PGP Personal Privacy 6.5.8
> 
> iQA/AwUBPJHVv6xS+CWv7FjaEQKwugCffMf14Ez0TdWE3EuyrGhaZnJGQHUAn3jn
> mFt26glbd7bgFtn2+LqSkP7n
> =qMy1
> -----END PGP SIGNATURE-----
> 

-- 

------------------------------------------------------------------------
Jens Stavnstrup                            Phone :
Danish Defence Research Establishment         Voice : + 45 - 39 15 17 97
Ryvangs Alle 1 - P.O. Box 2715                Fax   : + 45 - 39 29 15 33
DK - 2100 Copenhagen O.                    E-Mail (Internet) :
Denmark                                       js@ddre.dk
------------------------------------------------------------------------





