This is the mail archive of the xsl-list@mulberrytech.com mailing list .



RE: invalid character (Unicode: 0xa0) in xsl document - LONG


This is correct -- 0xA0 cannot appear as the first byte of a UTF-8
sequence [1].  It can appear only as a continuation byte, that is, as
the second or later byte of a multi-byte sequence; U+00A0 itself is
encoded in UTF-8 as the two bytes C2 A0.  So if the parser finds a bare
A0 where it expects the start of a character, the file is most likely
not UTF-8 at all (ISO-8859-1 and Cp1252 both store the non-breaking
space as the single byte A0).

Note that byte order does not come into it.  UTF-8 is defined as a
sequence of bytes, so the same file reads identically on big-endian and
little-endian systems, and a BOM is optional.  Byte-order mismatches
are a UTF-16 problem: there the two bytes of each code unit can arrive
in either order, and without a BOM the parser may guess wrong.  Lesson
is, always use a BOM with UTF-16 :-)

Also note that just using an encoding stream that does UTF-8 as
suggested below will not solve all of your problems.  There are
characters which are not valid XML [2], but which are perfectly valid
UTF-8.  I am not aware of any streamwriters that automatically strip
these out for you.
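
For what it is worth, filtering them out yourself is not much code.
The following is only a minimal sketch, assuming the XML 1.0 Char
production from [2]; the class name XmlCharFilter and both method names
are made up for illustration and are not part of any library.

```java
final class XmlCharFilter {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every non-XML character dropped. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);          // handles surrogate pairs
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0001 encodes fine in UTF-8 but is not a legal XML 1.0 character;
        // U+00A0 is legal and is kept.
        String cleaned = stripInvalidXmlChars("a\u0001b\u00A0c");
        System.out.println(cleaned.length());
    }
}
```

Whether silently dropping such characters is acceptable depends on your
data; replacing them with a visible marker may be the safer choice.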


[1] http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
(see table 3.1b)
[2] http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

Regards,
Joshua


> -----Original Message-----
> From: Eric Jacobson [mailto:ericjacobson@mediaone.net]
> Sent: Saturday, April 28, 2001 8:47 PM
> To: xsl-list@lists.mulberrytech.com
> Subject: Re: [xsl] invalid character (Unicode: 0xa0) in xsl document -
> LONG
> 
> The essay below may or may not pertain to your actual problem. However,
> it is quite likely that your XML is declaring itself to be
> encoded as UTF-8 without that actually being the case.
> 
> jackson wrote:
> >
> > Alan
> >
> > > I'm processing an xsl file with the apache xalan 2 processor, and am
> > > getting the following error message when i run my application:
> > >
> > > javax.xml.transform.TransformerConfigurationException: An invalid XML
> > > character (Unicode: 0xa0) was found in the element content of the
> > > document.
> >
> > Well, your document says it's UTF-8. I'm not an expert on Unicode
> > and related issues, but i think 0xa0, while it is Unicode, is not a
> > possible UTF-8 character.
> >
> > The character 0xa0 is a non-breaking space. I don't know how
> > it might have got in your document (possibly from some HTML?),
> > but you could find it and get rid of it. Since it's white space, it's
> > not going to be obvious.
> >
> > You could write a script to look for this character and change
> > it - say, to a normal space. You could also do it in your java
> > program i suppose, before parsing.
> >
> > I suppose you could also turn 0xa0 into the UTF-8 equivalent
> > (i can't help you there). Java classes might be able to do it for
> > you - from what i remember (quite a while ago), there is a class
> > for writing to a UTF file?
> >
> > David Jackson
> >
> 
> A brief note before the long-winded part: I suspect you are referring
> to the DataInputStream and DataOutputStream classes, which have
> methods to readUTF() and writeUTF(). These methods read and write a
> modified form of UTF-8 that will not be meaningful to a
> standards-compliant processor.  Specifying an encoding name to the
> constructor of an InputStreamReader or OutputStreamWriter will work,
> as will passing an encoding name to the String method getBytes().
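
To make the difference concrete, here is a small Java sketch that
compares the bytes written by DataOutputStream.writeUTF() with those
returned by String.getBytes() for the standard UTF-8 charset. The class
ModifiedUtf8Demo and its helper methods are invented for this example;
only the JDK calls inside them are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Demo {
    /** Bytes produced by writeUTF: a 2-byte length, then modified UTF-8. */
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Bytes produced by the standard UTF-8 charset. */
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "a\u00A0";
        System.out.println(hex(viaWriteUTF(s)));   // 00 03 61 C2 A0
        System.out.println(hex(viaGetBytes(s)));   // 61 C2 A0
    }
}
```

The leading two-byte length alone is enough to confuse an XML parser,
and writeUTF() also encodes U+0000 as C0 80, which is not well-formed
standard UTF-8.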
> 
> Your other option is to figure out what encoding your system uses
> by default and declare that in the encoding attribute in your XML
> prolog. However, the only two encodings required for all XML processors
> by the standard are UTF-8 and UTF-16.
> 
> Now for the long part:
> 
> UTF-8 is a method for representing Unicode characters (numeric code
> points, most of which fit in 16 bits) on a stream of 8-bit units. Given
> that a large volume of data is still primarily composed of the
> traditional ASCII characters, which require only 7 bits to represent,
> using 16 bits per character would be quite inefficient. UTF-8 uses a
> single octet with the high bit set to 0 to represent characters that
> fall into the ASCII range. For character codes that are larger, more
> than one octet is used. The leading bits of the first octet indicate
> (1) that more than one octet should be read and (2) how many. The
> following octets begin with a pattern that indicates that they are not
> the start of a character. The remaining bits in each octet are then
> used to hold the actual value being stored.
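
The bit layout described above can be sketched in Java as follows. This
is a hand-rolled illustration only (the class Utf8Sketch and its
encode() method are hypothetical); in real code you would let the JDK
do the encoding.

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    /** Encodes one Unicode scalar value (no surrogates) as UTF-8 octets. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (cp < 0x80) {
            out.write(cp);                          // 0xxxxxxx (ASCII)
        } else if (cp < 0x800) {
            out.write(0xC0 | (cp >> 6));            // 110xxxxx: two octets
            out.write(0x80 | (cp & 0x3F));          // 10xxxxxx: continuation
        } else if (cp < 0x10000) {
            out.write(0xE0 | (cp >> 12));           // 1110xxxx: three octets
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {
            out.write(0xF0 | (cp >> 18));           // 11110xxx: four octets
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+00A0 needs two octets: C2 A0.
        for (byte b : encode(0xA0)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Note that A0 shows up here only as the second octet, after the lead
octet C2; it never starts a sequence.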
> 
> The overall effect is that if your data is all ASCII, the UTF-8
> encoding comes out just like a traditional ASCII file - one
> character for every 8-bits. You can create and read such files
> with traditional software that never actually heard of UTF-8.
> If the data uses characters whose codes are >= 128, UTF-8 translates
> those into multiple octets, and a system that does not interpret them
> as UTF-8 will come up with an error.
> 
> XML requires all XML processors to
> support UTF-8, and the prolog <?xml version="1.0" encoding="UTF-8" ?>
> has been added to a great number of XML files as a hard-coded string,
> based in part on copying examples.
> The data in those files is then generated by a system that may not
> be aware of what UTF-8 really means and that uses some other actual
> encoding scheme (Cp1252 aka winAnsi aka Windows-Latin-1, for example).
> The end result is that the XML processor expects UTF-8 encoding,
> finds a bit pattern that is not valid in UTF-8, and screams.
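
One way to check for this situation before the file ever reaches the
XML processor is to run the bytes through a strict UTF-8 decoder. The
sketch below assumes the data is already in a byte array; the class
StrictUtf8Check and the method isWellFormedUtf8() are names invented
for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    /** Returns true only if the bytes decode as well-formed UTF-8. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail, do not substitute
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "a\u00A0b";
        // Latin-1 stores the non-breaking space as the single byte A0,
        // which is not a valid start of a UTF-8 sequence.
        System.out.println(isWellFormedUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // false
        System.out.println(isWellFormedUtf8(text.getBytes(StandardCharsets.UTF_8)));      // true
    }
}
```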
> 
> In Java, a character is an unsigned 16 bit value containing a
> Unicode character code. When reading or writing characters from
> 8-bit byte oriented streams or buffers, many Java classes give the
> option of specifying the name of an encoding to use and apply a
> system default otherwise. The String method getBytes("UTF8")
> would return a buffer of bytes representing the String's characters
> using the UTF-8 encoding. Alternatively, you could wrap an
> OutputStreamWriter around your actual OutputStream with the
> encoding set in the constructor.
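
As a rough illustration of the wrapping approach, the sketch below
writes a string through an OutputStreamWriter and reads it back through
an InputStreamReader, each with an explicit encoding name. The class
EncodingRoundTrip and its two methods are hypothetical, and in-memory
streams stand in for whatever stream you actually use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

final class EncodingRoundTrip {
    /** Encodes text by wrapping an OutputStream in an OutputStreamWriter. */
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, encoding)) {
            w.write(text);
        } catch (IOException e) {   // also covers an unknown encoding name
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decodes bytes by wrapping an InputStream in an InputStreamReader. */
    static String read(byte[] bytes, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = write("x\u00A0y", "UTF-8");
        System.out.println(utf8.length);                        // 4: the NBSP takes two bytes
        System.out.println(read(utf8, "UTF-8").equals("x\u00A0y"));
    }
}
```

The important part is that the encoding named in the code matches the
one declared in the XML prolog.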
> 
> Hope this helps.
> 
> Eric Jacobson
> 
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

