This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: invalid character (Unicode: 0xa0) in xsl document - LONG


The essay below may or may not pertain to your actual problem. However,
it is very likely that your XML declares itself to be encoded as UTF-8
without that actually being the case.

jackson wrote:
> 
> Alan
> 
> > I'm processing an xsl file with the apache xalan 2 processor, and am
> > getting the following error message when i run my application:
> >
> > javax.xml.transform.TransformerConfigurationException: An invalid XML
> > character (Unicode: 0xa0) was found in the element content of the
> > document.
> 
> Well, your document says it's UTF-8. I'm not an expert on Unicode
> and related issues, but i think 0xa0, while it is Unicode, is not a possible
> UTF-8 character.
> 
> The character 0xa0 is a non-breaking space. I don't know how
> it might have got in your document (possibly from some HTML?),
> but you could find it and get rid of it. Since it's white space, it's
> not going to be obvious.
> 
> You could write a script to look for this character and change
> it - say, to a normal space. You could also do it in your java
> program i suppose, before parsing.
> 
> I suppose you could also turn 0xa0 into the UTF-8 equivalent
> (i can't help you there). Java classes might be able to do it for
> you - from what i remember (quite a while ago), there is a class
> for writing to a UTF file?
> 
> David Jackson
> 

A brief note before the long-winded part: I suspect you are referring
to the DataInputStream and DataOutputStream classes, which have the
methods readUTF() and writeUTF(). These methods read and write a
length-prefixed, modified form of UTF-8 that will not be meaningful to
a standards-compliant processor. Specifying an encoding name to the
constructor of an InputStreamReader or OutputStreamWriter will work,
as will passing an encoding name to the String method getBytes().
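
To make the difference concrete, here is a small sketch that writes the
same string both ways and dumps the bytes. The class and method names
(WriteUtfDemo, viaWriteUTF, viaWriter, hex) are made up for this
illustration; only the java.io classes are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class WriteUtfDemo {
    // Bytes produced by DataOutputStream.writeUTF(): a two-byte length
    // prefix followed by the string in Java's modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes produced by an OutputStreamWriter that was given an explicit
    // encoding name: plain UTF-8 with no prefix.
    static byte[] viaWriter(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(buf, "UTF-8");
            w.write(s);
            w.close();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hex dump helper, e.g. "61 C2 A0 62".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u00A0b";                    // 'a', no-break space, 'b'
        System.out.println(hex(viaWriteUTF(s)));  // 00 04 61 C2 A0 62
        System.out.println(hex(viaWriter(s)));    // 61 C2 A0 62
    }
}
```

The leading 00 04 in the first line is the length prefix; an XML parser
would see it as two stray bytes in front of the document.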

Your other option is to figure out what encoding your system uses
by default and declare that in the encoding attribute in your XML
prolog. However, the only two encodings required for all XML processors
by the standard are UTF-8 and UTF-16.

Now for the long part:

UTF-8 is a method for representing Unicode characters (numeric codes,
commonly handled as 16-bit values) on a stream of 8-bit units. Given
that a large volume of data is still primarily composed of the
traditional ASCII characters, which require only 7 bits to represent,
using 16 bits per character would be quite inefficient. UTF-8 uses a
single octet with the high bit set to 0 to represent characters that
fall into the ASCII range. For character codes that are larger, more
than one octet is used. The leading bits of the first octet are used to
indicate (1) that more than one octet should be read and (2) how many.
The following octets begin with a pattern that indicates that they are
not the start of a character. The remaining bits in each octet are then
used to hold the actual value being stored.
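
As a rough illustration of that bit layout, the sketch below encodes a
single 16-bit code by hand and prints the resulting octets in binary.
Utf8Layout, encode and bits are names invented for this example; real
code should let the Java library do the encoding. The sketch ignores
codes above 0xFFFF and does not reject surrogate values.

```java
class Utf8Layout {
    // Encode one code in the range 0x0000-0xFFFF into UTF-8 octets.
    static int[] encode(int code) {
        if (code < 0x80) {
            // 0xxxxxxx: ASCII range, one octet, high bit 0
            return new int[] { code };
        }
        if (code < 0x800) {
            // 110xxxxx 10xxxxxx: two octets carrying 11 bits of value
            return new int[] { 0xC0 | (code >> 6), 0x80 | (code & 0x3F) };
        }
        // 1110xxxx 10xxxxxx 10xxxxxx: three octets carrying 16 bits of value
        return new int[] {
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F)
        };
    }

    // Render each octet as eight binary digits, separated by spaces.
    static String bits(int[] octets) {
        StringBuilder sb = new StringBuilder();
        for (int o : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%8s", Integer.toBinaryString(o)).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(encode(0x41)));   // 01000001
        System.out.println(bits(encode(0xA0)));   // 11000010 10100000
        System.out.println(bits(encode(0x20AC))); // 11100010 10000010 10101100
    }
}
```

Note that the no-break space, code 0xA0, comes out as the two octets
C2 A0; the octet A0 on its own matches the "not the start of a
character" pattern.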

The overall effect is that if your data is all ASCII, the UTF-8
encoding comes out just like a traditional ASCII file: one
character for every 8 bits. You can create and read such files
with traditional software that has never heard of UTF-8.
If the data uses characters whose codes are 128 or greater, UTF-8
represents each of them as multiple octets, and a system that is not
making the appropriate interpretation will misread them or report an
error.

XML requires all XML processors to support UTF-8, and the prolog
<?xml version="1.0" encoding="UTF-8" ?> has been added to a great
number of XML files as a hard-coded string, based in part on copying
examples. The data in those files is then generated by a system that
may not be aware of what UTF-8 really means and uses some other actual
encoding scheme (Cp1252 aka winAnsi aka Windows-Latin-1, for example).
In Cp1252 the non-breaking space is the single byte 0xA0, which cannot
stand alone in UTF-8. The end result is that the XML processor expects
UTF-8 encoding, finds a bit pattern that is not valid in UTF-8, and
screams.
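
One way to check for this situation before handing a file to the XML
processor is to run the raw bytes through a strict UTF-8 decoder. The
following is only a sketch: Utf8Check and isValidUtf8 are names I made
up, and ISO-8859-1 stands in for Cp1252 here because every Java runtime
is guaranteed to have it and both map the non-breaking space to the
byte 0xA0.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "<a>x\u00A0y</a>";

        // Same characters, two different byte encodings.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // false: lone 0xA0 octet
        System.out.println(isValidUtf8(utf8));   // true: 0xC2 0xA0 pair
    }
}
```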

In Java, a character is an unsigned 16-bit value containing a
Unicode character code. When reading or writing characters from
8-bit byte-oriented streams or buffers, many Java classes give the
option of specifying the name of an encoding to use and apply the
system default otherwise. The String method getBytes("UTF-8")
would return a buffer of bytes representing the String's characters
using the UTF-8 encoding. Alternatively, you could wrap an
OutputStreamWriter around your actual OutputStream with the
encoding set in the constructor.
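
Putting the reader and writer together gives a simple way to convert
existing data to real UTF-8. This is a minimal sketch, assuming you
know (or can guess) the encoding the data was actually written in;
Transcode, transcode and toUtf8 are hypothetical names, and
"ISO-8859-1" is just the example source encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class Transcode {
    // Decode bytes using the encoding they were really written in,
    // then re-encode the characters as UTF-8.
    static void transcode(InputStream in, String fromEncoding, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, fromEncoding);
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    // Convenience wrapper for in-memory data.
    static byte[] toUtf8(byte[] source, String fromEncoding) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transcode(new ByteArrayInputStream(source), fromEncoding, out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x78, (byte) 0xA0, 0x79 };   // 'x', NBSP, 'y'
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);      // 78 C2 A0 79
        }
        System.out.println();
    }
}
```

For files, the same transcode() method can be given a FileInputStream
and a FileOutputStream. If the document carries an encoding declaration
in its prolog, that declaration still has to agree with the bytes after
the conversion.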

Hope this helps.

Eric Jacobson

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

