This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: Special Characters in URLs
- To: xsl-list at lists dot mulberrytech dot com
- Subject: Re: [xsl] Special Characters in URLs
- From: Mike Brown <mike at skew dot org>
- Date: Tue, 19 Jun 2001 10:58:25 -0600 (MDT)
- Reply-To: xsl-list at lists dot mulberrytech dot com
Eriksson Magnus wrote:
> Yes, the URIs are interpreted by the Web Server/Web browser but I need them
> to be generated correctly by the XSLT processor -- to comply with the
> HTTP-standard (e.g. no white space in URLs). Is there a way to achieve this?
Re: the encoding:
The encoding of the document as a whole has no bearing on the %-style
escaping of characters in a URI. So for example if you have in your
stylesheet
<xsl:output method="html" encoding="iso-8859-1"/>
and
<a href="http://skew.org/printenv?greeting={greeting}">click</a>
and your XML has:
<greeting>¡Hola!</greeting>
then your output should end up like:
<a href="http://skew.org/printenv?greeting=%C2%A1Hola!">click</a>
You may have thought that the last 6 characters of that URI reference
would be bytes like:
¡  H  o  l  a  !
A1 48 6F 6C 61 21 <-- iso-8859-1 bytes
because if you just did <xsl:value-of select="greeting"/> that is
precisely what you would get.
It changes when the XSL processor emits it in an href attribute
because of this clause in the XSLT spec: "The html output method should
escape non-ASCII characters in URI attribute values using the method
recommended in Section B.2.1 of the HTML 4.0 Recommendation". And that
section says to use UTF-8 as the basis for the %-escaping of the URI. This
means you likely get this in the output:
%  C  2  %  A  1  H  o  l  a  !
25 43 32 25 41 31 48 6F 6C 61 21 <-- iso-8859-1 bytes, still
See, you *did* get iso-8859-1 output like you asked for. The UTF-8-ness is
actually at a higher level of abstraction.
Note that this escaping happens *only* for non-ASCII characters
(U-00000080 and higher). So it does not affect those ASCII characters that
are reserved or disallowed in a URI, like " ", among others.
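To make the two levels concrete, here is a rough Python sketch of the escaping described above. It is only an illustration of the HTML 4.0 B.2.1 method, not what any particular XSLT processor runs; the function name escape_non_ascii is made up for this example.

```python
def escape_non_ascii(uri_ref: str) -> str:
    """Percent-escape only characters at U+0080 and above, using UTF-8 bytes.

    Hypothetical helper: mimics the HTML 4.0 B.2.1 recommendation.
    """
    out = []
    for ch in uri_ref:
        if ord(ch) < 0x80:
            # ASCII passes through untouched, including " " and other
            # characters that are reserved or disallowed in a URI.
            out.append(ch)
        else:
            # One %HH triplet per byte of the character's UTF-8 encoding.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


escaped = escape_non_ascii("\u00a1Hola!")  # the string is "¡Hola!"
print(escaped)

# Serializing the escaped text as iso-8859-1 yields plain ASCII-range bytes.
as_bytes = escaped.encode("iso-8859-1")
print(as_bytes.hex(" ").upper())

# A space is not touched by this step.
print(escape_non_ascii("hola mundo"))
```

The escaped string contains only ASCII characters, so the output encoding chosen in xsl:output makes no difference to how those characters are written out.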
Even if the XSLT processor failed to do the UTF-8 based escaping of
non-ASCII characters, the HTML user agents are supposed to do it when
interpreting the URI reference anyway.
Of course your problem is on the server end. Chances are, you are coding
using an API that expects iso-8859-1 as the basis for the URL escaping,
which is perfectly reasonable to do, especially in light of the fact that
browsers tend to send URL-encoded form data with the URL-escaping being
based on the actual encoding of the document containing the form (rather,
the encoding that the browser is assuming the containing document is
using; this is user-overridable).
If you make the containing document utf-8 instead of iso-8859-1, you can
assume that all the escaping is UTF-8 based, and then you can convert the
misinterpreted-as-iso-8859-1 strings you get from the form data API back
to iso-8859-1 bytes and then read these bytes back into a string using
utf-8 interpretation.
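That round trip might look like the following in Python. This is a sketch under the assumption that your form data API unescapes with iso-8859-1; here urllib.parse.unquote stands in for that API, and recover_utf8 is a hypothetical helper name.

```python
from urllib.parse import unquote


def recover_utf8(misread: str) -> str:
    """Undo an iso-8859-1 misreading of UTF-8 based escapes.

    Turns the string back into the original bytes, then decodes
    those bytes as UTF-8.
    """
    return misread.encode("iso-8859-1").decode("utf-8")


# Stand-in for an API that assumes iso-8859-1 when unescaping.
misread = unquote("%C2%A1Hola!", encoding="iso-8859-1")
print(repr(misread))   # two characters where one was intended

fixed = recover_utf8(misread)
print(repr(fixed))
```

This only works if every escaped value really was UTF-8 based. A value escaped on an iso-8859-1 basis, such as %A1 alone, is not valid UTF-8 and the decode step would raise an error.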
Your other option is to avoid putting the raw non-ASCII characters in the
URI refs in the first place. If you absolutely must have %A1 for inverted
exclamation mark, then the only way to ensure this is to make your
stylesheet put %A1 in the result tree. You can do this using an extension
function (ideal) or with a clever recursive template.
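For reference, this is the string such an extension function would need to return, sketched in Python with the standard urllib.parse.quote. How you wire a function like this into your XSLT processor is processor-specific and not shown; the name escape_latin1 is invented for the example.

```python
from urllib.parse import quote


def escape_latin1(value: str) -> str:
    """Percent-escape a value on an iso-8859-1 basis (hypothetical helper).

    "!" is kept literal here only to match the example in this message.
    """
    return quote(value, safe="!", encoding="iso-8859-1")


result = escape_latin1("\u00a1Hola!")  # the string is "¡Hola!"
print(result)
```

Because the result is pure ASCII, the html output method has nothing left to escape and passes it through as written.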
Re: escaping of ASCII characters like " " (space), you must also control
this in your stylesheet. If you want "+" or "%20" (the latter is
preferable), then have your stylesheet explicitly put that in the result
tree.
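The two spellings of an escaped space, as produced by Python's urllib.parse, are shown below purely to illustrate the difference; the sample value is made up.

```python
from urllib.parse import quote, quote_plus

value = "Hola mundo"

# %20 is valid anywhere in a URI.
with_percent = quote(value, safe="")
print(with_percent)

# "+" for space is a convention of URL-encoded form data.
with_plus = quote_plus(value)
print(with_plus)
```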
See also: http://skew.org/xml/misc/URI-i18n/
Hope this helps.
- Mike
_____________________________________________________________________________
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA | personal: http://hyperreal.org/~mike/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list