This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Canonical XML in Databases (Re: sort problem)
- From: Peter Davis <pdavis152 at attbi dot com>
- To: xsl-list at lists dot mulberrytech dot com
- Cc: aruniima dot chakrabarti at iflexsolutions dot com
- Date: Thu, 5 Sep 2002 00:16:43 -0700
- Subject: Canonical XML in Databases (Re: [xsl] sort problem)
- References: <CBF6DBC01C62C64DA820DCFCD48E05C8AEFF33@fmg-nt.spz.i-flex.com>
- Reply-to: xsl-list at lists dot mulberrytech dot com
On Thursday 05 September 2002 00:07, you wrote:
> [the output XML] will be treated as text at a later point. We need to match
> xml
> from a database of some 100 xmls & find a match for the same. The problem
> is that to match 2 xmls, we will be using text comparison as
> 1. the database mite not support xml parsers
> 2. DOM matching wud be very time consuming...
> so the requirement is to store all files as text sorted as they will be
> treated as text only files by the database...
> in case there is a better way to find xml matches, do help me out on the
> same too...
(Hope you don't mind if I post this to the list -- I think it is an
interesting question.)
Hmm, I have to say that isn't a very robust way to go about it. There are
several assumptions you have to make that can be broken by any piece of your
system.
A lot of thought has been put into this problem, and the answer is even more
complicated than just comparing two DOMs. See this W3C recommendation for
canonicalizing ("c14n") XML documents:
http://www.w3.org/TR/xml-c14n
The assumptions you have to make when you compare XML documents as text are
(but aren't limited to):
* Attribute order: even if you use <xsl:sort> when outputting attributes,
there is no guarantee that your XSLT processor will honor that order.
* Character sets: two documents can be written in different character sets and
have different byte representations (your database might compare the text as
a string of bytes, rather than a string of Unicode characters), but yet have
the same meaning.
* Character escaping / CDATA sections: exactly which characters are escaped by
your processor is not guaranteed. For example, '&gt;' and '>' have the same
meaning, but obviously different text values.
I'm sure there are many other considerations, which should all be addressed by
the xml-c14n spec.
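To make the points above concrete, here is a minimal sketch in Python (not part of the original setup, which uses XSLT). It assumes Python 3.8 or later, where the standard library provides xml.etree.ElementTree.canonicalize(). Note that this function implements the newer C14N 2.0 rules rather than the 1.0 recommendation linked above, so treat it as an illustration of the idea and not of that exact spec. The two sample documents and the helper name canonical_text are made up for the example.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+

# Hypothetical documents: same content, but different attribute order,
# quoting style, and escaping (a CDATA section vs. an entity reference).
doc_a = '<item b="2" a="1"><![CDATA[x > y]]></item>'
doc_b = "<item a='1' b='2'>x &gt; y</item>"


def canonical_text(xml_text: str) -> str:
    """Return a canonical serialization that can be compared as plain text."""
    return canonicalize(xml_text)


canon_a = canonical_text(doc_a)
canon_b = canonical_text(doc_b)

print(doc_a == doc_b)      # comparison of the raw text
print(canon_a == canon_b)  # comparison after canonicalization
```

If the canonical string is what gets stored in the database, a plain text comparison there no longer depends on how a particular processor happened to serialize the document. By default this function keeps whitespace inside text content and drops comments, so those choices still have to be made consistently.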
I'm not saying what you are trying to do won't work. As long as you always
use the same XML processor, stylesheet, character set (UTF-8?), and you don't
add comments, CDATA sections, whitespace, etc., this will work into the
future. What you should consider is what happens when you try to use a newer
version of your processor that changes its rules (but still outputs
equivalent XML, just not equivalent text), or if some new person comes to
work for you who doesn't follow your rules. Maintainability is always an
issue when you are designing software systems.
So, just keep the issues in mind when you do this. It will work, but if it
stops working one day after you try to upgrade your system, you will know
where to look.
--
Peter Davis
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list