This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: extracting data in CDATA block of a XML document


Srinivas Ch wrote:
> Now I need to extract all the elements between the
> <![CDATA[ and ]]> and write it into a new xml file.

This is a FAQ, but we all like to give long-winded answers rather than point
you to www.dpawson.co.uk.

The other answers to your question so far have been trying to tell you:

1. What you want is not possible with XSLT, at least not in a way that is
reliable. We aren't going to tell you the unreliable way because you need to
approach this problem differently if you don't want to get burned.

2. It was a poor design decision to embed structured markup in the character
data content of an XML element. Character data is by definition NOT MARKUP.

3. CDATA sections are a convenience for document authors and are relevant for
input only. They just keep you from having to escape "<" and "&" in character
data. A CDATA section says "this looks like markup but it isn't really". The
idea is that

<foo><![CDATA[<bar/>]]></foo>

and

<foo>&lt;bar/></foo>

mean exactly the same thing: An element named 'foo' containing the 6
characters '<bar/>'; NOT an element named 'foo' containing an empty element
named 'bar'. If you wanted the latter, you'd have written <foo><bar/></foo>.

In XPath/XSLT you deal with a node tree that is set up quite similarly:

element 'foo' in no namespace
  |
  |__text '<bar/>'

The text node is going to be what you see there, regardless of whether you
used a CDATA section in the original document.
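
If you want to see this for yourself, here is a minimal sketch in Python,
assuming the standard library's xml.etree.ElementTree as the parser (any
conforming XML parser should behave the same way). The variable names are
made up for illustration. It parses the two spellings of 'foo' above, plus the
version with a real child element, and compares what ends up in the tree.

```python
import xml.etree.ElementTree as ET

# Two ways of writing the same character data, plus real markup for contrast.
with_cdata = ET.fromstring("<foo><![CDATA[<bar/>]]></foo>")
with_escape = ET.fromstring("<foo>&lt;bar/></foo>")
with_child = ET.fromstring("<foo><bar/></foo>")

# Both text forms give an element 'foo' holding the 6 characters '<bar/>'.
print(repr(with_cdata.text), len(with_cdata.text))
print(with_cdata.text == with_escape.text)

# Neither of them has a child element; only the third document does.
print(len(with_cdata), len(with_escape), len(with_child))
print(with_child[0].tag)
```

Once parsing is done, nothing in the tree records that a CDATA section was
ever there, which is why a stylesheet cannot tell the first two apart.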

Since you want XML output, the question is how to produce a result tree that
looks like this:

  element 'bar' in no namespace

And the answer is, that's pretty darn difficult because you would have to
mimic the duties of an XML parser, tearing apart the string in the text node
in order to build the right nodes in the result tree.
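
Outside of XSLT, the "tearing apart" can be handed to a real parser in a
second pass. The following is only a sketch of that idea, not an XSLT
solution: it assumes Python's xml.etree.ElementTree, assumes the embedded
string is a well-formed fragment with no XML declaration of its own, and uses
a hypothetical function name (reparse_embedded) and wrapper element name
(extracted) that are not from the original question.

```python
import xml.etree.ElementTree as ET

def reparse_embedded(outer_xml: str, wrapper: str = "extracted") -> ET.Element:
    """Parse the outer document, then parse its text content a second time.

    The text is wrapped in a new root element (name is an assumption) so that
    several sibling elements inside the CDATA section are still acceptable.
    Raises ET.ParseError if the embedded text is not well-formed markup.
    """
    outer = ET.fromstring(outer_xml)
    embedded = outer.text or ""          # plain characters, CDATA or not
    return ET.fromstring(f"<{wrapper}>{embedded}</{wrapper}>")

result = reparse_embedded("<foo><![CDATA[<bar/><baz>hi</baz>]]></foo>")
print([child.tag for child in result])   # element nodes, no longer text

# Character data is not obliged to be well-formed markup, so this can fail.
try:
    reparse_embedded("<foo><![CDATA[<bar>]]></foo>")
    broken_input_rejected = False
except ET.ParseError:
    broken_input_rejected = True
print(broken_input_rejected)
```

The second call is the important one: because the content was declared to be
character data, nothing guaranteed it would survive a second parse.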

The workaround that some idiot is going to suggest with a "hey, it works for
me!", without realizing how unportable it is, involves leaving the text node
unchanged but flagging it as an exceptional case for unmodified serialization,
so that it is emitted as a string of what could very well be total garbage in
the middle of proper, well-formed XML. And that only works if you're
serializing the result tree, which isn't always a good assumption (in a
browser-based processor you're likely to be passing it on as a DOM).
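
To make the "total garbage" risk concrete, here is a small sketch, again in
Python and again only an illustration. It does not run an XSLT processor; it
imitates a serializer by string concatenation, once emitting the text node
unmodified and once escaping it with xml.sax.saxutils.escape. The sample
strings and the helper name is_well_formed are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as an XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

lucky_text = "<bar/>"            # happens to look like well-formed markup
unlucky_text = "if a < b && c"   # perfectly legal character data

# Emitting the text unmodified only works when the text happens to cooperate.
print(is_well_formed(f"<out>{lucky_text}</out>"))
print(is_well_formed(f"<out>{unlucky_text}</out>"))

# Normal serialization escapes the text, so the output is always well-formed.
print(is_well_formed(f"<out>{escape(unlucky_text)}</out>"))
```

The first case is the "hey, it works for me!" result; the second is what
happens as soon as the character data is anything other than clean markup.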

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

