This is the mail archive of the xsl-list@mulberrytech.com mailing list.



RE: HTML section headings to XML document sections


Let's start with a single-level problem to keep things easy:

<h1/><p/><p/><p/><h1/><p/><p/><p/><h1/><p/>

As with all grouping problems, you need an outer loop that iterates once per
group, and an inner loop that processes the elements within the group.

The outer loop here is easy:

<xsl:for-each select="h1">

The inner loop is trickier:

<xsl:variable name="head" select="generate-id(.)"/>
<xsl:for-each select="following-sibling::p[
                       preceding-sibling::h1[1][generate-id()=$head]]">

This selects the following-sibling <p> elements whose nearest preceding
<h1> sibling is the one the outer loop is currently processing.
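
Put together, the two loops might sit in a complete XSLT 1.0 stylesheet
roughly like the sketch below. The <body> wrapper and the output names
<doc>, <sect1>, <title> and <para> are assumptions made for illustration,
and the input elements are assumed to be in no namespace (real XHTML
output from tidy would need a namespace prefix on h1 and p).

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: the flat h1/p sequence sits inside a <body> element -->
  <xsl:template match="body">
    <doc>
      <!-- Outer loop: one iteration per group, i.e. per h1 -->
      <xsl:for-each select="h1">
        <xsl:variable name="head" select="generate-id(.)"/>
        <sect1>
          <title><xsl:value-of select="."/></title>
          <!-- Inner loop: the p siblings whose nearest preceding h1
               is the h1 currently being processed -->
          <xsl:for-each select="following-sibling::p[
                                 preceding-sibling::h1[1][generate-id()=$head]]">
            <para><xsl:value-of select="."/></para>
          </xsl:for-each>
        </sect1>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>
```

Given <body><h1>A</h1><p>a1</p><p>a2</p><h1>B</h1><p>b1</p></body>, this
should produce one <sect1> titled A holding a1 and a2, and a second one
titled B holding b1. Any <p> that comes before the first <h1> is not
selected by either loop and is dropped.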

You can solve this one using recursion as well, but you don't need to.

I'll leave the extension to n dimensions as an exercise for the reader...
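
As a hint towards that exercise, a two-level version could be sketched as
below. It is only one possible approach, under the same assumptions as
before: a <body> wrapper, no namespace on the input, and illustrative
output names <doc>, <sect1>, <sect2>, <title> and <para>. The idea is that
a <p> belongs to whichever heading, h1 or h2, most recently precedes it,
and an <h2> belongs to the <h1> that most recently precedes it.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: the flat h1/h2/p sequence sits inside a <body> element -->
  <xsl:template match="body">
    <doc>
      <xsl:for-each select="h1">
        <xsl:variable name="h1id" select="generate-id(.)"/>
        <sect1>
          <title><xsl:value-of select="."/></title>
          <!-- p elements whose nearest preceding heading is this h1 -->
          <xsl:for-each select="following-sibling::p[
              generate-id(preceding-sibling::*[self::h1 or self::h2][1]) = $h1id]">
            <para><xsl:value-of select="."/></para>
          </xsl:for-each>
          <!-- h2 elements whose nearest preceding h1 is this h1 -->
          <xsl:for-each select="following-sibling::h2[
              generate-id(preceding-sibling::h1[1]) = $h1id]">
            <xsl:variable name="h2id" select="generate-id(.)"/>
            <sect2>
              <title><xsl:value-of select="."/></title>
              <!-- p elements whose nearest preceding heading is this h2 -->
              <xsl:for-each select="following-sibling::p[
                  generate-id(preceding-sibling::*[self::h1 or self::h2][1]) = $h2id]">
                <para><xsl:value-of select="."/></para>
              </xsl:for-each>
            </sect2>
          </xsl:for-each>
        </sect1>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>
```

Each extra heading level adds another nested loop of the same shape, and
every predicate rescans the preceding siblings, so this is likely to get
slow on long documents.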

Alternatively of course there are proprietary techniques, useful if this is
a one-off exercise. The function saxon:leading() was designed for this very
purpose; it allows you to select all following siblings that satisfy a
particular condition, stopping at the first one that doesn't.

Mike Kay
Software AG

> -----Original Message-----
> From: owner-xsl-list@lists.mulberrytech.com
> [mailto:owner-xsl-list@lists.mulberrytech.com]On Behalf Of Michel Goossens
> Sent: 09 August 2001 08:01
> To: xsl-list@lists.mulberrytech.com
> Cc: Michel Goossens
> Subject: [xsl] HTML section headings to XML document sections
>
>
> I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
> with the -asxml option) that I would like to transform into XML (e.g.,
> DocBook). The structure of HTML is however drastically different in
> that standard HTML does not mark up the hierarchical subdivisions of a
> document apart from indicating the start of each level by <h1>, <h2>,
> <h3>, etc. Therefore my problem is to find a general algorithm, probably
> using recursion, to transform an HTML document into a valid XML equivalent,
> in particular indicating its hierarchical structure. For instance, suppose
> I have an HTML source like this:
>
> <html>
> <h1>...</h1>....
> <h2>...</h2>....
> <h2>...</h2>....
> <h3>...</h3>....
> <h1>...</h1>....
> <h2>...</h2>....
> <h3>...</h3>....
> <h3>...</h3>....
> <h2>...</h2>....
> </html>
>
> this should become something like
>
> <html>
> <sect1><title>...</title>
> ....
> <sect2><title>...</title>
> ....
> </sect2>
> <sect2><title>...</title>
> ....
> <sect3><title>...</title>
> ....
> </sect3>
> </sect2>
> </sect1>
> <sect1><title>...</title>
> ....
> <sect2><title>...</title>
> ....
> <sect3><title>...</title>
> ....
> </sect3>
> <sect3><title>...</title>
> ....
> </sect3>
> </sect2>
> <sect2><title>...</title>
> ....
> </sect2>
> </sect1>
> </html>
>
> So the question is how to know each time a <hx> (h1, h2, h3, ...) element
> is encountered what are the "open h" levels less than or equal to that
> of the current element, so that we can "close" them. In particular, before
> exiting the document we should also close the complete hierarchy correctly.
>
> I have read with interest an article by Benoit Marchal mentioned here
> recently: "recurse, not divide, to conquer", where he describes the use of
> recursion for "hierarchising" a flat document, but I cannot really see how
> to apply his approach in the present case without somehow also knowing the
> "state" (hierarchical level) at the given point in the document. Reading
> the discussion of recursion in MK's book or in "Professional XSL" did not
> make me a lot wiser on how to solve this in an elegant way. Therefore, all
> suggestions are very welcome. Thanks in advance. mg
>
> Dr. Michel Goossens              Phone:(+41 22) 767-4902
> CERN, IT Division                Fax:  (+41 22) 767-8630
> CH-1211 Geneva 23, Switzerland   Email: michel.goossens@cern.ch
>
>
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
>

