This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: HTML section headings to XML document sections
- To: xsl-list at lists dot mulberrytech dot com
- Subject: Re: [xsl] HTML section headings to XML document sections
- From: Wendell Piez <wapiez at mulberrytech dot com>
- Date: Thu, 09 Aug 2001 11:55:11 -0400
- References: <3B71AA06.CA195E0D@auckland.ac.nz>
- Reply-To: xsl-list at lists dot mulberrytech dot com
Michel,
The best solutions to this currently (IMHO) are Jeni's (references already
posted). She and I kind of leap-frogged each other developing a solution.
I've called it "levitation", and I'll bet you'll find my contributions in
the list archives if you search for that, though that name for the problem
doesn't seem to have stuck :-). But of course Jeni writes great code *and*
documents it. The solution is to treat the problem as a special case of
grouping, driving it all with keys that associate each node with the node
that indicates its proper place in the hierarchy (generally the head of the
invisible section it's in).
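
To make the grouping idea concrete, here is a minimal sketch in Python
rather than XSLT. It is not Jeni's code or mine from the archives; the names
(heading_level, build_key, emit) and the <para> output element are made up
for illustration, and the input is assumed to be a flat list of (tag, text)
pairs. The dictionary plays the role of the key: it maps each heading to the
nodes that belong directly to it, and the output is then driven entirely by
looking children up in that map.

```python
import re
from collections import defaultdict


def heading_level(tag):
    """Return 1 for 'h1', 2 for 'h2', ...; None for non-heading tags."""
    m = re.fullmatch(r"h([1-6])", tag)
    return int(m.group(1)) if m else None


def build_key(nodes):
    """Map each owner index (None = document root) to its child indexes.

    A non-heading node belongs to the nearest preceding heading.
    A heading belongs to the nearest preceding heading of higher rank
    (that is, with a smaller number).
    """
    children = defaultdict(list)
    open_heads = []  # indexes of headings whose sections are still open
    for i, (tag, _) in enumerate(nodes):
        level = heading_level(tag)
        if level is not None:
            # A new heading closes every open section at its level or deeper.
            while open_heads and heading_level(nodes[open_heads[-1]][0]) >= level:
                open_heads.pop()
        owner = open_heads[-1] if open_heads else None
        children[owner].append(i)
        if level is not None:
            open_heads.append(i)
    return children


def emit(nodes, children, owner=None, depth=0):
    """Return output lines for everything that belongs to `owner`."""
    indent = "  " * depth
    lines = []
    for i in children.get(owner, []):
        tag, text = nodes[i]
        if heading_level(tag) is not None:
            # Section number follows nesting depth, not the hN number.
            lines.append(f"{indent}<sect{depth + 1}><title>{text}</title>")
            lines.extend(emit(nodes, children, i, depth + 1))
            lines.append(f"{indent}</sect{depth + 1}>")
        else:
            # No escaping of markup characters is attempted in this sketch.
            lines.append(f"{indent}<para>{text}</para>")
    return lines


nodes = [("h1", "A"), ("p", "a1"), ("h2", "B"), ("p", "b1"), ("h1", "C")]
key = build_key(nodes)
print("\n".join(emit(nodes, key)))
```

Note that the sketch numbers sections by nesting depth, so it quietly takes
one side of the irregular-input question discussed next.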
But I think you'll find you'll have problems since your HTML coming in is
not likely to be very regular. For example, if (when) you get something like...
h1
p
p
h3
p
p
p
you need to decide whether to interpolate a missing level (one that would
be headed with an h2 but just happens to have no header; these things do
happen in structured text), or whether to promote the h3 and its following
p elements to the second level. Unfortunately, which of these ways is
"correct" will depend on the documents: it may vary, and from the purist's
point of view might demand an interpretation on a case-by-case basis.
Not good.
So it will come down to (a) how good (bad) your data actually is, and (b)
how brutal you can afford to be.
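
The two choices for an h1 followed directly by an h3 can be sketched like
this, again in Python and only as an illustration. The function names
promote and interpolate are hypothetical, and the input is assumed to be
just the sequence of heading levels in document order.

```python
def promote(levels):
    """Renumber headings so that no level is ever skipped.

    Each heading gets the depth it actually nests at, so an h3 that
    directly follows an h1 is treated as a second-level heading.
    """
    out = []
    stack = []  # original levels of the sections currently open
    for lvl in levels:
        while stack and stack[-1] >= lvl:
            stack.pop()
        stack.append(lvl)
        out.append(len(stack))
    return out


def interpolate(levels):
    """Keep the original levels and add untitled sections for skipped ones.

    Returns (level, has_heading) pairs; has_heading is False for a
    section that was invented to fill a gap.
    """
    out = []
    stack = []  # levels of the sections currently open
    for lvl in levels:
        while stack and stack[-1] >= lvl:
            stack.pop()
        # Open an untitled section for each level that was skipped.
        while (stack[-1] if stack else 0) < lvl - 1:
            missing = (stack[-1] if stack else 0) + 1
            stack.append(missing)
            out.append((missing, False))
        stack.append(lvl)
        out.append((lvl, True))
    return out


print(promote([1, 3]))      # [1, 2]
print(interpolate([1, 3]))  # [(1, True), (2, False), (3, True)]
```

Neither function can tell you which reading is right for a given document;
they only make the consequences of each choice explicit.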
Enjoy,
Wendell
At 03:01 AM 8/9/01, you wrote:
>I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
>with the -asxml option) that I would like to transform into XML (e.g.,
>DocBook). The structure of HTML is however drastically different in
>that standard HTML does not mark up the hierarchical subdivisions of a
>document apart from indicating the start of each level by <h1>, <h2>,
><h3>, etc. Therefore my problem is to find a general algorithm, probably
>using recursion, to transform an HTML document into a valid XML equivalent,
>in particular indicating its hierarchical structure. For instance, suppose
>I have an HTML source like this:
>
><html>
><h1>...</h1>....
><h2>...</h2>....
><h2>...</h2>....
><h3>...</h3>....
><h1>...</h1>....
><h2>...</h2>....
><h3>...</h3>....
><h3>...</h3>....
><h2>...</h2>....
></html>
>
>this should become something like
>
><html>
><sect1><title>...</title>
>....
><sect2><title>...</title>
>....
></sect2>
><sect2><title>...</title>
>....
><sect3><title>...</title>
>....
></sect3>
></sect2>
></sect1>
><sect1><title>...</title>
>....
><sect2><title>...</title>
>....
><sect3><title>...</title>
>....
></sect3>
><sect3><title>...</title>
>....
></sect3>
></sect2>
><sect2><title>...</title>
>....
></sect2>
></sect1>
></html>
>
>So the question is how to know each time a <hx> (h1, h2, h3, ...) element
>is encountered what are the "open h" levels less than or equal to that
>of the current element, so that we can "close" them. In particular, before
>exiting the document we should also close the complete hierarchy correctly.
>
>I have read with interest an article by Benoit Marchal mentioned here
>recently: "recurse, not divide, to conquer", where he describes the use of
>recursion for "hierarchising" a flat document, but I cannot really see how
>to apply his approach in the present case without somehow also knowing the
>"state" (hierarchical level) at the given point in the document. Reading
>the discussion of recursion in MK's book or in "Professional XSL" did not
>make me a lot wiser on how to solve this in an elegant way. Therefore, all
>suggestions are very welcome. Thanks in advance. mg
>
>Dr. Michel Goossens Phone:(+41 22) 767-4902
>CERN, IT Division Fax: (+41 22) 767-8630
>CH-1211 Geneva 23, Switzerland Email: michel.goossens@cern.ch
>
>
> XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
======================================================================
Wendell Piez mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================