This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: HTML section headings to XML document sections


Michel,

The best solutions to this currently (IMHO) are Jeni's (references already
posted). She and I more or less leap-frogged each other in developing a
solution. (I've called it "levitation", and I'll bet you'll find my
contributions in the list archives if you search for that, though that
name for the problem doesn't seem to have stuck :-) But of course Jeni
writes great code *and* documents it. The solution is to treat the problem
as a special case of grouping, driving it all with keys that associate each
node with the node that indicates its proper place in the hierarchy
(generally the heading of the invisible section it's in).
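
To make the keying idea concrete, here is a minimal sketch of the logic.
It is in Python rather than XSLT, and it is not Jeni's code or anything
from the archives. It assumes the document body has already been flattened
into (tag, text) pairs; owner_of, build_keys and render are names invented
for the illustration.

```python
import re
from collections import defaultdict
from xml.sax.saxutils import escape


def level(tag):
    """Return 1..6 for 'h1'..'h6', or None for any other tag."""
    m = re.fullmatch(r"h([1-6])", tag)
    return int(m.group(1)) if m else None


def owner_of(nodes, i):
    """Index of the heading that owns node i, or None at the top level.

    A non-heading belongs to the nearest preceding heading. A heading
    belongs to the nearest preceding heading with a smaller level number.
    """
    lv = level(nodes[i][0])
    for j in range(i - 1, -1, -1):
        prev = level(nodes[j][0])
        if prev is not None and (lv is None or prev < lv):
            return j
    return None


def build_keys(nodes):
    """Group node indices by owner; this stands in for the keys."""
    keys = defaultdict(list)
    for i in range(len(nodes)):
        keys[owner_of(nodes, i)].append(i)
    return keys


def render(nodes, keys, owner=None, depth=0):
    """Emit nested sectN elements by recursing from each heading."""
    pad = "  " * depth
    lines = []
    for i in keys.get(owner, []):
        tag, text = nodes[i]
        lv = level(tag)
        if lv is None:
            lines.append(f"{pad}<{tag}>{escape(text)}</{tag}>")
        else:
            lines.append(f"{pad}<sect{lv}><title>{escape(text)}</title>")
            lines.extend(render(nodes, keys, i, depth + 1))
            lines.append(f"{pad}</sect{lv}>")
    return lines
```

Calling render(nodes, build_keys(nodes)) returns the output lines. Note
that the nesting comes entirely from the lookup table: nothing has to
remember which sections are "open" while walking the document.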

But I think you'll run into problems, since your incoming HTML is not
likely to be very regular. For example, if (when) you get something like...

h1
  p
  p
   h3
    p
    p
    p

you need to decide whether to interpolate the missing level (a section that
would be headed with an h2 but just happens to have no header; these things
do happen in structured text), or to promote the h3 and its following p
elements to the second level. Unfortunately, which of these is "correct"
depends on the documents: it may vary, and from the purist's point of view
it might demand interpretation on a case-by-case basis. Not good.
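
The two choices can be sketched as a normalization pass run before any
grouping. This is only an illustration in Python, assuming the input is a
flat list of (tag, text) pairs; fix_skipped_levels and its policy names are
made up here, and an empty title stands in for the missing header.

```python
import re


def fix_skipped_levels(nodes, policy):
    """Return a copy of nodes in which no heading skips a level.

    policy="interpolate": insert an untitled heading for each missing level.
    policy="promote": renumber the heading to one level below its parent.
    """
    if policy not in ("interpolate", "promote"):
        raise ValueError(f"unknown policy: {policy}")
    out = []
    stack = []  # (original level, emitted level) for each open heading
    for tag, text in nodes:
        m = re.fullmatch(r"h([1-6])", tag)
        if m is None:
            out.append((tag, text))  # non-headings pass through unchanged
            continue
        lv = int(m.group(1))
        # Close every open heading at the same or a deeper original level.
        while stack and stack[-1][0] >= lv:
            stack.pop()
        parent = stack[-1][1] if stack else 0
        if policy == "interpolate":
            # Fill each gap between the parent and this heading.
            for missing in range(parent + 1, lv):
                out.append((f"h{missing}", ""))
                stack.append((missing, missing))
            emitted = lv
        else:
            emitted = parent + 1
        stack.append((lv, emitted))
        out.append((f"h{emitted}", text))
    return out
```

With "promote", an h3 directly under an h1 comes out as an h2. With
"interpolate", an empty h2 is inserted ahead of it. Neither is right in
general, which is the point: the policy has to be chosen per document set.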

So it will come down to (a) how good (bad) your data actually is, and (b) 
how brutal you can afford to be.

Enjoy,
Wendell


At 03:01 AM 8/9/01, you wrote:
>I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
>with the -asxml option) that I would like to transform into XML (e.g.,
>DocBook). The structure of HTML is however drastically different in
>that standard HTML does not mark up the hierarchical subdivisions of a
>document apart from indicating the start of each level by <h1>, <h2>,
><h3>, etc. Therefore my problem is to find a general algorithm, probably
>using recursion, to transform an HTML document into a valid XML equivalent,
>in particular indicating its hierarchical structure. For instance, suppose
>I have an HTML source like this:
>
><html>
><h1>...</h1>....
><h2>...</h2>....
><h2>...</h2>....
><h3>...</h3>....
><h1>...</h1>....
><h2>...</h2>....
><h3>...</h3>....
><h3>...</h3>....
><h2>...</h2>....
></html>
>
>this should become something like
>
><html>
><sect1><title>...</title>
>....
><sect2><title>...</title>
>....
></sect2>
><sect2><title>...</title>
>....
><sect3><title>...</title>
>....
></sect3>
></sect2>
></sect1>
><sect1><title>...</title>
>....
><sect2><title>...</title>
>....
><sect3><title>...</title>
>....
></sect3>
><sect3><title>...</title>
>....
></sect3>
></sect2>
><sect2><title>...</title>
>....
></sect2>
></sect1>
></html>
>
>So the question is how to know each time a <hx> (h1, h2, h3, ...) element
>is encountered what are the "open h" levels less than or equal to that
>of the current element, so that we can "close" them. In particular, before
>exiting the document we should also close the complete hierarchy correctly.
>
>I have read with interest an article by Benoit Marchal mentioned here
>recently: "recurse, not divide, to conquer", where he describes the use of
>recursion for "hierarchising" a flat document, but I cannot really see how
>to apply his approach in the present case without somehow also knowing the
>"state" (hierarchical level) at the given point in the document. Reading
>the discussion of recursion in MK's book or in "Professional XSL" did not
>make me a lot wiser on how to solve this in an elegant way. Therefore, all
>suggestions are very welcome. Thanks in advance. mg
>
>Dr. Michel Goossens              Phone:(+41 22) 767-4902
>CERN, IT Division                Fax:  (+41 22) 767-8630
>CH-1211 Geneva 23, Switzerland   Email: michel.goossens@cern.ch
>
>
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

