This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Xsplit, multilingual web site, and extracting text from HTML





Hi,
First of all, I'd like to thank everyone for their help with my multilingual
web site question and my questions about XSplit. I got the most useful
responses from this group.
I finally managed to download XSplit and have been playing around with it.
I discovered, to my disappointment, that it doesn't automatically create the
XML files for you. I took an HTML page and ran the "Split command", but I
simply got an XSL file with all the HTML in it; the XML file it "generated"
was empty.
I read the documentation, which explained that I had to tag the content in
the HTML page first. After I did this, XSplit correctly generated the XSL file
and the XML file.
For example, in an HTML file containing
<p>Hello World</p>
I had to extract the "Hello World" string from the HTML and replace
it with a label prefixed by "psx-":
<p>psx-mytext</p>
and then add the "Hello World" string to the generated XML file, which contains
<mytext></mytext>

I was hoping XSplit would generate the XML for me by simply using the HTML
tag names and numbering them wherever it found content. For example, I was
hoping that
<p>Hello World</p>
in the HTML would convert to
<p1>Hello World</p1>
in the XML file. That way I wouldn't have to tag the data unless I really
wanted to.
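
As a rough sketch of that numbering scheme (this is not something XSplit
does; the function name number_labels and the input shape are hypothetical),
each tag name could get its own counter:

```python
from collections import Counter


def number_labels(tagged_text):
    """Assign labels such as p1, p2 by counting each tag name separately.

    tagged_text is a list of (tag_name, text) pairs in document order.
    Returns a list of (label, text) pairs in the same order.
    """
    counts = Counter()
    labelled = []
    for tag, text in tagged_text:
        counts[tag] += 1
        labelled.append((f"{tag}{counts[tag]}", text))
    return labelled


print(number_labels([("p", "Hello World"), ("p", "Goodbye")]))
```

One wrinkle with plain concatenation: a first <h1> would be labelled "h11",
so a separator between the tag name and the number may be worth adding.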

Is what I'd like to do possible in any way with XSplit? Am I missing
something?
Are there tools out there that would extract all displayable text from HTML
files, replace it with labels, and then put the extracted text in a separate
file alongside the labels? Basically, I'm looking for a way to automate this,
since we have thousands of HTML files. I think an XML and XSL solution for a
multilingual site is the way to go, but I'm having a hard time justifying the
initial cost of converting all our HTML files. Since the conversion could be
automated, I'm hoping that there are tools out there that could help us. I'd
write a tool myself, but I'd have to create an HTML parser that knew where to
find all the "displayable text" in an HTML page, which seems tough.
I searched the Web for HTML parsers that extract text but didn't find
anything similar to what I described above (that is, one that would replace
the text with labels, etc.).
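
For illustration only, here is a minimal sketch of such a tool built on
Python's standard html.parser module rather than a hand-written parser. It
assumes that "displayable text" means any non-blank text outside <script>
and <style>. The class name TextExtractor, the helper to_xml, and the
<strings> root element are all made up for this example, and the output is
not guaranteed to match what XSplit expects.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape as xml_escape


class TextExtractor(HTMLParser):
    """Copy HTML through, swapping visible text for psx- labels."""

    SKIP = {"script", "style"}  # contents are copied, never labelled
    VOID = {"br", "hr", "img", "input", "meta", "link"}  # no end tag

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.template = []  # pieces of the rewritten HTML
        self.strings = []   # (label, text) pairs, in document order
        self.stack = []     # currently open tags
        self.counts = {}    # per-tag counters used to build labels

    def handle_starttag(self, tag, attrs):
        self.template.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it through unchanged.
        self.template.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.template.append(f"</{tag}>")
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_comment(self, data):
        self.template.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.template.append(f"<!{decl}>")

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else "text"
        stripped = data.strip()
        if parent in self.SKIP or not stripped:
            self.template.append(data)  # not displayable: keep as-is
            return
        self.counts[parent] = self.counts.get(parent, 0) + 1
        label = f"{parent}{self.counts[parent]}"
        self.strings.append((label, stripped))
        # Keep surrounding whitespace, replace only the text itself.
        self.template.append(data.replace(stripped, "psx-" + label, 1))


def to_xml(strings, root="strings"):
    """Render the (label, text) pairs as a flat XML document."""
    lines = [f"<{root}>"]
    for label, text in strings:
        lines.append(f"  <{label}>{xml_escape(text)}</{label}>")
    lines.append(f"</{root}>")
    return "\n".join(lines)


parser = TextExtractor()
parser.feed("<body><p>Hello World</p><p>Bye<br>now</p></body>")
parser.close()
print("".join(parser.template))
print(to_xml(parser.strings))
```

With this input the template keeps the markup and carries psx-p1, psx-p2 and
psx-p3 in place of the text, while the XML holds the three strings under
those labels. The sketch ignores text held in attributes such as alt and
title, and badly malformed HTML may need a more forgiving parser.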
Any help would be greatly appreciated.
Thanks :-)
-Sher



 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

