This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: National Language Collating Sequences and Index Generation
- From: Joerg Pietschmann <joerg dot pietschmann at zkb dot ch>
- To: XSL List <xsl-list at lists dot mulberrytech dot com>
- Date: Fri, 08 Feb 2002 10:39:03 +0100
- Subject: Re: [xsl] National Language Collating Sequences and Index Generation
- Organization: ZKB
- Reply-to: xsl-list at lists dot mulberrytech dot com
"W. Eliot Kimber" <eliot@isogen.com> wrote:
> I have to generate back-of-the-book indexes for many national languages,
> including Arabic, Hebrew, Thai, Simplified Chinese, Traditional Chinese,
> Korean, and Japanese. I've successfully adapted the Docbook index
> generation code to produce the basic index, but now I'm faced with the
> challenge of both doing correct sorting for these languages and
> generating the appropriate index groups.
That's an interesting topic, and a real problem: it is widely
acknowledged but, in general, not really solved.
In XSLT 1.0, xsl:sort typically sorts strings lexically by Unicode code
point number, IIRC (what a processor does for a given language is up to
the implementation). Localized sorting by a single character should also
be relatively easy to implement if you can get hold of the collating
sequence:
<xsl:stylesheet ...
    xmlns:coll="my.collating.sequence">
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="B" number="2"/>
    ...
  </coll:sequence>
  <!-- select the char elements, so that [@char=...] filters them -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>
  ...
  <xsl:for-each select="$items">
    <!-- data-type="number": compare the keys as numbers, not as text -->
    <xsl:sort select="$collseq[@char=substring(current()/name,1,1)]/@number"
              data-type="number"/>
You can try to add
    <xsl:sort select="$collseq[@char=substring(current()/name,2,1)]/@number"
              data-type="number"/>
and so on for more complete lexical sorting.
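For reference, here is how the fragments above might fit together as one
complete XSLT 1.0 stylesheet. This is only a sketch under assumptions: the
three-letter table, the items/item/name input structure and the text output
are made up for illustration, and document('') only works if the processor
can re-read the stylesheet from its base URI.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:coll="my.collating.sequence"
    exclude-result-prefixes="coll">

  <xsl:output method="text"/>

  <!-- Hypothetical, deliberately tiny collating table. A top-level
       element must be in a non-XSLT namespace; its children need not be. -->
  <coll:sequence>
    <char char="A" number="1"/>
    <char char="B" number="2"/>
    <char char="C" number="3"/>
  </coll:sequence>

  <!-- The char elements of the table, read back from the stylesheet itself -->
  <xsl:variable name="collseq" select="document('')/*/coll:sequence/char"/>

  <!-- Assumed input: <items><item><name>...</name></item>...</items> -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- Primary key: table position of the first character -->
      <xsl:sort select="$collseq[@char=substring(current()/name,1,1)]/@number"
                data-type="number"/>
      <!-- Secondary key: table position of the second character -->
      <xsl:sort select="$collseq[@char=substring(current()/name,2,1)]/@number"
                data-type="number"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Given items named BA, AB and AA in that order, this should print AA, AB, BA,
one per line. A character that is missing from the table (or a name that is
too short) produces an empty node-set and hence a NaN key, and where NaN
keys end up in the order is something I would check for each processor.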
It may be useful that, with data-type="number" on xsl:sort, you can use
fractional numbers as sorting keys:
<char char="A" number="1"/>
<char char="Ä" number="1.1"/> <!-- sorry for the entity :-) -->
<char char="a" number="1.5"/>
The caveats are that you had better have a complete collating sequence,
and that you shouldn't expect great performance, especially if you
add a lot of sort clauses. There is also the possibility that you run
afoul of unexpected character normalisation issues: users could expect
that a precomposed ä (U+00E4) and an a followed by a combining diaeresis
(U+0308) are interchangeable (at least I think so).
In XSLT/XPath 2.0, you can have named collating sequences, but you
shouldn't expect that the ones you need are provided by the runtime
system :-((((
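A minimal sketch of what that could look like, assuming the XSLT 2.0
Recommendation as eventually published. The items/item/name input structure
and the "collation" parameter name are made up for illustration. The default
value is the Unicode codepoint collation, which is the only collation URI
every conforming processor has to support; a language-specific URI is
processor-dependent and would have to come from the processor's
documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- ASSUMPTION: override this with a collation URI that your processor
       actually provides for the language in question. -->
  <xsl:param name="collation"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Assumed input: <items><item><name>...</name></item>...</items> -->
  <xsl:template match="/items">
    <xsl:for-each select="item">
      <!-- normalize-unicode() maps precomposed and decomposed spellings
           to the same form (NFC) before the keys are compared -->
      <xsl:sort select="normalize-unicode(name, 'NFC')"
                collation="{$collation}"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With the default parameter value this sorts by code point only, so it shows
the mechanism and not a localized order.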
HTH
J.Pietschmann
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list