This is the mail archive of the
docbook-apps@lists.oasis-open.org
mailing list .
Re: [docbook-apps] Permuted Index, may be OT
- From: "John W. Shipman" <john at nmt dot edu>
- To: Riaan Bredenkamp <riaanbre at webmail dot co dot za>
- Cc: docbook-apps at lists dot oasis-open dot org
- Date: Fri, 26 Mar 2004 09:01:49 -0700 (MST)
- Subject: Re: [docbook-apps] Permuted Index, may be OT
I have a working permuted index, but it's not related to
DocBook. Documentation is at:
http://www.nmt.edu/tcc/projects/webstyler/
Follow the link for `StylIndex' near the bottom of the page.
An example of an index built with this system is at:
http://www.nmt.edu/tcc/help/
Click on the `Index' link on the top line of the page.
I have attached Python source code for the module that
does permuted indexing. Feel free to borrow or adapt
it.
John Shipman (john@nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 128, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber
================================================================
"""kwic.py: Module for Key Word In Context (KWIC) indices.
$Revision: 1.4 $ $Date: 2003/08/06 20:28:39 $
KWIC indexing is a technique for finding occurrences of keywords
in a set of strings. It is a venerable technique dating back to
the 1960s. Definitions:
KEYWORD: A contiguous string of keyword characters that
is not in the exclusion list. Typically, keyword
characters include letters, digits, and possibly
pseudo-letters such as "_" and "-".
EXCLUSION LIST: A set of words that should be omitted
from the index, such as `a', `and', and `the'.
EXCLUSION FILE: The file containing the exclusion list.
The format is free: each contiguous clump of
keyword characters is considered an excluded word.
The retrieved strings may be presented in either of two forms:
- UNPERMUTED FORM: The string is divided into the PREFIX
(everything up to the keyword), the keyword, and the
SUFFIX (everything after the keyword. Suppose the string
is `Driving Miss Daisy'. Assuming none of those words are
in the exclusion list, this string would appear in three
index entries:
[ "Driving Miss ", "Daisy", "" ] # Keyword "Daisy"
[ "", "Driving", " Miss Daisy" ] # Keyword "Driving"
[ "Driving ", "Miss", " Daisy" ] # Keyword "Miss"
- PERMUTED FORM: All strings are padded to the maximum length
of any string in the index, plus a GUTTER consisting of
a few spaces, and then rotated so that the keyword always
starts at the same point, the PERMUTE POINT.
Here is an example of how the above three index entries
might appear in a permuted form if the longest string in
the index, plus the gutter, is 40 characters, and the
character "=" is inserted before the beginning of each
string to show its original beginning. Also suppose
that the `permute point' is 0.25, meaning that the
keywords line up on column 10. Further suppose that
the string `Harold and Maude' has also been indexed,
and the word `and' is in the exclusion list. Here is
how this index would look in permuted form:
| +-- Permute point |
| v |
|0 1 2 3 4|
|01234567890123456789012345678901234567890|
|-----------------------------------------|
|ving Miss Daisy =Dri|
| =Driving Miss Daisy |
| =Harold and Maude |
|arold and Maude =H|
| =Driving Miss Daisy |
Exports:
class KwicIndex: Represents a set of strings.
KwicIndex(keyCset=None, exclusionFileName=None, permutePoint=None,
gutterSize=None, startMark=None, endMark=None,
keyPrefix=None, keySuffix=None):
[ (keyCset is a Cset defining the keyword characters, defaulting
to DEFAULT_KEY_CSET) and
(exclusionFileName is the name of an exclusion file, defaulting
to no exclusions) and
(permutePoint is the permute point in [0.0,1.0), defaulting
to DEFAULT_PERMUTE_POINT) and
(gutterSize is the gutter size as a positive integer,
defaulting to DEFAULT_GUTTER_SIZE) and
(startMark is a string to be inserted before the index entry,
defaulting to DEFAULT_START_MARK) and
(endMark is a string to be inserted after the index entry,
defaulting to "") and
(keyPrefix is a string to be inserted before the keyword,
defaulting to "") and
(keySuffix is a string to be inserted after the keyword,
defaulting to "") ->
if exclusionFileName is not None and does not name a
readable file ->
raise IOError
else ->
return a new, empty KwicIndex object with those values ]
.keyCset: [ as passed to constructor, read-only ]
.exclusionFileName: [ as passed to constructor, R-O ]
.permutePoint: [ as passed to constructor with defaulting, R-O ]
.gutterSize: [ as passed to constructor with defaulting, R-O ]
.startMark: [ as passed to constructor with defaulting, R-O ]
.endMark: [ as passed to constructor with defaulting, R-O ]
.keyPrefix: [ as passed to constructor with defaulting, R-O ]
.keySuffix: [ as passed to constructor with defaulting, R-O ]
.add ( s, value=None ):
[ if s is a nonempty string ->
self := self with s added to its set of strings and
value associated with it ]
.genRefs ( startKey=None, stopKey=None ):
[ (startKey is a string or None) and
(stopKey is a string or None) ->
generate a sequence of KwicRef objects in ascending order
by (keyword+prefix+suffix), ignoring case; if startKey
is provided, all keywords < startKey are omitted; if
stopKey is provided, all keywords >= stopKey are omitted ]
.maxLength():
[ returns the length of the longest string in self ]
.permute(ref):
[ ref is a KwicRef ->
return ref permuted according to self's parameters ]
.permuteLink(ref, linkifier):
[ (ref is a KwicRef) and
(linkifier is a procedure with calling sequence
linkifier(ref, linkText)
and intended function
[ (ref is a KwicRef) and (linkText is a string) ->
return a string containing an HTML hyperlink
whose link text is (linkText) and whose href
attribute is derived from ref ]
then ->
return a string consisting of the ref, permuted
according to self's parameters, and with
(self.keyPrefix+keyword+self.keySuffix) turned into
a link by linkifier (or as much of that string as
fits between the permute point and the end of the
permuted line) ]
class KwicRef: Represents an occurrence of a keyword in a string.
KwicRef(s, startPos, endPos, value):
[ (s is a nonempty string) and
(startPos and endPos define the position of the keyword
as a nonempty slice of s in the usual Python convention
of s[startPos:endPos]) and
(value is any object) ->
return a new KwicRef object with those values
.s: [ as passed to constructor, read-only ]
.startPos: [ as passed to constructor, read-only ]
.endPos: [ as passed to constructor, read-only ]
.value: [ as passed to constructor, read-only ]
.show():
[ return a triple (prefix, keyword, suffix) where each
element is a string, prefix may be empty, and suffix may
be empty ]
.prefix(): [ return the prefix from self ]
.keyword(): [ return the keyword from self ]
.suffix(): [ return the suffix from self ]
.__str__(self): [ return self.s ]
.__cmp__(self,other):
[ other is a KwicRef ->
if self.value is lexically before other.value ->
return -1
else if self.value is lexically after other.value ->
return 1
else -> return 0 ]
class ExclusionList: Represents the exclusion file.
ExclusionList ( exclusionFileName=None, keyCset=None ):
[ ( exclusionFileName names the exclusion file, or None
to start with an empty exclusion set ) and
( keyCset is a Cset enumerating the keyword characters,
defaulting to DEFAULT_KEY_CSET ) ->
if exclusionFileName names a readable file ->
return a new ExclusionList object containing all
the unique keywords in that file
else -> raise IOError ]
.__contains__(self, x): # `x in self' Python operator
[ x is a string ->
if self contains x, case-insensitive ->
return 1
else -> return 0 ]
"""
#================================================================
# IMPORTS
#----------------------------------------------------------------
from __future__ import generators # Allow 2.2 generators
import string # Standard Python string library
#--
# Library routines from /u/john/tcc/python/lib
#--
from set import * # String set object
from cset import * # Character set object
from log import * # Error logging object
from scan import * # Stream scanning object
from skiplist import * # SkipList ordered container class
# - - - f i n d K e y w o r d s - - -
def findKeywords ( s, cset ):
"""Generates slice indices of all contiguous strings in cset.
[ (s is a string) and
(cset is a Cset object defining all the keyword characters) ->
generate (startx, endx) tuples such that each slice
s[startx:endx] is a contiguous clump of keyword characters in s ]
"""
#-- 1 --
startx = None
#-- 2 --
# startx := index of the last keyword character not
# preceded by a keyword character
# generate (startx, endx) slice indices defining all keywords
# that have non-keyword characters following ]
for x in range ( len ( s ) ):
#-- 2 body --
# [ if (startx is None) and (s[x] is a keyword character) ->
# startx := x
# if (startx is not None) and (s[x] is a keyword character) ->
# I
# if (startx is None) and (s[x] is not a keyword character) ->
# I
# if (startx is not None) and (s[x] is not a keyword character) ->
# yield (startx,x)
# startx := None ]
if cset.has(s[x]): # s[x] is a keyword character
if startx is None:
startx = x
else: # s[x] is not a keyword character
if startx is not None:
yield (startx, x)
startx = None
#-- 3 --
if startx:
yield ( startx, len(s) )
# - - - d e f a u l t e r - - -
def defaulter ( value, defaultValue ):
"""Implements the "?:" operator
[ if value is None ->
return defaultValue
else -> return value ]
"""
if value is None:
return defaultValue
else:
return value
# - - - - - c l a s s K w i c I n d e x - - - - -
class KwicIndex:
"""Object to represent a complete index.
State/Invariants:
.__breakPoint: [ 1.0 - self.permutePoint ]
.__exclusionList:
[ an ExclusionList object containing the set
of words from self.exclusionFile, if any ]
.__keyList:
[ a SkipList object containing all references as KwicRef objects ]
.__maxLength:
[ if self.__keyList is empty -> 0
else -> length of the longest string in self.__keyList ]
"""
DEFAULT_KEY_CSET = Cset ( string.letters + string.digits + "_" )
DEFAULT_PERMUTE_POINT = 0.3 # Default for permutePoint
DEFAULT_GUTTER_SIZE = 2 # Default for gutterSize
DEFAULT_START_MARK = "=" # Default for startMark
# - - - K w i c I n d e x . _ _ i n i t _ _ - - -
def __init__ ( self, keyCset=None, exclusionFileName=None,
permutePoint=None, gutterSize=None,
startMark=None, endMark=None,
keyPrefix=None, keySuffix=None):
"Constructor for KwicIndex"
#-- 1 --
# [ self := self with all parameter invariants in place ]
self.keyCset = defaulter ( keyCset,
self.DEFAULT_KEY_CSET )
self.exclusionFileName = exclusionFileName
self.permutePoint = defaulter ( permutePoint,
self.DEFAULT_PERMUTE_POINT )
self.gutterSize = defaulter ( gutterSize,
self.DEFAULT_GUTTER_SIZE )
self.startMark = defaulter ( startMark,
self.DEFAULT_START_MARK )
self.endMark = defaulter ( endMark, "" )
self.keyPrefix = defaulter ( keyPrefix, "" )
self.keySuffix = defaulter ( keySuffix, "" )
#-- 2 --
# [ self.__breakPoint := as invariant
# self.__keyList := an empty SkipList object that allows
# duplicates
# self.__maxLength := 0 ]
self.__breakPoint = 1.0 - self.permutePoint
self.__keyList = SkipList ( allowDups=1 )
self.__maxLength = 0
#-- 3 --
# [ if self.exclusionFileName is None ->
# self.__exclusionList := a new, empty ExclusionList
# else if self.exclusionFileName names a readable file ->
# self.__exclusionList := a new ExclusionList object
# containing all keywords in that file
# else ->
# raise IOError ]
self.__exclusionList = ExclusionList ( exclusionFileName,
self.keyCset )
# - - - K w i c I n d e x . a d d - - -
def add ( self, s, value ):
"Add string s to self"
#-- 1 --
# [ self.__keyList +:= KwicRef objects for all occurrences
# of keywords in s ]
#-- 1 head --
# [ for every keyword character that is not preceded by a
# keyword character ->
# startx := index of that keyword character
# endx := index of the first character after startx
# that is not a keyword character
# <loop body> ]
for startx, endx in findKeywords(s, self.keyCset):
#-- 1 body --
# [ self.__keyList +:= a KwicRef object for string s
# and slice [startx:endx] for value=value ]
self.__addWord ( s, startx, endx, value )
#-- 2 --
# [ self.__maxLength := as invariant ]
self.__maxLength = max ( self.__maxLength, len(s) )
# - - - K w i c I n d e x . _ _ a d d W o r d - - -
def __addWord ( self, s, startx, endx, value ):
"""Add one entry in self, for one occurrence of a keyword.
[ (s is a string) and
(0 <= startx <= endx < len(s)) ->
if uppercased s[startx:endx] is in self.__exclusionList ->
I
else ->
self.__keyList +:= a KwicRef object for string s
and slice [startx:endx] ] and
value=value ]
"""
#-- 1 --
# [ keyword := uppercased s[startx:endx] ]
keyword = s[startx:endx].upper()
#-- 2 --
# [ if keyword is in self.__exclusionList ->
# return
# else -> I ]
if keyword in self.__exclusionList:
return
#-- 3 --
# [ self.__keyList +:= a KwicRef object for string s and
# slice [start:endx] ]
ref = KwicRef ( s, startx, endx, value )
self.__keyList.insert ( ref )
# - - - K w i c I n d e x . g e n R e f s - - -
def genRefs ( self, startKey=None, stopKey=None ):
"Generate all the references in self, in index order."
#-- 1 --
# [ if self.__keyList is empty ->
# ref := None
# else ->
# ref := first element of self.__keyList ]
ref = self.__keyList.first()
#-- 2 --
# [ ref := None
# self.__keyList := self.__keyList advanced to end
# yield all elements of (ref + self.__keyList) that
# are in the range [startKey,stopKey), case-insensitive, in order ]
while ref is not None:
#-- 2 loop --
# [ if ref is in the range [startKey,stopKey) ->
# yield ref
# else -> I
# In any case ->
# ref, self__keyList := first(self.__keyList),
# last(self.__keyList) ]
#-- 2.1 --
# [ if ref is in the range [startKey,stopKey) ->
# yield ref
# else -> I ]
key = ref.keyword().upper()
if ( ( ( startKey is None ) or ( startKey <= key ) ) and
( ( stopKey is None ) or ( key < stopKey ) ) ):
yield ref
#-- 2.2 --
# [ if self.__keyList is at end ->
# ref := None
# else ->
# ref := next element of self.__keyList
# self.__keyList := self.__keyList advanced one ]
ref = self.__keyList.next()
# - - - K w i c I n d e x . m a x L e n g t h - - -
def maxLength ( self ):
"Return the maximum length of self's strings."
return self.__maxLength
# - - - K w i c I n d e x . p e r m u t e - - -
def permute ( self, ref ):
"Produce the permuted form of a reference."
#-- 1 --
# [ padLen := self.__maxLength + self.gutterSize -
# (length of ref) ]
padLen = ( self.__maxLength + self.gutterSize -
len ( ref.s ) )
#-- 2 --
# [ text := self.keyPrefix + ref's keyword + self.keySuffix +
# ref's suffix + self.endMark + (padLen spaces) +
# self.startMark + ref's prefix ]
text = "".join ( [ self.keyPrefix, ref.keyword(), self.keySuffix,
ref.suffix(), self.endMark, " "*padLen,
self.startMark, ref.prefix() ] )
#-- 3 --
# [ return text, broken into two pieces at a position dictated
# by self.__breakPoint, and those pieces swapped ]
breakPos = 1 + int ( self.__breakPoint * len(text) )
return text[breakPos:] + text[:breakPos]
# - - - K w i c I n d e x . p e r m u t e L i n k - - -
def permuteLink ( self, ref, linkifier ):
"Make self into a hyperlink."
#-- 1 --
# [ padLen := self.__maxLength + self.gutterSize -
# (length of ref.s) ]
padLen = self.__maxLength + self.gutterSize - len(ref.s)
#-- 2 --
# [ head := ref's keyword decorated with keyword prefix & suffix
# tail := ref's suffix followed by ref's keyword, decorated
# with start and mark, with padLen spaces inserted
# at the wraparound point ]
head = "".join ( [ self.keyPrefix, ref.keyword(), self.keySuffix ] )
tail = "".join ( [ ref.suffix(), self.endMark, " "*padLen,
self.startMark, ref.prefix() ] )
#-- 3 --
# [ breakPos := the position where the breakpoint would fall ]
breakPos = 1 + int ( self.__breakPoint *
( len(head) + len(tail) ) )
#-- 4 --
# [ if breakPos > (size of head) ->
# I
# else ->
# head := head with characters past breakPos removed
# tail := (characters from head past breakPos) + tail ]
#--
# Note: This step handles the case where the keyword is
# too long to fit between the permute point and the end of the
# permuted line. When this happens, we move characters from
# the head to the tail so that linkifier() only wraps a link
# around the part that will fit. Without this precaution,
# we run the risk of placing the closing </a> tag before its
# corresponding <a href=...> tag.
#--
if breakPos <= len(head):
tail = head[breakPos:] + tail
head = head[:breakPos]
#-- 5 --
# [ return (characters from tail past breakPos) +
# (head with a link wrapped around it) +
# (characters from tail up to breakPos) ]
return "".join ( [ tail[breakPos:],
linkifier ( ref, head ),
tail[:breakPos] ] )
# - - - - - c l a s s K w i c R e f - - - - -
class KwicRef:
"Represents one occurrence of a keyword in its context."
# - - - K w i c R e f . _ _ i n i t _ _ - - -
def __init__ ( self, s, startPos, endPos, value ):
"Constructor for KwicRef"
self.s = s
self.startPos = startPos
self.endPos = endPos
self.value = value
# - - - K w i c R e f . s h o w - - -
def show ( self ):
"Return a (prefix, keyword, suffix) triple"
return ( self.prefix(), self.keyword(), self.suffix() )
# - - - K w i c R e f . p r e f i x - - -
def prefix ( self ):
"Return self's prefix."
return self.s [ : self.startPos ]
# - - - K w i c R e f . k e y w o r d - - -
def keyword ( self ):
"Return self's keyword."
return self.s [ self.startPos : self.endPos ]
# - - - K w i c R e f . s u f f i x - - -
def suffix ( self ):
"Return self's suffix."
return self.s [ self.endPos : ]
# - - - K w i c R e f . _ _ s t r _ _ - - -
def __str__ ( self ):
"Return self's entire context string."
return self.s
# - - - K w i c R e f . _ _ c m p _ _ - - -
def __cmp__ ( self, other ):
"""Compare two KwicRef objects lexically.
The effective key value of an entry is the keyword,
followed by one space, followed by the suffix and then
the prefix as tiebreakers. The space causes all the
lines with the same keyword to group together; this
was added pursuant to a bug found 1996-10-13 in the
Icon version.
Example: the left-hand column shows the (keyword, suffix)
and the right-hand column shows how they'd sort without
the extra space:
("icon", "setting",...) ICONSETTING...
("icons", "ileaf",...) ICONSILEAF...
("icon", "text",...) ICONTEXT...
but that second line should be with "icons", not with "icon".
"""
#-- 1 --
# [ keyA := self's keyword + " " + self's suffix + self's prefix,
# upshifted
# keyB := other's keyword + " " + other's suffix + other's
# prefix, upshifted ]
format = "%s %s%s"
keyA = format % (self.keyword(), self.suffix(), self.prefix())
keyB = format % (other.keyword(), other.suffix(), other.prefix())
#-- 2 --
return cmp ( keyA.upper(), keyB.upper() )
# - - - - - c l a s s E x c l u s i o n L i s t - - - - -
class ExclusionList:
"""Represents the exclusion file, listing keywords not to be indexed."
State/Invariants:
self.__set: [ a Set object containing self's keywords ]
"""
# - - - E x c l u s i o n L i s t . _ _ i n i t _ _ - - -
def __init__ ( self, exclusionFileName=None, keyCset=None ):
"Constructor for ExclusionList. Reads the exclusion file."
#-- 1 --
# [ self.__set := a new, empty Set object ]
self.__set = Set()
#-- 2 --
if exclusionFileName is None:
return
#-- 3 --
# [ if exclusionFileName names a readable file ->
# exclusionFile := that file
# else -> raise IOError ]
exclusionFile = open ( exclusionFileName )
#-- 4 --
# [ self.__set +:= clumps of characters in keyCset
# found in exclusionFile, upshifted ]
for line in exclusionFile:
#-- 4 body --
# [ self.__set +:= clumps of characters in keyCset
# found in line ]
self.__readLine ( line, keyCset )
#-- 5 --
exclusionFile.close()
# - - - E x c l u s i o n L i s t . _ _ r e a d L i n e - - -
def __readLine ( self, line, keyCset ):
"""Extract the words from one line of the exclusion file.
[ line is a string ->
self.__set +:= elements from keywords found in line,
upshifted ]
"""
#-- 1 iteration --
# [ startx,endx := the starting and ending slice indices of
# each clump of characters in keyCset found in line,
# upshifted, in turn ]
for startx, endx in findKeywords ( line, keyCset ):
#-- 1 body --
# [ self.__set +:= line[startx:endx], upshifted ]
self.__set.add ( line[startx:endx].upper() )
# - - - E x c l u s i o n L i s t . _ _ c o n t a i n s _ _ - - -
def __contains__ ( self, x ):
"Does self contain x, case-insensitive?"
return x.upper() in self.__set
To unsubscribe from this list, send a post to docbook-apps-unsubscribe@lists.oasis-open.org, or visit http://www.oasis-open.org/mlmanage/.