This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list .
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: [docbook-apps] Permuted Index, may be OT

From: "John W. Shipman" <john at nmt dot edu>
To: Riaan Bredenkamp <riaanbre at webmail dot co dot za>
Cc: docbook-apps at lists dot oasis-open dot org
Date: Fri, 26 Mar 2004 09:01:49 -0700 (MST)
Subject: Re: [docbook-apps] Permuted Index, may be OT
I have a working permuted index, but it's not related to
DocBook.  Documentation is at:

    http://www.nmt.edu/tcc/projects/webstyler/

Follow the link for `StylIndex' near the bottom of the page.

An example of an index built with this system is at:

    http://www.nmt.edu/tcc/help/

Click on the `Index' link on the top line of the page.

I have attached Python source code for the module that
does permuted indexing.  Feel free to borrow or adapt
it.

John Shipman (john@nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 128, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
  ``Let's go outside and commiserate with nature.''  --Dave Farber

================================================================
"""kwic.py:  Module for Key Word In Context (KWIC) indices.
    $Revision: 1.4 $  $Date: 2003/08/06 20:28:39 $

    KWIC indexing is a technique for finding occurrences of keywords
    in a set of strings.  It is a venerable technique dating back to
    the 1960s.  Definitions:

        KEYWORD:  A contiguous string of keyword characters that
            is not in the exclusion list.  Typically, keyword
            characters include letters, digits, and possibly
            pseudo-letters such as "_" and "-".

        EXCLUSION LIST:  A set of words that should be omitted
            from the index, such as `a', `and', and `the'.

        EXCLUSION FILE:  The file containing the exclusion list.
            The format is free: each contiguous clump of
            keyword characters is considered an excluded word.

    The retrieved strings may be presented in either of two forms:

      - UNPERMUTED FORM:  The string is divided into the PREFIX
        (everything up to the keyword), the keyword, and the
        SUFFIX (everything after the keyword.  Suppose the string
        is `Driving Miss Daisy'. Assuming none of those words are
        in the exclusion list, this string would appear in three
        index entries:
            [ "Driving Miss ", "Daisy", "" ]    # Keyword "Daisy"
            [ "", "Driving", " Miss Daisy" ]    # Keyword "Driving"
            [ "Driving ", "Miss", " Daisy" ]    # Keyword "Miss"

      - PERMUTED FORM:  All strings are padded to the maximum length
        of any string in the index, plus a GUTTER consisting of
        a few spaces, and then rotated so that the keyword always
        starts at the same point, the PERMUTE POINT.

        Here is an example of how the above three index entries
        might appear in a permuted form if the longest string in
        the index, plus the gutter, is 40 characters, and the
        character "=" is inserted before the beginning of each
        string to show its original beginning.  Also suppose
        that the `permute point' is 0.25, meaning that the
        keywords line up on column 10.  Further suppose that
        the string `Harold and Maude' has also been indexed,
        and the word `and' is in the exclusion list.  Here is
        how this index would look in permuted form:

            |          +-- Permute point              |
            |          v                              |
            |0         1         2         3         4|
            |01234567890123456789012345678901234567890|
            |-----------------------------------------|
            |ving Miss Daisy                      =Dri|
            |         =Driving Miss Daisy             |
            |         =Harold and Maude               |
            |arold and Maude                        =H|
            | =Driving Miss Daisy                     |

  Exports:
    class KwicIndex:  Represents a set of strings.
        KwicIndex(keyCset=None, exclusionFileName=None, permutePoint=None,
                  gutterSize=None, startMark=None, endMark=None,
                  keyPrefix=None, keySuffix=None):
          [ (keyCset is a Cset defining the keyword characters, defaulting
              to DEFAULT_KEY_CSET) and
            (exclusionFileName is the name of an exclusion file, defaulting
              to no exclusions) and
            (permutePoint is the permute point in [0.0,1.0), defaulting
              to DEFAULT_PERMUTE_POINT) and
            (gutterSize is the gutter size as a positive integer,
              defaulting to DEFAULT_GUTTER_SIZE) and
            (startMark is a string to be inserted before the index entry,
              defaulting to DEFAULT_START_MARK) and
            (endMark is a string to be inserted after the index entry,
              defaulting to "") and
            (keyPrefix is a string to be inserted before the keyword,
              defaulting to "") and
            (keySuffix is a string to be inserted after the keyword,
              defaulting to "") ->
                if exclusionFileName is not None and does not name a
                readable file ->
                  raise IOError
                else ->
                  return a new, empty KwicIndex object with those values ]
        .keyCset:           [ as passed to constructor, read-only ]
        .exclusionFileName: [ as passed to constructor, R-O ]
        .permutePoint:      [ as passed to constructor with defaulting, R-O ]
        .gutterSize:        [ as passed to constructor with defaulting, R-O ]
        .startMark:         [ as passed to constructor with defaulting, R-O ]
        .endMark:           [ as passed to constructor with defaulting, R-O ]
        .keyPrefix:         [ as passed to constructor with defaulting, R-O ]
        .keySuffix:         [ as passed to constructor with defaulting, R-O ]
        .add ( s, value=None ):
          [ if s is a nonempty string ->
              self  :=  self with s added to its set of strings and
                value associated with it ]
        .genRefs ( startKey=None, stopKey=None ):
          [ (startKey is a string or None) and
            (stopKey is a string or None) ->
              generate a sequence of KwicRef objects in ascending order
              by (keyword+prefix+suffix), ignoring case; if startKey
              is provided, all keywords < startKey are omitted; if
              stopKey is provided, all keywords >= stopKey are omitted ]
        .maxLength():
          [ returns the length of the longest string in self ]
        .permute(ref):
          [ ref is a KwicRef ->
              return ref permuted according to self's parameters ]
        .permuteLink(ref, linkifier):
          [ (ref is a KwicRef) and
            (linkifier is a procedure with calling sequence
                linkifier(ref, linkText)
              and intended function
                [ (ref is a KwicRef) and (linkText is a string) ->
                    return a string containing an HTML hyperlink
                    whose link text is (linkText) and whose href
                    attribute is derived from ref ]
              then ->
                return a string consisting of the ref, permuted
                according to self's parameters, and with 
                (self.keyPrefix+keyword+self.keySuffix) turned into
                a link by linkifier (or as much of that string as
                fits between the permute point and the end of the
                permuted line) ]
    class KwicRef:      Represents an occurrence of a keyword in a string.
        KwicRef(s, startPos, endPos, value):
          [ (s is a nonempty string) and
            (startPos and endPos define the position of the keyword
              as a nonempty slice of s in the usual Python convention
              of s[startPos:endPos]) and
            (value is any object) ->
              return a new KwicRef object with those values
        .s:             [ as passed to constructor, read-only ]
        .startPos:      [ as passed to constructor, read-only ]
        .endPos:        [ as passed to constructor, read-only ]
        .value:         [ as passed to constructor, read-only ]
        .show():
          [ return a triple (prefix, keyword, suffix) where each
            element is a string, prefix may be empty, and suffix may
            be empty ]
        .prefix():      [ return the prefix from self ]
        .keyword():     [ return the keyword from self ]
        .suffix():      [ return the suffix from self ]
        .__str__(self): [ return self.s ]
        .__cmp__(self,other):
          [ other is a KwicRef ->
              if self.value is lexically before other.value ->
                return -1
              else if self.value is lexically after other.value ->
                return 1
              else -> return 0 ]
    class ExclusionList:  Represents the exclusion file.
        ExclusionList ( exclusionFileName=None, keyCset=None ):
          [ ( exclusionFileName names the exclusion file, or None
              to start with an empty exclusion set ) and
            ( keyCset is a Cset enumerating the keyword characters,
              defaulting to DEFAULT_KEY_CSET ) ->
                if  exclusionFileName names a readable file ->
                  return a new ExclusionList object containing all
                  the unique keywords in that file
                else -> raise IOError ]
        .__contains__(self, x):     # `x in self' Python operator
          [ x is a string ->
              if self contains x, case-insensitive ->
                return 1
              else -> return 0 ]
"""

#================================================================
# IMPORTS
#----------------------------------------------------------------

from __future__ import generators       # Allow 2.2 generators
import string                           # Standard Python string library

#--
# Library routines from /u/john/tcc/python/lib 
#--
    
from set import *           # String set object
from cset import *          # Character set object
from log import *           # Error logging object
from scan import *          # Stream scanning object
from skiplist import *      # SkipList ordered container class


# - - -   f i n d K e y w o r d s   - - -

def findKeywords ( s, cset ):
    """Generates slice indices of all contiguous strings in cset.

      [ (s is a string) and
        (cset is a Cset object defining all the keyword characters) ->
          generate (startx, endx) tuples such that each slice
          s[startx:endx] is a contiguous clump of keyword characters in s ]
    """

    #-- 1 --
    startx  =  None

    #-- 2 --
    #   startx  :=  index of the last keyword character not
    #               preceded by a keyword character
    #   generate (startx, endx) slice indices defining all keywords
    #   that have non-keyword characters following ]
    for  x in range ( len ( s ) ):
        #-- 2 body --
        # [ if (startx is None) and (s[x] is a keyword character) ->
        #     startx  :=  x
        #   if (startx is not None) and (s[x] is a keyword character) ->
        #     I
        #   if (startx is None) and (s[x] is not a keyword character) ->
        #     I
        #   if (startx is not None) and (s[x] is not a keyword character) ->
        #     yield (startx,x)
        #     startx  :=  None ]
        if  cset.has(s[x]):     # s[x] is a keyword character
            if  startx is None:
                startx  =  x
        else:                   # s[x] is not a keyword character
            if  startx is not None:
                yield (startx, x)
                startx  =  None

    #-- 3 --
    if  startx:
        yield ( startx, len(s) )


# - - -   d e f a u l t e r   - - -

def defaulter ( value, defaultValue ):
    """Implements the "?:" operator

      [ if value is None ->
          return defaultValue
        else -> return value ]
    """
    if value is None:
        return defaultValue
    else:
        return value


# - - - - -   c l a s s   K w i c I n d e x   - - - - -

class KwicIndex:
    """Object to represent a complete index.

      State/Invariants:
        .__breakPoint:  [ 1.0 - self.permutePoint ]
        .__exclusionList:
          [ an ExclusionList object containing the set
            of words from self.exclusionFile, if any ]
        .__keyList:
          [ a SkipList object containing all references as KwicRef objects ]
        .__maxLength:
          [ if self.__keyList is empty -> 0
            else -> length of the longest string in self.__keyList ]
    """
    DEFAULT_KEY_CSET        =  Cset ( string.letters + string.digits + "_" )
    DEFAULT_PERMUTE_POINT   =  0.3      # Default for permutePoint
    DEFAULT_GUTTER_SIZE     =  2        # Default for gutterSize
    DEFAULT_START_MARK      =  "="      # Default for startMark


# - - -   K w i c I n d e x . _ _ i n i t _ _   - - -

    def __init__ ( self, keyCset=None, exclusionFileName=None,
                   permutePoint=None, gutterSize=None,
                   startMark=None, endMark=None,
                   keyPrefix=None, keySuffix=None):
        "Constructor for KwicIndex"

        #-- 1 --
        # [ self  :=  self with all parameter invariants in place ]
        self.keyCset            =  defaulter ( keyCset,
                                               self.DEFAULT_KEY_CSET )
        self.exclusionFileName  =  exclusionFileName
        self.permutePoint       =  defaulter ( permutePoint,
                                               self.DEFAULT_PERMUTE_POINT )
        self.gutterSize         =  defaulter ( gutterSize,
                                               self.DEFAULT_GUTTER_SIZE )
        self.startMark          =  defaulter ( startMark,
                                               self.DEFAULT_START_MARK )
        self.endMark            =  defaulter ( endMark,   "" )
        self.keyPrefix          =  defaulter ( keyPrefix, "" )
        self.keySuffix          =  defaulter ( keySuffix, "" )

        #-- 2 --
        # [ self.__breakPoint    :=  as invariant
        #   self.__keyList       :=  an empty SkipList object that allows
        #                            duplicates
        #   self.__maxLength     :=  0 ]
        self.__breakPoint   =  1.0 - self.permutePoint
        self.__keyList      =  SkipList ( allowDups=1 )
        self.__maxLength    =  0

        #-- 3 --
        # [ if self.exclusionFileName is None ->
        #     self.__exclusionList  :=  a new, empty ExclusionList
        #   else if self.exclusionFileName names a readable file ->
        #     self.__exclusionList  :=  a new ExclusionList object
        #         containing all keywords in that file
        #   else ->
        #     raise IOError ]
        self.__exclusionList  =  ExclusionList ( exclusionFileName,
            self.keyCset )


# - - -   K w i c I n d e x . a d d   - - -

    def add ( self, s, value ):
        "Add string s to self"

        #-- 1 --
        # [ self.__keyList  +:=  KwicRef objects for all occurrences
        #     of keywords in s ]

        #-- 1 head --
        # [ for every keyword character that is not preceded by a
        #   keyword character ->
        #     startx  :=  index of that keyword character
        #     endx    :=  index of the first character after startx
        #                 that is not a keyword character
        #     <loop body> ]
        for  startx, endx in findKeywords(s, self.keyCset):
            #-- 1 body --
            # [ self.__keyList  +:=  a KwicRef object for string s
            #       and slice [startx:endx] for value=value ]
            self.__addWord ( s, startx, endx, value )

        #-- 2 --
        # [ self.__maxLength  :=  as invariant ]
        self.__maxLength  =  max ( self.__maxLength, len(s) )



# - - -   K w i c I n d e x . _ _ a d d W o r d   - - -

    def __addWord ( self, s, startx, endx, value ):
        """Add one entry in self, for one occurrence of a keyword.

          [ (s is a string) and
            (0 <= startx <= endx < len(s)) ->
              if  uppercased s[startx:endx] is in self.__exclusionList ->
                I
              else ->
                self.__keyList  +:=  a KwicRef object for string s
                                     and slice [startx:endx] ] and
                                     value=value ]
        """

        #-- 1 --
        # [ keyword  :=  uppercased s[startx:endx] ]
        keyword  =  s[startx:endx].upper()

        #-- 2 --
        # [ if keyword is in self.__exclusionList ->
        #     return
        #   else -> I ]
        if  keyword in self.__exclusionList:
            return

        #-- 3 --
        # [ self.__keyList  +:=  a KwicRef object for string s and
        #                        slice [start:endx] ]
        ref  =  KwicRef ( s, startx, endx, value )
        self.__keyList.insert ( ref )


# - - -   K w i c I n d e x . g e n R e f s   - - -

    def genRefs ( self, startKey=None, stopKey=None ):
        "Generate all the references in self, in index order."

        #-- 1 --
        # [ if self.__keyList is empty ->
        #     ref  :=  None
        #   else ->
        #     ref  :=  first element of self.__keyList ]
        ref  =  self.__keyList.first()

        #-- 2 --
        # [ ref             :=  None
        #   self.__keyList  :=  self.__keyList advanced to end
        #   yield all elements of (ref + self.__keyList) that
        #   are in the range [startKey,stopKey), case-insensitive, in order ]
        while  ref is not None:
            #-- 2 loop --
            # [ if ref is in the range [startKey,stopKey) ->
            #     yield ref
            #   else -> I
            #   In any case ->
            #     ref, self__keyList  :=  first(self.__keyList),
            #                             last(self.__keyList) ]

            #-- 2.1 --
            # [ if ref is in the range [startKey,stopKey) ->
            #     yield ref
            #   else -> I ]
            key  =  ref.keyword().upper()
            if ( ( ( startKey is None ) or ( startKey <= key ) ) and
                 ( ( stopKey is None ) or ( key < stopKey ) ) ):
                yield ref

            #-- 2.2 --
            # [ if  self.__keyList is at end ->
            #     ref  :=  None
            #   else ->
            #     ref  :=  next element of self.__keyList
            #     self.__keyList  :=  self.__keyList advanced one ]
            ref  =  self.__keyList.next()


# - - -   K w i c I n d e x . m a x L e n g t h   - - -

    def maxLength ( self ):
        "Return the maximum length of self's strings."
        return self.__maxLength


# - - -   K w i c I n d e x . p e r m u t e   - - -

    def permute ( self, ref ):
        "Produce the permuted form of a reference."

        #-- 1 --
        # [ padLen  :=  self.__maxLength + self.gutterSize -
        #       (length of ref) ]
        padLen  =  ( self.__maxLength + self.gutterSize -
                     len ( ref.s ) )

        #-- 2 --
        # [ text  :=  self.keyPrefix + ref's keyword + self.keySuffix +
        #             ref's suffix + self.endMark + (padLen spaces) +
        #             self.startMark + ref's prefix ]
        text  =  "".join ( [ self.keyPrefix, ref.keyword(), self.keySuffix, 
                             ref.suffix(), self.endMark, " "*padLen,
                             self.startMark, ref.prefix() ] )

        #-- 3 --
        # [ return text, broken into two pieces at a position dictated
        #       by self.__breakPoint, and those pieces swapped ]
        breakPos  =  1 + int ( self.__breakPoint * len(text) )
        return text[breakPos:] + text[:breakPos]


# - - -   K w i c I n d e x . p e r m u t e L i n k   - - -

    def permuteLink ( self, ref, linkifier ):
        "Make self into a hyperlink."

        #-- 1 --
        # [ padLen  :=  self.__maxLength + self.gutterSize -
        #               (length of ref.s) ]
        padLen  =  self.__maxLength + self.gutterSize - len(ref.s)

        #-- 2 --
        # [ head  :=  ref's keyword decorated with keyword prefix & suffix
        #   tail  :=  ref's suffix followed by ref's keyword, decorated
        #             with start and mark, with padLen spaces inserted
        #             at the wraparound point ]
        head  =  "".join ( [ self.keyPrefix, ref.keyword(), self.keySuffix ] )
        tail  =  "".join ( [ ref.suffix(), self.endMark, " "*padLen,
                             self.startMark, ref.prefix() ] )

        #-- 3 --
        # [ breakPos  :=  the position where the breakpoint would fall ]
        breakPos  =  1 + int ( self.__breakPoint *
                               ( len(head) + len(tail) ) )

        #-- 4 --
        # [ if breakPos > (size of head) ->
        #     I
        #   else ->
        #     head  :=  head with characters past breakPos removed
        #     tail  :=  (characters from head past breakPos) + tail ]
        #--
        # Note: This step handles the case where the keyword is
        # too long to fit between the permute point and the end of the
        # permuted line.  When this happens, we move characters from
        # the head to the tail so that linkifier() only wraps a link
        # around the part that will fit.  Without this precaution,
        # we run the risk of placing the closing </a> tag before its
        # corresponding <a href=...> tag.
        #--
        if  breakPos <= len(head):
            tail  =  head[breakPos:] + tail
            head  =  head[:breakPos]

        #-- 5 --
        # [ return (characters from tail past breakPos) +
        #       (head with a link wrapped around it) +
        #       (characters from tail up to breakPos) ]
        return "".join ( [ tail[breakPos:],
                           linkifier ( ref, head ),
                           tail[:breakPos] ] )



# - - - - -   c l a s s   K w i c R e f   - - - - -

class KwicRef:
    "Represents one occurrence of a keyword in its context."


# - - -   K w i c R e f . _ _ i n i t _ _   - - -

    def __init__ ( self, s, startPos, endPos, value ):
        "Constructor for KwicRef"

        self.s         =  s
        self.startPos  =  startPos
        self.endPos    =  endPos
        self.value     =  value


# - - -   K w i c R e f . s h o w   - - -

    def show ( self ):
        "Return a (prefix, keyword, suffix) triple"
        return ( self.prefix(), self.keyword(), self.suffix() )


# - - -   K w i c R e f . p r e f i x   - - -

    def prefix ( self ):
        "Return self's prefix."
        return self.s [ : self.startPos ]


# - - -   K w i c R e f . k e y w o r d   - - -

    def keyword ( self ):
        "Return self's keyword."
        return self.s [ self.startPos : self.endPos ]


# - - -   K w i c R e f . s u f f i x   - - -

    def suffix ( self ):
        "Return self's suffix."
        return self.s [ self.endPos : ]


# - - -   K w i c R e f . _ _ s t r _ _   - - -

    def __str__ ( self ):
        "Return self's entire context string."
        return self.s


# - - -   K w i c R e f . _ _ c m p _ _   - - -

    def __cmp__ ( self, other ):
        """Compare two KwicRef objects lexically.

          The effective key value of an entry is the keyword,
          followed by one space, followed by the suffix and then
          the prefix as tiebreakers.  The space causes all the
          lines with the same keyword to group together; this
          was added pursuant to a bug found 1996-10-13 in the
          Icon version.

          Example: the left-hand column shows the (keyword, suffix)
          and the right-hand column shows how they'd sort without
          the extra space:
                ("icon", "setting",...)     ICONSETTING...
                ("icons", "ileaf",...)      ICONSILEAF...
                ("icon", "text",...)        ICONTEXT...
          but that second line should be with "icons", not with "icon".
        """

        #-- 1 --
        # [ keyA  :=  self's keyword + " " + self's suffix + self's prefix,
        #             upshifted
        #   keyB  :=  other's keyword + " " + other's suffix + other's
        #             prefix, upshifted ]
        format  =  "%s %s%s"
        keyA  =  format % (self.keyword(), self.suffix(), self.prefix())
        keyB  =  format % (other.keyword(), other.suffix(), other.prefix())

        #-- 2 --
        return  cmp ( keyA.upper(), keyB.upper() )



# - - - - -   c l a s s   E x c l u s i o n L i s t   - - - - -

class ExclusionList:
    """Represents the exclusion file, listing keywords not to be indexed."

      State/Invariants:
        self.__set:     [ a Set object containing self's keywords ]
    """

# - - -   E x c l u s i o n L i s t . _ _ i n i t _ _   - - -

    def __init__ ( self, exclusionFileName=None, keyCset=None ):
        "Constructor for ExclusionList.  Reads the exclusion file."

        #-- 1 --
        # [ self.__set      :=  a new, empty Set object ]
        self.__set      =  Set()

        #-- 2 --
        if  exclusionFileName is None:
            return

        #-- 3 --
        # [ if exclusionFileName names a readable file ->
        #     exclusionFile  :=  that file
        #   else -> raise IOError ]
        exclusionFile  =  open ( exclusionFileName )

        #-- 4 --
        # [ self.__set  +:=  clumps of characters in keyCset
        #       found in exclusionFile, upshifted ]
        for  line in exclusionFile:
            #-- 4 body --
            # [ self.__set  +:=  clumps of characters in keyCset
            #       found in line ]
            self.__readLine ( line, keyCset )

        #-- 5 --
        exclusionFile.close()


# - - -   E x c l u s i o n L i s t . _ _ r e a d L i n e   - - -


    def __readLine ( self, line, keyCset ):
        """Extract the words from one line of the exclusion file.

          [ line is a string ->
              self.__set  +:=  elements from keywords found in line,
                               upshifted ]
        """

        #-- 1 iteration --
        # [ startx,endx  :=  the starting and ending slice indices of
        #       each clump of characters in keyCset found in line,
        #       upshifted, in turn ]
        for  startx, endx in findKeywords ( line, keyCset ):
            #-- 1 body --
            # [ self.__set  +:=  line[startx:endx], upshifted ]
            self.__set.add ( line[startx:endx].upper() )


# - - -   E x c l u s i o n L i s t . _ _ c o n t a i n s _ _   - - -

    def __contains__ ( self, x ):
        "Does self contain x, case-insensitive?"
        return  x.upper() in self.__set


To unsubscribe from this list, send a post to docbook-apps-unsubscribe@lists.oasis-open.org, or visit http://www.oasis-open.org/mlmanage/.
References:
- Re: [docbook-apps] Permuted Index, may be OT
  - From: Riaan Bredenkamp
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]