This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Regular expression functions (Was: Re: comments on December F&O draft)


It's great to see regular expression support in the F&O WD :)

David Carlisle wrote:
> xf:match
> This seems to be underspecified in cases that the matching regions
> overlap. if the regexp is aa and the string is aaa do you just get
> (1) or (1 2) (this also applies to xf:replace)

Most regular expression languages don't find overlapping matches, do
they? It seems to add a lot of extra complexity if they do.
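As a quick check of the non-overlapping behaviour, here is a small sketch using Python's re module. Python is only a stand-in for illustration: it is not the XML Schema regular expression syntax, and it reports 0-based offsets, so the code adds 1 to get XPath-style positions.

```python
import re

# Scan "aaa" for the pattern "aa". After a match, scanning resumes at the
# end of that match, so the overlapping candidate that starts at the
# second "a" is never reported.
starts = [m.start() + 1 for m in re.finditer("aa", "aaa")]  # 1-based positions
print(starts)
```

Under these assumptions only the match at position 1 is found, which corresponds to the (1) answer rather than (1 2).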

> Slightly worried that, since xpath sequences do not nest, this
> semantic will prevent any future extension to allow sed/emacs/perl
> style numbered subexpressions. Also it forces the system always to
> match the entire string, which may be rather long, rather than
> stopping once a match is found.
>
> If instead it just returned the position of the first match a
> plausible extension would be that if the regexp was
> \(aa\)xx\(bb\)
> then what was returned was a sequence consisting of the position of
> the entire match followed by the positions of each of the
> subexpressions.
> a future extension to xf:replace could then use (something
> equivalent to &1 or $1 or \1 in current regexp languages) to access
> the matched subexpressions in the replacement text.

In the description of xf:replace() it says:

  The value of $repval may use the standard regular expression syntax
  of "$N" (where N is some integer) to represent the N-th part of the
  matched pattern indicated by parentheses in the value of $regexp.

So it seems that the intention is that you can pull out specific
subexpressions, as illustrated in the example:

  replace("aFOOa aBARa", "a(.*)a", "b$1b")
    => "bFOOb bBARb"
    

Part of the reason, I think, that the xf:match() and xf:replace()
functions are so under-specified is that the regular expression syntax
in XML Schema Datatypes is just not designed for this kind of use - it
is purely designed for testing whether an entire string matches a
particular regular expression.

Thus there is no support in XML Schema regular expressions for
features that are common in other regular expression languages and the
functions that use them:

  - matches covering the entirety of the string, or a portion of the
    string
  - meta-characters matching the start and end of the string
  - single vs. global matches
  - greedy vs. parsimonious matches
  - non-capturing matches
  - backreferences within regular expressions

I think that the difference between a match on a portion of the string
and on the string as a whole could be managed by introducing
meta-characters matching the start and end of the string - if you want
the regular expression to match the entire string, then you can always
add these characters at the start and end of the regular expressions.

Taking the usual ^ and $ to match the start and end of the string, for
example, given the string "aFOOa aBARa":

  "a(.*)a"    matches   "aFOOa"
                        "aBARa"

[assuming parsimonious and non-overlapping matches]
                        
whereas:

  "^a(.*)a$"  matches   "aFOOa aBARa"

I think that these would be useful generally, to support a
regular-expression starts-with()-type function.
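The effect of the anchors can be sketched with Python's re module, which already supports ^ and $. This is an illustration only, not the XML Schema syntax: the non-greedy ".*?" is Python notation used here to mimic the parsimonious matching assumed above, and the variable names are made up for the example.

```python
import re

s = "aFOOa aBARa"

# Unanchored and non-greedy: each "a...a" run is found separately.
unanchored = [m.group(0) for m in re.finditer(r"a(.*?)a", s)]

# Anchored with ^ and $: the pattern has to cover the whole string.
anchored = [m.group(0) for m in re.finditer(r"^a(.*)a$", s)]

# A regular-expression starts-with(): anchor at the start only.
starts_with_a_upper = re.search(r"^a[A-Z]+a", s) is not None

print(unanchored)
print(anchored)
print(starts_with_a_upper)
```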


Looking at the individual vs. global match, with an individual match
on the string "aFOOa aBARa", the regular expression "a(.*)a" would
match "aFOOa". With a global match it would match the two strings
"aFOOa" and "aBARa" (assuming non-overlapping matches).

One option would be to always do a global match, with the xf:match()
function returning the sequence of all the matched strings, thus:

  xf:match("aFOOa aBARa", "a(.*)a")
    => ("aFOOa", "aBARa")

The user could then take the first of these results to get the same
result as with a single match. However, as David pointed out, this
would lead to problems if you had a long string or if you had
something like:

  xf:match("aFOOa aBARa", ".")

since "." matches every character, so the result would have one item
per character of the source string.
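The difference between a single and a global match, and the cost of a global match on ".", can be sketched in Python's re module. Again this is only an analogy: re.search and re.finditer stand in for the two modes, and the non-greedy ".*?" is Python notation for the parsimonious matching assumed here.

```python
import re

s = "aFOOa aBARa"

# Single match: stop at the first hit.
first = re.search(r"a(.*?)a", s).group(0)

# Global match: collect every non-overlapping hit.
all_matches = [m.group(0) for m in re.finditer(r"a(.*?)a", s)]

# A global match on "." returns one result per character.
per_char = re.findall(r".", s)

print(first)
print(all_matches)
print(len(per_char))
```

Taking the first item of the global result gives the same answer as the single match, but the global match has to walk the whole string to produce it.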


Which leads on to the question of parsimonious or greedy matches. From
what I gather, regular expressions elsewhere usually do greedy
matches, where, given the string "aFOOa aBARa", the regular expression
"a(.*)a" would match the entire string, since it starts and ends with
an 'a'. Given this, the example:

  replace("aFOOa aBARa", "a(.*)a", "b$1b")

in the F&O WD should actually return
  
  "bFOOa aBARb"

rather than "bFOOb bBARb" as given in the F&O WD. Given that greedy
matches are the norm in other languages, I think they should be the
norm here. Supporting parsimonious matches would involve supplementing
the XML Schema regular expression syntax, for example by allowing a
"?" after quantifiers. To match "aFOOa" rather than "aFOOa aBARa" you
would use the regular expression "a(.*?)a".
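The two behaviours can be compared in a language that already supports both. The sketch below uses Python's re module, where the replacement string refers to the first subexpression as \1 rather than $1; it illustrates greedy and non-greedy quantifiers in general and says nothing about how an F&O implementation would behave.

```python
import re

s = "aFOOa aBARa"

# Greedy: ".*" takes as much as it can, so the match runs from the first
# "a" to the last "a" in the string.
greedy = re.sub(r"a(.*)a", r"b\1b", s)

# Non-greedy: ".*?" takes as little as it can, so each "a...a" run is
# matched and replaced separately.
lazy = re.sub(r"a(.*?)a", r"b\1b", s)

print(greedy)  # bFOOa aBARb
print(lazy)    # bFOOb bBARb
```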


On what's returned by xf:match(), I think that getting the index of
the start of the match is insufficient - you also need to know the
length of the matched string in order to do anything useful with the
results of the match.

(Of course in some situations you might not be interested in the
results of the match, just in whether or not the string matches - for
this reason, I think an xf:test() function with the signature:

  xf:test(string? $srcval, string? $regexp) => boolean

would be useful, returning true if the string matches the regular
expression at all. Alternatively, you could have regular expression
versions of the current string manipulation functions contains(),
starts-with() and ends-with(), and possibly even of substring-before()
and substring-after().)
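A boolean test of this kind is easy to sketch. The function below is a hypothetical Python stand-in for the proposed xf:test(); the name regex_test is invented for the example, it uses Python's regular expression syntax rather than XML Schema's, and it does not try to model the optional (string?) arguments.

```python
import re


def regex_test(srcval: str, regexp: str) -> bool:
    """Return True if regexp matches anywhere in srcval (hypothetical helper)."""
    return re.search(regexp, srcval) is not None


s = "aFOOa aBARa"
print(regex_test(s, "BAR"))    # matches somewhere in the string
print(regex_test(s, "^BAR"))   # anchored at the start, so no match
print(regex_test(s, "^aFOO"))  # a regular-expression starts-with()
```

With start and end anchors available, this one function would also cover the starts-with() and ends-with() cases.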

There are a couple of possibilities about the result of xf:match() -
you could have a sequence of pairs of integers, each giving the start
index and length of the matched string. Or you could return a
sequence of the matched strings themselves (which would only be a
single string if the match was not global).
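The two possible result shapes can be put side by side in a short Python sketch. It assumes a global, non-greedy match and 1-based start positions; the variable names are illustrative only. Since XPath sequences do not nest, the pairs would presumably have to be flattened into a single sequence, which the sketch also shows.

```python
import re

s = "aFOOa aBARa"
pattern = r"a(.*?)a"

# Option 1: (start index, length) pairs, with 1-based start positions.
positions = [(m.start() + 1, m.end() - m.start())
             for m in re.finditer(pattern, s)]

# The same information as one flat sequence: start, length, start, length...
flat = [n for pair in positions for n in pair]

# Option 2: the matched strings themselves.
strings = [m.group(0) for m in re.finditer(pattern, s)]

# The pairs are enough to recover the strings with a substring operation.
recovered = [s[start - 1:start - 1 + length] for start, length in positions]

print(positions)
print(flat)
print(strings)
```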

I don't think that the xf:match() function needs to return the
positions of the subexpressions, or the subexpressions themselves,
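because that functionality could be achieved via xf:replace(). For
example, to find out what string was matched by the first
subexpression you could just use "$1" as the replace value (with a
regular expression that covers the whole string, so that nothing
outside the match is left in the result).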
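A sketch of pulling out a subexpression through replacement, using Python's re.sub as a stand-in for xf:replace() (Python writes the reference as \1 rather than $1). One thing the sketch brings out is that the pattern has to cover the whole string for this to work, because text outside the match is copied to the result unchanged.

```python
import re

s = "aFOOa aBARa"

# The pattern covers the whole string, so the result is just the first
# subexpression.
first_group = re.sub(r"^a(.*?)a.*$", r"\1", s)

# Without the anchors and the trailing ".*", every "a...a" run is replaced
# and the space between the runs is left in place.
unanchored = re.sub(r"a(.*?)a", r"\1", s)

print(first_group)  # FOO
print(unanchored)   # FOO BAR
```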

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

