| Summary: | mishandled '\B' | | |
|---|---|---|---|
| Product: | glibc | Reporter: | Stepan Kasal <skasal> |
| Component: | regex | Assignee: | GOTO Masanori <gotom> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | arnold, glibc-bugs-regex, glibc-bugs, tee |
| Priority: | P2 | Flags: | fweimer: security- |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | |
| Project(s) to access: | | ssh public key: | |
| Bug Depends on: | | | |
| Bug Blocks: | 724 | | |

Description
Stepan Kasal
2005-01-25 10:31:44 UTC
If the '\B' functionality is fixed, then the man page will also need to be updated. It currently says that '\B' "matches the empty string within a word." Also, I noticed that '\b' is not documented in the man page at all. Thank you.

regex.texi:
@title Regex
@subtitle edition 0.12a
@subtitle 19 September 1992
@author Kathryn A. Hargreaves
@author Karl Berry
has:
@cindex @samp{\B}
This operator (represented by @samp{\B}) matches the empty string within
a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
@samp{dirty \Brat} doesn't match @samp{dirty rat}.
So to me it seems that the current glibc regex works as documented.
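
The two readings of \B at issue in this report can be written out as position predicates. The following is a minimal Python sketch, not glibc code: `is_word_char`, `word_before`, `word_after`, `inside_word` and `not_word_boundary` are hypothetical helper names, and it assumes word characters are ASCII letters, digits and underscore. It checks the manual's two examples under both readings and then shows a position where the readings disagree.

```python
def is_word_char(ch):
    # Assumption: a word character is an ASCII letter, digit or underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def word_before(s, i):
    # True if the character just before position i is a word character.
    return i > 0 and is_word_char(s[i - 1])

def word_after(s, i):
    # True if the character at position i is a word character.
    return i < len(s) and is_word_char(s[i])

def inside_word(s, i):
    # Reading 1 (the manual's wording): the empty string within a word.
    return word_before(s, i) and word_after(s, i)

def not_word_boundary(s, i):
    # Reading 2: the inverse of \b, i.e. no word/non-word transition at i.
    return word_before(s, i) == word_after(s, i)

# Manual example 1: c\Brat\Be matches "crate" (positions 1 and 4) under both readings.
assert inside_word("crate", 1) and inside_word("crate", 4)
assert not_word_boundary("crate", 1) and not_word_boundary("crate", 4)

# Manual example 2: "dirty \Brat" fails on "dirty rat" under both readings
# (position 6 lies between the space and the "r").
assert not inside_word("dirty rat", 6)
assert not not_word_boundary("dirty rat", 6)

# The readings diverge between two non-word characters.
assert not inside_word("a--b", 2)
assert not_word_boundary("a--b", 2)
```

Both of the manual's examples are satisfied by either reading, which is why the examples alone cannot settle which definition was intended.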
When Kathy and I wrote that description of \B more than a decade ago, we did not have any deep reasoning behind it. In fact, we never thought of the possibility that "inverse of \b" and "empty string within a word" were two different things. We were just trying to give simple examples. The manual was never meant to be taken as gospel or a standard, we were more trying to describe how things worked than as a prescription of how things should work. (It also desperately needs updating.)

So, I'm sorry that so much effort has gone into implementing our off-the-cuff description of \B. However, it seems to me that it would be better for users if \B in the new regex had the same definition as it's always had -- not \b. I don't see any advantage to being incompatible with the past here; just the opposite.

I'll second Karl's motion here; the current regex should be fixed to work like the old one did. This brings dfa and regex back into line, which both grep and gawk need, and provides backwards compatibility. The manual can and probably should be changed, although that's a separate issue. The compatibility with perl is also a welcome thing to have.

Hi. First, I appreciate the quick response on this issue. I am thus saddened to say that the behavior between the current CVS and the original regex is not identical. And to be honest, I'm not sure which is "correct".
Here's a script showing the difference:
$ cat typescript
Script started on Sun Jan 30 14:03:31 2005
bash-2.05b$ cat gnureop2.awk
BEGIN {
print (" " ~ / \B /) # test dfa matcher
a = " "
gsub(/\B/, "x", a) # test regex matcher
print a
}
bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
1
x
bash-2.05b$ gawk-3.1.4 -f gnureop2.awk # previous glibc regex
1
bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
1
x x x
bash-2.05b$
Script done on Sun Jan 30 14:04:26 2005
Subject: Re: mishandled '\B'
> bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
> 1
> x
> bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
> 1
> x x x
It looks like old regex special-cased the first and last character so
that it was neither a word character nor a non-word character. The
current behavior is more consistent. FWIW, PCRE also shows the same
behavior as current CVS glibc regex.
Paolo
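
The difference Paolo describes can be modeled as follows. This is a hedged Python sketch rather than either regex implementation: `matches_not_b`, `matches_old_model` and `gsub_empty` are hypothetical names, the old behavior is modeled only from the description above (the first and last positions never satisfy \B), and the subject is assumed to be two spaces, which would be consistent with the `x x x` output shown in the transcript.

```python
def is_word_char(ch):
    # Assumption: a word character is an ASCII letter, digit or underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def matches_not_b(s, i):
    # \B as the inverse of \b: no word/non-word transition at position i.
    before = i > 0 and is_word_char(s[i - 1])
    after = i < len(s) and is_word_char(s[i])
    return before == after

def matches_old_model(s, i):
    # Hypothetical model of the old regex: the string edges never satisfy \B.
    return 0 < i < len(s) and matches_not_b(s, i)

def gsub_empty(match_at, repl, s):
    # Insert repl at every position where the empty-width test succeeds.
    out = []
    for i in range(len(s) + 1):
        if match_at(s, i):
            out.append(repl)
        if i < len(s):
            out.append(s[i])
    return "".join(out)

subject = " " * 2  # assumed two-space subject
print(repr(gsub_empty(matches_not_b, "x", subject)))      # 'x x x'
print(repr(gsub_empty(matches_old_model, "x", subject)))  # ' x '
```

Under this model the only disagreement is at the two string edges, which is the special-casing described above.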
(In reply to comment #6)
> print (" " ~ / \B /) # test dfa matcher

I tried the following:

$ gawk 'BEGIN{print "a b" ~ /\B/}'
0
$ gawk 'BEGIN{print " b" ~ /\B/}'
1
$ gawk 'BEGIN{print "a " ~ /\B/}'
1

This shows that the dfa matcher has the same opinion as the current CVS regex. I'd like to conclude that the old regex contained a bug which was not discovered until now, when it has been dead for more than 2 years. (FWIW, I have also verified that perl has the same behaviour as PCRE and new regex and dfa.c.) The current regex.c seems OK. Thank you again, Jakub.

At this point, I too am inclined to stay with current CVS regex behavior. To the best of my knowledge, Emacs still uses the old regex; it may or may not be worthwhile mentioning this to RMS or whoever maintains Emacs. Then again, it may also be best to let sleeping dogs lie. :-) Thanks again to Jakub and everyone else. -- Arnold

Subject: Bug 693

CVSROOT: /cvs/glibc
Module name: libc
Branch: glibc-2_3-branch
Changes by: roland@sources.redhat.com 2005-02-16 11:09:25

Modified files:
    posix : bug-regex19.c regcomp.c tst-rxspencer.c regex_internal.h
    posix/rxspencer: tests

Log message:
    2005-01-26 Jakub Jelinek <jakub@redhat.com>

    [BZ #693]
    * posix/regex_internal.h (DUMMY_CONSTRAINT): Rename to...
    (WORD_DELIM_CONSTRAINT): ...this.
    (NOT_WORD_DELIM_CONSTRAINT): Define.
    (re_context_type): Add INSIDE_NOTWORD and NOT_WORD_DELIM,
    change WORD_DELIM to use WORD_DELIM_CONSTRAINT.
    * posix/regcomp.c (peek_token): For \B create NOT_WORD_DELIM
    anchor instead of INSIDE_WORD.
    (parse_expression): Handle NOT_WORD_DELIM constraint.
    * posix/bug-regex19.c (tests): Adjust tests that relied on \B
    being inside word instead of not word delim.
    * posix/tst-rxspencer.c (mb_frob_pattern): Don't frob escaped
    characters.
    * posix/rxspencer/tests: Add some new tests.

Patches:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/bug-regex19.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.6&r2=1.6.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.87&r2=1.87.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/tst-rxspencer.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.7&r2=1.7.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.h.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.56&r2=1.56.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/rxspencer/tests.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.5&r2=1.5.2.1