The GNU extension '\B' has always meant non-\b. The dfa.[ch] code included in grep and gawk still handles it this way. Try echo ' ' | grep ' \B ' on your system, try gsub(/ \B /,...) in gawk-3.1.1 or gawk '/ \B /' with current gawk. I have checked the Perl documentation; it also defines '\B' as non-\b. But current regex has changed to interpret '\B' as inword space. Try the above gsub with current awk. See also http://lists.gnu.org/archive/html/bug-gnu-utils/2005-01/msg00087.html IMHO, the current regex code is not correct.
If the '\B' functionality is fixed, then the man page will also need to be updated. It currently says that '\B' "matches the empty string within a word." Also, I noticed that '\b' is not documented in the man page at all. Thank you.
regex.texi:

@title Regex
@subtitle edition 0.12a
@subtitle 19 September 1992
@author Kathryn A. Hargreaves
@author Karl Berry

has:

@cindex @samp{\B}
This operator (represented by @samp{\B}) matches the empty string within a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but @samp{dirty \Brat} doesn't match @samp{dirty rat}.

So it seems to me that current glibc regex works as documented.
When Kathy and I wrote that description of \B more than a decade ago, we did not have any deep reasoning behind it. In fact, we never thought of the possibility that "inverse of \b" and "empty string within a word" were two different things. We were just trying to give simple examples. The manual was never meant to be taken as gospel or a standard, we were more trying to describe how things worked than as a prescription of how things should work. (It also desperately needs updating.) So, I'm sorry that so much effort has gone into implementing our off-the-cuff description of \B. However, it seems to me that it would be better for users if \B in the new regex had the same definition as it's always had -- not \b. I don't see any advantage to being incompatible with the past here; just the opposite.
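The difference between "inverse of \b" and "empty string within a word" can be made concrete with a small model. The following is a minimal sketch in C, not the glibc or dfa.c implementation: the function names are hypothetical, and it assumes word characters are ASCII alphanumerics plus underscore, with the start and end of the string treated as non-word.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: a word character is an ASCII alphanumeric or '_'. */
bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True when index i is inside the string and holds a word character. */
bool word_at(const char *s, size_t len, size_t i)
{
    return i < len && is_word_char((unsigned char)s[i]);
}

/* \b: exactly one side of position pos is a word character.
   Position pos sits between s[pos - 1] and s[pos]. */
bool at_word_boundary(const char *s, size_t pos)
{
    size_t len = strlen(s);
    bool before = pos > 0 && word_at(s, len, pos - 1);
    bool after = word_at(s, len, pos);
    return before != after;
}

/* \B read as "not \b": both sides are word characters, or neither is. */
bool not_word_boundary(const char *s, size_t pos)
{
    return !at_word_boundary(s, pos);
}

/* \B read as "empty string within a word": both sides are word characters. */
bool inside_word(const char *s, size_t pos)
{
    size_t len = strlen(s);
    return pos > 0 && word_at(s, len, pos - 1) && word_at(s, len, pos);
}
```

Under this model the two readings agree on the manual's examples: position 1 of "crate" satisfies both, and position 6 of "dirty rat" (between the space and the "r") satisfies neither. They disagree between two non-word characters: position 1 of a two-space string satisfies not_word_boundary but not inside_word.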
I'll second Karl's motion here; the current regex should be fixed to work like the old one did. This brings dfa and regex back into line, which both grep and gawk need, and provides backwards compatibility. The manual can and probably should be changed, although that's a separate issue. The compatibility with perl is also a welcome thing to have.
Patch here: <http://sources.redhat.com/ml/libc-hacker/2005-01/msg00066.html>
Hi. First, I appreciate the quick response on this issue.

I am thus saddened to say that the behavior between the current CVS and the original regex is not identical. And to be honest, I'm not sure which is "correct". Here's a script showing the difference:

$ cat typescript
Script started on Sun Jan 30 14:03:31 2005
bash-2.05b$ cat gnureop2.awk
BEGIN {
	print (" " ~ / \B /) # test dfa matcher
	a = " "
	gsub(/\B/, "x", a) # test regex matcher
	print a
}
bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
1
x
bash-2.05b$ gawk-3.1.4 -f gnureop2.awk # previous glibc regex
1
bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
1
x x x
bash-2.05b$
Script done on Sun Jan 30 14:04:26 2005
Subject: Re: mishandled '\B'

> bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
> 1
> x
> bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
> 1
> x x x

It looks like old regex special-cased the first and last character so that it was neither a word character nor a non-word character. The current behavior is more consistent. FWIW, PCRE also shows the same behavior as current CVS glibc regex.

Paolo
(In reply to comment #6)
> print (" " ~ / \B /) # test dfa matcher

I tried the following:

$ gawk 'BEGIN{print "a b" ~ /\B/}'
0
$ gawk 'BEGIN{print " b" ~ /\B/}'
1
$ gawk 'BEGIN{print "a " ~ /\B/}'
1

This proves that the dfa matcher has the same opinion as the current CVS regex. I'd like to conclude that the old regex contained a bug which was not discovered until now, when it has been dead for more than 2 years. (FWIW, I have also verified that perl has the same behaviour as PCRE and new regex and dfa.c.)

The current regex.c seems OK. Thank you again, Jakub.
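The three gawk results above are what a plain "not \b" reading predicts when the start and end of the string count as non-word. The sketch below is a hypothetical model in C, not code from regex.c or dfa.c; the function names are made up for illustration, and word characters are assumed to be ASCII alphanumerics plus underscore.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when index i is inside the string and holds a word character
   (assumption: ASCII alphanumeric or '_'). Anything outside the string,
   i.e. before the start or past the end, counts as non-word. */
bool is_word_at(const char *s, size_t len, size_t i)
{
    unsigned char c;

    if (i >= len)
        return false;
    c = (unsigned char)s[i];
    return isalnum(c) || c == '_';
}

/* Counts the positions 0..len at which "not \b" holds, i.e. where the
   characters on both sides are word characters or both are non-word. */
size_t count_non_boundaries(const char *s)
{
    size_t len = strlen(s);
    size_t count = 0;

    for (size_t pos = 0; pos <= len; pos++) {
        bool before = pos > 0 && is_word_at(s, len, pos - 1);
        bool after = is_word_at(s, len, pos);
        if (before == after)
            count++;
    }
    return count;
}
```

In this model count_non_boundaries("a b") is 0, so /\B/ cannot match, while " b" and "a " each have exactly one qualifying position (the start and the end of the string respectively), in line with the 0, 1, 1 printed by gawk. A subject of two spaces has three such positions, one before, one between and one after the spaces.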
At this point, I too am inclined to stay with current CVS regex behavior. To the best of my knowledge, Emacs still uses the old regex; it may or may not be worthwhile mentioning this to RMS or whoever maintains Emacs. Then again, it may also be best to let sleeping dogs lie. :-) Thanks again to Jakub and everyone else. -- Arnold
Subject: Bug 693

CVSROOT: /cvs/glibc
Module name: libc
Branch: glibc-2_3-branch
Changes by: roland@sources.redhat.com 2005-02-16 11:09:25

Modified files:
	posix : bug-regex19.c regcomp.c tst-rxspencer.c regex_internal.h
	posix/rxspencer: tests

Log message:
2005-01-26 Jakub Jelinek <jakub@redhat.com>

[BZ #693]
* posix/regex_internal.h (DUMMY_CONSTRAINT): Rename to...
(WORD_DELIM_CONSTRAINT): ...this.
(NOT_WORD_DELIM_CONSTRAINT): Define.
(re_context_type): Add INSIDE_NOTWORD and NOT_WORD_DELIM, change WORD_DELIM to use WORD_DELIM_CONSTRAINT.
* posix/regcomp.c (peek_token): For \B create NOT_WORD_DELIM anchor instead of INSIDE_WORD.
(parse_expression): Handle NOT_WORD_DELIM constraint.
* posix/bug-regex19.c (tests): Adjust tests that relied on \B being inside word instead of not word delim.
* posix/tst-rxspencer.c (mb_frob_pattern): Don't frob escaped characters.
* posix/rxspencer/tests: Add some new tests.

Patches:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/bug-regex19.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.6&r2=1.6.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.87&r2=1.87.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/tst-rxspencer.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.7&r2=1.7.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.h.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.56&r2=1.56.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/rxspencer/tests.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.5&r2=1.5.2.1