Bug 693 - mishandled '\B'
Summary: mishandled '\B'
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: GOTO Masanori
Blocks: libc235
Reported: 2005-01-25 10:31 UTC by Stepan Kasal
Modified: 2019-04-10 09:12 UTC
fweimer: security-


Description Stepan Kasal 2005-01-25 10:31:44 UTC
The GNU extension '\B' has always meant non-\b.
The dfa.[ch] code included in grep and gawk still handles it this way.
Try echo '  ' | grep ' \B ' on your system, or try gsub(/ \B /,...) in gawk-3.1.1
or gawk '/ \B /' with current gawk.

I have checked the Perl documentation; it also defines '\B' as non-\b.

But the current regex has changed to interpret '\B' as the empty string within
a word.  Try the above gsub with current awk.

See also http://lists.gnu.org/archive/html/bug-gnu-utils/2005-01/msg00087.html

IMHO, the current regex code is not correct.
Comment 1 Tony Ernst 2005-01-25 13:53:28 UTC
If the '\B' functionality is fixed, then the man page will also need to be updated.
It currently says that '\B' "matches the empty string within a word."  Also, I
noticed that '\b' is not documented in the man page at all.

Thank you.
Comment 2 Jakub Jelinek 2005-01-25 14:08:17 UTC
regex.texi:
@title Regex
@subtitle edition 0.12a
@subtitle 19 September 1992
@author Kathryn A. Hargreaves
@author Karl Berry

has:
@cindex @samp{\B}

This operator (represented by @samp{\B}) matches the empty string within
a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
@samp{dirty \Brat} doesn't match @samp{dirty rat}.

so to me it seems that the current glibc regex works as documented.
Comment 3 Karl Berry 2005-01-25 17:11:26 UTC
When Kathy and I wrote that description of \B more than a decade ago, we did not
have any deep reasoning behind it.  In fact, we never thought of the possibility
that "inverse of \b" and "empty string within a word" were two different things.
We were just trying to give simple examples.
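
The difference between the two readings can be made concrete with a small model. The sketch below is not glibc code: the function names and the ASCII-only notion of a word character are assumptions made for illustration. It lists the positions in a string where '\B' would match under each reading.

```python
def is_word(ch):
    # Hypothetical word-character test: ASCII letters, digits, underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def neighbours(s, i):
    # (left_is_word, right_is_word) for the gap just before s[i].
    # Anything outside the string counts as a non-word character.
    left = i > 0 and is_word(s[i - 1])
    right = i < len(s) and is_word(s[i])
    return left, right

def not_word_boundary(s, i):
    # \B read as the inverse of \b: both sides are of the same kind.
    left, right = neighbours(s, i)
    return left == right

def inside_word(s, i):
    # \B read as "empty string within a word": both sides are word characters.
    left, right = neighbours(s, i)
    return left and right

def positions(pred, s):
    # All gaps 0..len(s) at which the predicate holds.
    return [i for i in range(len(s) + 1) if pred(s, i)]

# The manual's example string cannot tell the two readings apart ...
print(positions(not_word_boundary, "crate"))  # [1, 2, 3, 4]
print(positions(inside_word, "crate"))        # [1, 2, 3, 4]
# ... but a string of two spaces can.
print(positions(not_word_boundary, "  "))     # [0, 1, 2]
print(positions(inside_word, "  "))           # []
```

Under this model, the examples from regex.texi behave identically in both readings, which may be why the difference went unnoticed.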

The manual was never meant to be taken as gospel or a standard; we were
trying to describe how things worked rather than prescribe how they should
work. (It also desperately needs updating.)

So, I'm sorry that so much effort has gone into implementing our off-the-cuff
description of \B.  However, it seems to me that it would be better for users if
\B in the new regex had the same definition as it's always had -- not \b.  I
don't see any advantage to being incompatible with the past here; just the opposite.
Comment 4 Arnold Robbins 2005-01-26 13:44:18 UTC
I'll second Karl's motion here; the current regex should be fixed to work like
the old one did.  This brings dfa and regex back into line, which both grep
and gawk need, and provides backwards compatibility.  The manual can and probably
should be changed, although that's a separate issue.  The compatibility with
perl is also a welcome thing to have.
Comment 5 Jakub Jelinek 2005-01-26 17:33:30 UTC
Patch here: <http://sources.redhat.com/ml/libc-hacker/2005-01/msg00066.html>
Comment 6 Arnold Robbins 2005-01-30 12:09:29 UTC
Hi. First, I appreciate the quick response on this issue.  I am thus
saddened to say that the behavior between the current CVS and the original
regex is not identical.  And to be honest, I'm not sure which is "correct".
Here's a script showing the difference:

$ cat typescript 
Script started on Sun Jan 30 14:03:31 2005
bash-2.05b$ cat gnureop2.awk 
BEGIN {
        print ("  " ~ / \B /)   # test dfa matcher
        a = "  "
        gsub(/\B/, "x", a)      # test regex matcher
        print a
}
bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
1
 x 
bash-2.05b$ gawk-3.1.4 -f gnureop2.awk # previous glibc regex
1
  
bash-2.05b$ ./gawk -f gnureop2.awk      # current CVS glibc regex
1
x x x
bash-2.05b$ 
Script done on Sun Jan 30 14:04:26 2005
Comment 7 paolo.bonzini@lu.unisi.ch 2005-01-30 13:33:06 UTC
Subject: Re:  mishandled '\B'

> bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
> 1
>  x 
> bash-2.05b$ ./gawk -f gnureop2.awk      # current CVS glibc regex
> 1
> x x x

It looks like old regex special-cased the first and last character so 
that it was neither a word character nor a non-word character.  The 
current behavior is more consistent.  FWIW, PCRE also shows the same 
behavior as current CVS glibc regex.

Paolo
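
The three outputs quoted from comment 6 are consistent with three different rules for where '\B' may match. The sketch below models them. The edge special-casing attributed to the old regex is a guess reconstructed from its output, not taken from its source, and all function names are hypothetical.

```python
def is_word(ch):
    # Hypothetical word-character test: ASCII letters, digits, underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def sides(s, i):
    # Kind of character on each side of gap i; outside the string is non-word.
    left = i > 0 and is_word(s[i - 1])
    right = i < len(s) and is_word(s[i])
    return left, right

def rule_not_boundary(s, i):
    # Assumed rule for current CVS regex, dfa.c, PCRE: \B is "not \b".
    left, right = sides(s, i)
    return left == right

def rule_inside_word(s, i):
    # Assumed rule for the previous glibc regex: \B is "within a word".
    left, right = sides(s, i)
    return left and right

def rule_old_guess(s, i):
    # Guess at the old regex: like "not \b", but never at the string edges.
    return 0 < i < len(s) and rule_not_boundary(s, i)

def gsub_empty(rule, repl, s):
    # Insert repl at every gap where the rule matches, like gsub(/\B/, repl).
    out = []
    for i in range(len(s) + 1):
        if rule(s, i):
            out.append(repl)
        if i < len(s):
            out.append(s[i])
    return "".join(out)

for rule in (rule_old_guess, rule_inside_word, rule_not_boundary):
    print(rule.__name__, repr(gsub_empty(rule, "x", "  ")))
```

With the two-space input, the three rules yield ' x ', '  ' and 'x x x' respectively, matching the three gawk runs in the transcript.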
Comment 8 Stepan Kasal 2005-01-31 05:40:22 UTC
(In reply to comment #6)
>         print ("  " ~ / \B /)   # test dfa matcher

I tried the following:
$ gawk 'BEGIN{print "a b" ~ /\B/}'
0
$ gawk 'BEGIN{print " b" ~ /\B/}'
1
$ gawk 'BEGIN{print "a " ~ /\B/}'
1

This proves that the dfa matcher has the same opinion as the current CVS regex.

I'd like to conclude that the old regex contained a bug which was not discovered
until now, when it has been dead for more than 2 years.

(FWIW, I have also verified that perl has the same behaviour as PCRE and new
regex and dfa.c.)
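
The same three checks can be repeated in another engine as a cross-check. The sketch below uses Python's re module, on the assumption that it follows the Perl-style "not \b" definition for non-empty input. The helper name is hypothetical.

```python
import re

def has_non_boundary(s):
    # 1 if \B matches somewhere in s, else 0, mirroring gawk's  s ~ /\B/
    return 1 if re.search(r"\B", s) else 0

for s in ["a b", " b", "a "]:
    print(repr(s), has_non_boundary(s))
```

For these inputs the results are 0, 1 and 1, the same as the gawk runs above.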

The current regex.c seems OK.  Thank you again, Jakub.
Comment 9 Arnold Robbins 2005-01-31 07:53:58 UTC
At this point, I too am inclined to stay with current CVS regex behavior.
To the best of my knowledge, Emacs still uses the old regex; it may or may not
be worthwhile mentioning this to RMS or whoever maintains Emacs.  Then again,
it may also be best to let sleeping dogs lie. :-)

Thanks again to Jakub and everyone else. -- Arnold
Comment 10 Sourceware Commits 2005-02-16 11:09:41 UTC
Subject: Bug 693

CVSROOT:	/cvs/glibc
Module name:	libc
Branch: 	glibc-2_3-branch
Changes by:	roland@sources.redhat.com	2005-02-16 11:09:25

Modified files:
	posix          : bug-regex19.c regcomp.c tst-rxspencer.c 
	                 regex_internal.h 
	posix/rxspencer: tests 

Log message:
	2005-01-26  Jakub Jelinek  <jakub@redhat.com>
	
	[BZ #693]
	* posix/regex_internal.h (DUMMY_CONSTRAINT): Rename to...
	(WORD_DELIM_CONSTRAINT): ...this.
	(NOT_WORD_DELIM_CONSTRAINT): Define.
	(re_context_type): Add INSIDE_NOTWORD and NOT_WORD_DELIM,
	change WORD_DELIM to use WORD_DELIM_CONSTRAINT.
	* posix/regcomp.c (peek_token): For \B create NOT_WORD_DELIM
	anchor instead of INSIDE_WORD.
	(parse_expression): Handle NOT_WORD_DELIM constraint.
	* posix/bug-regex19.c (tests): Adjust tests that relied on \B
	being inside word instead of not word delim.
	* posix/tst-rxspencer.c (mb_frob_pattern): Don't frob escaped
	characters.
	* posix/rxspencer/tests: Add some new tests.

Patches:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/bug-regex19.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.6&r2=1.6.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.87&r2=1.87.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/tst-rxspencer.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.7&r2=1.7.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.h.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.56&r2=1.56.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/rxspencer/tests.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.5&r2=1.5.2.1