This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Fw: Bug#180914: sed: Another .* crash


    > I will try distilling a test case that uses regcomp and regexec

regcomp and regexec do not support \' so the crash should be considered
sed's bug.  However \'.* seems to match always, rather than only at end of
buffer.

Here is a scheme of the DFA that the matcher has created at the end
of re_compile_pattern:

node 0
{opr.ctx_type = NEXT_ENDBUF_CONSTRAINT, type = ANCHOR, constraint = 0,
duplicated = 0}
        epsilon transition to 4
        transition to 2

node 1
{type = OP_PERIOD, constraint = 0, duplicated = 0}
        transition to 2

node 2
{type = OP_DUP_ASTERISK, constraint = 0, duplicated = 0}
        epsilon transition to 1 and 3
        transition to 3

node 3
{type = 8}, type = END_OF_RE, constraint = 0, duplicated = 0}
        no outgoing edge

node 4
{type = OP_DUP_ASTERISK, constraint = NEXT_ENDBUF_CONSTRAINT,
duplicated = 1}
        epsilon transition to 5 and 6
        transition to 0

node 5
{type = OP_PERIOD, constraint = NEXT_ENDBUF_CONSTRAINT, duplicated = 1}
        transition to 2 <--------

node 6
{type = END_OF_RE, constraint = NEXT_ENDBUF_CONSTRAINT, duplicated = 1}
        no outgoing edge


The transition marked with the arrow seems wrong to me, because it drops the
constraint on being at the end of the buffer.  Indeed
check_halt_node_context is called with nodes 0,4,5,6, but then in
check_matching we proceed with the DFA to find the longest match and we
have:

(gdb) p *cur_state->nodes.elems@(cur_state->nodes.nelem)
$14 = {0, 4, 5, 6}
1027          cur_state = transit_state (&err, preg, mctx, cur_state,
...
1027          cur_state = transit_state (&err, preg, mctx, cur_state,
(gdb) p *cur_state->nodes.elems@(cur_state->nodes.nelem)
$20 = {1, 2, 3}

Gotcha.

The nodes are created by duplicate_node_closure and duplicate_node, which
both indeed transform epsilon transitions out of an ANCHOR into constraints:

  if (dfa->nodes[org_idx].type == ANCHOR)
    dfa->nodes[dup_idx].constraint |= dfa->nodes[org_idx].opr.ctx_type;
  dfa->nodes[dup_idx].duplicated = 1;

Neither however makes an attempt to use the duplicates in a transition (they
do for nodes to which there is an epsilon transition):

        {
          /* In case of the node can't epsilon-transit, don't duplicate the
             destination and store the original destination as the
             destination of the node.  */
          dfa->nexts[clone_node] = dfa->nexts[org_node];
          break;
        }

The more common regex \>.* does not expose the bug because build_trtable
seems to special case some constraints:

  /* If the new state has context constraint,
     build appropriate states for these contexts.  */
  if (dest_states[i]->has_constraint)
    {
      dest_states_word[i] = re_acquire_state_context (&err, dfa, &follows,
                                                      CONTEXT_WORD);
      if (BE (dest_states_word[i] == NULL && err != REG_NOERROR, 0))
        goto out_free;
      dest_states_nl[i] = re_acquire_state_context (&err, dfa, &follows,
                                                    CONTEXT_NEWLINE);
      if (BE (dest_states_nl[i] == NULL && err != REG_NOERROR, 0))
        goto out_free;
    }

Either duplicate_node is wrong if the destination had been duplicated, or
build_trtable is incomplete.

Now, after spending over an hour on this, a little whining: to figure out
all this, I had to understand not only the code (obviously), but also the
data structures.  What I liked about the old matcher was that it had a
lot of debugging utilities to print the NFA and so on; having this inside
the new matcher would be great.

Paolo



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]