This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Speed-up character range regexes by up to 2x

From: Aharon Robbins <arnold at skeeve dot com>
To: paolo dot bonzini at polimi dot it
Cc: libc-alpha at sources dot redhat dot com
Date: Mon, 12 Jan 2004 12:30:09 +0200
Subject: Re: [PATCH] Speed-up character range regexes by up to 2x

> I got a speed improvement 
> over 40% matching [A-Z][0-9] against ABCDEFGHIJKLMNOPQRSTUVWXYZ.

Thanks, I'll try it out.

> What follows the review of the "gawk guy"'s regex patch:
>
>  > +#ifdef RE_ENABLE_I18N
>  >    int icase = (dfa->mb_cur_max == 1 && (bufp->syntax & RE_ICASE));
>  > +#else
>  > +  int icase = (bufp->syntax & RE_ICASE);
>  > +#endif
>
> This is unneeded.

Why?  Look at the code:

| static void
| re_compile_fastmap_iter (bufp, init_state, fastmap)
|      regex_t *bufp;
|      const re_dfastate_t *init_state;
|      char *fastmap;
| {
|   re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
|   int node_cnt;
| #ifdef RE_ENABLE_I18N
|   int icase = (dfa->mb_cur_max == 1 && (bufp->syntax & RE_ICASE));
| #else
|   int icase = (bufp->syntax & RE_ICASE);
| #endif
|   for (node_cnt = 0; node_cnt < init_state->nodes.nelem; ++node_cnt)
|     {
|       int node = init_state->nodes.elems[node_cnt];
|       re_token_type_t type = dfa->nodes[node].type;
| 
|       if (type == CHARACTER)
| 	{
| 	  re_set_fastmap (fastmap, icase, dfa->nodes[node].opr.c);
                                   ^^^^^
                                   ^^^^^

If RE_ENABLE_I18N isn't defined, then `icase' isn't declared.
Or am I missing something?

>  > @@ -2558,8 +2564,8 @@
>  >                 ? __btowc (start_ch) : start_elem->opr.wch);
>  >      end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
>  >               ? __btowc (end_ch) : end_elem->opr.wch);
>  > -    cmp_buf[0] = start_wc;
>  > -    cmp_buf[4] = end_wc;
>  > +    cmp_buf[0] = start_wc != WEOF ? start_wc : start_ch;
>  > +    cmp_buf[4] = end_wc != WEOF ? end_wc : end_ch;
>  >      if (wcscoll (cmp_buf, cmp_buf + 4) > 0)
>  >        return REG_ERANGE;
>  
> I am not sure this is the fix; maybe it is better not to include the 
> character set if start_wc == WEOF || end_wc == WEOF, or to return 
> REG_ERANGE?

I was applying bandaids. I found that there were cases where I got WEOF
and things core dumped (or equivalent).  This made it work.  If there's
a better solution that involves actually understanding what's going on,
that's fine with me.

>  > +#if 0
>  > +/* Don't include this here. On some systems it sets RE_DUP_MAX to a
>  > + * lower value than GNU regex allows.  Instead, include it in
>  > + * regex.c, before include of <regex.h>, which correctly
>  > + * #undefs RE_DUP_MAX and sets it to the right value.
>  > + */
>  >  #include <limits.h>
>  > +#endif
>
> Can be completely removed?

OK by me.

>  > +#if _LIBC || __GNUC__ >= 3
>  > +# define BE(expr, val) __builtin_expect (expr, val)
>  > +#else
>  > +# define BE(expr, val) (expr)
>  > +# define inline
>  > +#endif
>  > +
>
> Isn't this already there?  Also, shouldn't "#define inline" be taken 
> care of in the configure script?

Don't know.  I had this diff in my code, so I forward ported it.
It may really be unncessary.

Thanks,

Arnold

Follow-Ups:
- Re: [PATCH] Speed-up character range regexes by up to 2x
  - From: bonzini

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]