This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.


coding neutral regexec to do UTF-8 ranges

From: Joël Krähemann <jkraehemann at gmail dot com>
To: libc-help at sourceware dot org
Date: Fri, 23 Jun 2017 11:50:35 +0200
Subject: coding neutral regexec to do UTF-8 ranges
Authentication-results: sourceware.org; auth=none
Reply-to: jkraehemann-guest at users dot alioth dot debian dot org

Hi

The regexec() function has trouble matching byte ranges that describe
UTF-8 sequences. To make them work, the environment variables have to
be set like this:

LANG=C
LC_ALL=C

With these settings, my application cannot apply any gettext
translations. Here is a sample of such an expression used by my
application:

static const char *chars_pattern =
"^(([0-9])|(\xC2\xB7)|((\xCC[\x80-\xBF])|(\xCD[\x80-\xAF]))|((\xE2\x80\xBF)|(\xE2\x81\x80)))";

With any UTF-8 locale on my system, the expression above causes the
program to fail, because the byte ranges are interpreted as multi-byte
sequences. That is definitely wrong in this situation.
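
One possible workaround, shown only as a sketch: switch the calling
thread to a "C" LC_CTYPE locale just for the regcomp()/regexec() calls
and restore the previous locale afterwards, so the rest of the program
can keep its UTF-8 locale for gettext. The helper name match_bytes() is
hypothetical and not part of gsequencer or glibc. The sketch assumes
the POSIX.1-2008 functions newlocale() and uselocale() are available,
and that the pattern behaves in this temporary locale the same way it
does with LC_ALL=C.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Pattern quoted from the message above. */
static const char *chars_pattern =
"^(([0-9])|(\xC2\xB7)|((\xCC[\x80-\xBF])|(\xCD[\x80-\xAF]))|((\xE2\x80\xBF)|(\xE2\x81\x80)))";

/*
 * Hypothetical helper: compile and run `pattern` against `subject`
 * while the calling thread uses the "C" locale, so bytes are matched
 * one by one. Returns 1 on match, 0 on no match, -1 on error.
 */
static int match_bytes(const char *pattern, const char *subject)
{
    /* Build a locale object based on "C"; the global locale is untouched. */
    locale_t c_loc = newlocale(LC_CTYPE_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    /* Install it for this thread only and remember the previous one. */
    locale_t old = uselocale(c_loc);

    regex_t re;
    int result = -1;
    if (regcomp(&re, pattern, REG_EXTENDED) == 0) {
        int rc = regexec(&re, subject, 0, NULL, 0);
        if (rc == 0)
            result = 1;
        else if (rc == REG_NOMATCH)
            result = 0;
        regfree(&re);
    }

    /* Restore the thread's previous locale before returning. */
    uselocale(old);
    freelocale(c_loc);
    return result;
}
```

With this in place, the program could call setlocale(LC_ALL, "") at
startup as usual and use match_bytes(chars_pattern, str) wherever the
byte-oriented expression is needed. Compiling the pattern on every call
keeps the sketch short; caching the compiled regex_t would be a
separate design decision, and whether a cached pattern keeps its byte
semantics outside the "C" locale is not something this sketch relies on.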

http://www.nongnu.org/gsequencer/

The file using UTF-8 ranges:
http://git.savannah.nongnu.org/cgit/gsequencer.git/tree/ags/lib/ags_turtle.c?h=0.8.x

The main function setting environment variables:
http://git.savannah.nongnu.org/cgit/gsequencer.git/tree/ags/gsequencer_main.c?h=0.8.x

Best,
Joël

Follow-Ups:
- Re: coding neutral regexec to do UTF-8 ranges
  - From: Florian Weimer
