Bug 23447

Summary: echo w | grep '[a-z]' not working in Swedish locale
Product: glibc Reporter: Jakob sellgren <jakob_sellgren>
Component: locale Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED DUPLICATE    
Severity: normal CC: carlos, ricilake
Priority: P2 Flags: fweimer: security-
Version: 2.27   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed: 2018-07-24 00:00:00

Description Jakob sellgren 2018-07-24 13:43:57 UTC
~/Git/glibc/build:1620> cat /etc/fedora-release 
Fedora release 28 (Twenty Eight)
~/Git/glibc/build:1621> rpm -q glibc
glibc-2.27-19.fc28.x86_64
~/Git/glibc/build:1622> echo w |  grep '[a-z]'
~/Git/glibc/build:1623> git status
HEAD detached at glibc-2.27
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   ../localedata/locales/sv_SE

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	./
	../config.log

~/Git/glibc/build:1624> echo w | LD_LIBRARY_PATH=~/Git/glibc/build/ grep '[a-z]'
w
~/Git/glibc/build:1625> 



The update of sv_SE was made with the following command:
git checkout cae87e64dca14f50da7bbd99085c7f5e413ad0f8 localedata/locales/sv_SE
Comment 1 Carlos O'Donell 2018-07-24 19:27:14 UTC
(In reply to Jakob sellgren from comment #0)
> ~/Git/glibc/build:1624> echo w | LD_LIBRARY_PATH=~/Git/glibc/build/ grep
> '[a-z]'

You cannot test like this. This uses the new library with the old locale data. When the locale data format changes this will fall back to C/POSIX for locales.

You need to recompile the locale in question and then use the appropriate LOCPATH and LC_ALL settings to use the new locale.

For details please see:
https://sourceware.org/glibc/wiki/Locales#Testing_Locales

Could you please retest?
Comment 2 Jakob sellgren 2018-07-25 04:29:15 UTC
$> LOCALE=sv_SE
$> export LOCPATH=$HOME/tmp/locale-test
$> mkdir -p $LOCPATH
$> I18NPATH=~/git_projects/glibc/localedata/ localedef -f UTF-8 -i $LOCALE $LOCPATH/$LOCALE.UTF-8
$> ls -l ~/tmp/locale-test/
total 4
drwxrwxr-x 3 jakob jakob 4096 25 jul 05.59 sv_SE.UTF-8/
$> echo w | grep '[a-z]'
w
$> unset LOCPATH
$> echo w | grep '[a-z]'

So it works with LOCPATH set, but not unset (I haven't studied the locale that much to see what is wrong here).
Comment 3 Carlos O'Donell 2018-07-25 13:42:48 UTC
(In reply to Jakob sellgren from comment #2)
> $> LOCALE=sv_SE
> $> export LOCPATH=$HOME/tmp/locale-test
> $> mkdir -p $LOCPATH
> $> I18NPATH=~/git_projects/glibc/localedata/ localedef -f UTF-8 -i $LOCALE
> $LOCPATH/$LOCALE.UTF-8
> $> ls -l ~/tmp/locale-test/
> total 4
> drwxrwxr-x 3 jakob jakob 4096 25 jul 05.59 sv_SE.UTF-8/
> $> echo w | grep '[a-z]'
> w
> $> unset LOCPATH
> $> echo w | grep '[a-z]'
> 
> So it works with LOCPATH set, but not unset (I haven't studied the locale
> that much to see what is wrong here).

Perfect, that is working as intended then. You must set LOCPATH and you must use the new locale that comes with the library. You cannot upgrade the library and keep using old binary locales; all of the old binary locales need to be recompiled to match the new library.
Comment 4 Rici Lake 2018-10-02 04:03:46 UTC
Please reopen this bug. It was reported again on StackOverflow here, and I subsequently tested it, to the best of my ability, with glibc 2.28, using the included version of localedef to compile the included locale file for sv_SE. I used the following C program, rather than grep, because the problem is readily visible with a simple regular expression:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

const char* pat = "[u-x]";
int main(int argc, char* argv[]) {
  const char* loc = setlocale(LC_ALL, "sv_SE.UTF-8");
  if (loc) printf("Locale: %s\n", loc); else exit(2);
  regex_t re;
  int status;
  if ((status = regcomp(&re, pat, REG_EXTENDED))) {
    char errbuf[128];
    regerror(status, &re, errbuf, 128);
    printf("Regex error %d: %s\n", status, errbuf);
    exit(1);
  }
  for (int i = 't'; i < 'z'; ++i) {
    char s[2] = {i, 0};
    printf("%c: %s\n", i, regexec(&re, s, 0, NULL, 0) == 0 ? "Ok" : "No");
  }
  return 0;
}

I used `make check` to generate the locale files in the build directory, and then compiled, linked and ran my little program using the same commands as were generated by the Makefile to run the bundled test programs:

$ (cd ../../glibc-2.28/localedata; gcc /home/rici/src/glibc-2.28-build/localedata/swedish.c -c -std=gnu11 -fgnu89-inline  -O2 -Wall -Werror -Wundef -Wwrite-strings -fmerge-all-constants -fno-stack-protector -frounding-math -g -Wstrict-prototypes -Wold-style-definition -fmath-errno         -U_FORTIFY_SOURCE   -I../include -I/home/rici/src/glibc-2.28-build/localedata  -I/home/rici/src/glibc-2.28-build  -I../sysdeps/unix/sysv/linux/x86_64/64  -I../sysdeps/unix/sysv/linux/x86_64  -I../sysdeps/unix/sysv/linux/x86/include -I../sysdeps/unix/sysv/linux/x86  -I../sysdeps/x86/nptl  -I../sysdeps/unix/sysv/linux/wordsize-64  -I../sysdeps/x86_64/nptl  -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux  -I../sysdeps/nptl  -I../sysdeps/pthread  -I../sysdeps/gnu  -I../sysdeps/unix/inet  -I../sysdeps/unix/sysv  -I../sysdeps/unix/x86_64  -I../sysdeps/unix  -I../sysdeps/posix  -I../sysdeps/x86_64/64  -I../sysdeps/x86_64/fpu/multiarch  -I../sysdeps/x86_64/fpu  -I../sysdeps/x86/fpu/include -I../sysdeps/x86/fpu  -I../sysdeps/x86_64/multiarch  -I../sysdeps/x86_64  -I../sysdeps/x86  -I../sysdeps/ieee754/float128  -I../sysdeps/ieee754/ldbl-96/include -I../sysdeps/ieee754/ldbl-96  -I../sysdeps/ieee754/dbl-64/wordsize-64  -I../sysdeps/ieee754/dbl-64  -I../sysdeps/ieee754/flt-32  -I../sysdeps/wordsize-64  -I../sysdeps/ieee754  -I../sysdeps/generic  -I.. -I../libio -I.   -D_LIBC_REENTRANT -include /home/rici/src/glibc-2.28-build/libc-modules.h -DMODULE_NAME=testsuite -include ../include/libc-symbols.h  -DPIC     -DTOP_NAMESPACE=glibc -o /home/rici/src/glibc-2.28-build/localedata/swedish.o -MD -MP -MF /home/rici/src/glibc-2.28-build/localedata/swedish.o.dt -MT /home/rici/src/glibc-2.28-build/localedata/swedish.o)

$ gcc -pie -Wl,-O1 -nostdlib -nostartfiles -o /home/rici/src/glibc-2.28-build/localedata/swedish    -Wl,-z,combreloc -Wl,-z,relro -Wl,--hash-style=both /home/rici/src/glibc-2.28-build/csu/Scrt1.o /home/rici/src/glibc-2.28-build/csu/crti.o `gcc  --print-file-name=crtbeginS.o` /home/rici/src/glibc-2.28-build/localedata/swedish.o /home/rici/src/glibc-2.28-build/support/libsupport_nonshared.a  -Wl,-dynamic-linker=/lib64/ld-linux-x86-64.so.2 -Wl,-rpath-link=/home/rici/src/glibc-2.28-build:/home/rici/src/glibc-2.28-build/math:/home/rici/src/glibc-2.28-build/elf:/home/rici/src/glibc-2.28-build/dlfcn:/home/rici/src/glibc-2.28-build/nss:/home/rici/src/glibc-2.28-build/nis:/home/rici/src/glibc-2.28-build/rt:/home/rici/src/glibc-2.28-build/resolv:/home/rici/src/glibc-2.28-build/mathvec:/home/rici/src/glibc-2.28-build/support:/home/rici/src/glibc-2.28-build/crypt:/home/rici/src/glibc-2.28-build/nptl /home/rici/src/glibc-2.28-build/libc.so.6 /home/rici/src/glibc-2.28-build/libc_nonshared.a -Wl,--as-needed /home/rici/src/glibc-2.28-build/elf/ld.so -Wl,--no-as-needed -lgcc -Wl,--as-needed -lgcc_s  -Wl,--no-as-needed `gcc  --print-file-name=crtendS.o` /home/rici/src/glibc-2.28-build/csu/crtn.o

$ env GCONV_PATH=/home/rici/src/glibc-2.28-build/iconvdata LOCPATH=/home/rici/src/glibc-2.28-build/localedata LC_ALL=C   /home/rici/src/glibc-2.28-build/elf/ld-linux-x86-64.so.2 --library-path /home/rici/src/glibc-2.28-build:/home/rici/src/glibc-2.28-build/math:/home/rici/src/glibc-2.28-build/elf:/home/rici/src/glibc-2.28-build/dlfcn:/home/rici/src/glibc-2.28-build/nss:/home/rici/src/glibc-2.28-build/nis:/home/rici/src/glibc-2.28-build/rt:/home/rici/src/glibc-2.28-build/resolv:/home/rici/src/glibc-2.28-build/mathvec:/home/rici/src/glibc-2.28-build/support:/home/rici/src/glibc-2.28-build/crypt:/home/rici/src/glibc-2.28-build/nptl /home/rici/src/glibc-2.28-build/localedata/swedish


The output is:

Locale: sv_SE.UTF-8
t: No
u: Ok
v: Ok
w: No
x: Ok
y: No

The same output is produced if I use glibc v2.27 with the locale file created with localedef v2.28. On the other hand, both glibc v2.27 and glibc v2.28 produce the following output with a locale file created with localedef v2.27:

Locale: sv_SE.UTF-8
t: No
u: Ok
v: Ok
w: Ok
x: Ok
y: No

If I use the regular expression `[u-w]` instead of `[u-x]`, regcomp returns an error indication when the new locale files are used:

Regex error 11: Invalid range end

With the v2.27 locale files, there is no error and `w` is accepted by the regular expression.

I believe all these results are entirely predictable. They are the result of this patch, from last February: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=159738548130d5ac4fe6178977e940ed5f8cfdc4#patch94

That patch changes the collation redefinitions for `W` and `w`, which had been:

-reorder-after <U0056>
-<U0057> <v>;<BAS>;<CAP>;<w> % W
-reorder-after <U0076>
-<U0077> <v>;<BAS>;<MIN>;<w> % w

In the patched file, they are

+% &v<<<V<<w<<<W
+<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
+<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w

Note that the reorder-after directives have been removed. As a result, in the modified file, the redefinitions of `W` and `w` appear as part of a stanza which starts

+reorder-after <AFTER-Z>

in which they appear between thorn and U-umlaut:

+<U00FE> "<S0074><S0068>";"<BASE><BASE>";"<COMPAT><COMPAT>";IGNORE % þ
+<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
+<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w
+<U00DC> <S0079>;"<BASE><TREMA>";"<CAP><MIN>";IGNORE % Ü

Note that the weight orderings are correct and basically unaltered, although the symbolic names have changed. Indeed, the sortation tests included in the glibc distribution still succeed, because string comparison uses the ordered weightings. Furthermore, `W` and `w` are still recognized as part of `[[:alpha:]]`, because the LC_CTYPE definitions were not changed.

However, there is a quirk in the application of LC_COLLATE to regular expressions, which is described at length in the POSIX rationales for Locales (http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap07.html) and Regular Expressions (http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html). As described by the LC_COLLATE section of Chapter 7, there are three different orderings which have been considered (out of a total of five possible orderings of increasing sophistication, the last two of which were considered out of scope):

1. Binary (codepoint) order
2. Character Order
3. String Order

In strings, characters can be (and are) sorted contextually. This allows, for example, certain German letters to be alphabetized as though they were spelled in an alternative fashion (ß is alphabetized as though written "ss", and ä as though written "ae"), while digraphs (still used in a few languages, but not in Spanish for many years) can be condensed into single letters. POSIX collation is targeted at string ordering, but string ordering is not always possible. So two orderings are defined. String ordering is defined by the vector of weights and collating sequences supplied for each character, but character ordering is simply defined by the order in which the collation elements appear in the locale definition file. ("The character order is defined by the order in which characters and elements are specified between the order_start and order_end keywords.")

This is made even more explicit in the rationale for Regular Expressions. There is a long historical overview, which starts with:

> Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standard instead required collating element order (CEO): the order that collating elements were specified between the order_start and order_end keywords in the LC_COLLATE category of the current locale. CEO had some advantages in portability over the native character order, but it also had some disadvantages:

One might expect that the follow-up would be a different recommendation, but no; the subsequent discussion describes what must have been a long and frustrating committee debate with no clear outcome:

> While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for querying the CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leaving the standard alone was also considered, but rejected due to the problems described above.

So the end outcome:

> The current standard leaves unspecified the behavior of a range expression outside the POSIX locale.

However, in 7.3.2 of XBD, where the LC_COLLATE section of the locale definition file is described, the following text appears:

> At least the following capabilities are provided:

> 1. Multi-character collating elements. Specification of multi-character collating elements (that is, sequences of two or more characters to be collated as an entity).

> 2. User-defined ordering of collating elements. Each collating element shall be assigned a collation value defining its order in the character (or basic) collation sequence. This ordering is used by regular expressions and pattern matching and, unless collation weights are explicitly specified, also as the collation weight to be used in sorting.

> 3. Multiple weights and equivalence classes. Collating elements can be assigned one or more (up to the limit {COLL_WEIGHTS_MAX}, as defined in <limits.h>) collating weights for use in sorting. The first weight is hereafter referred to as the primary weight.

So there is at least a guideline: Collation Element Ordering is to be used by regular expressions; collating weights are to be used in sorting.

---

The reorder-after directive implemented by glibc's localedef is not part of the POSIX standard. The extension is useful precisely to allow modularised locale definition files to assign a useful Collation Element Ordering as well as reassigning sortation weights. That is what was done for `W` and `w` in the sv_SE locale up to glibc v2.27, and while the change in v2.28 does not appear to violate any standard, it nonetheless seems like a regression.

The simple solution would be to move lines 106 through 116 of localedata/locales/sv_SE (https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;h=aa28c23776408e593890883ebb4c8d70b971fe15;hb=159738548130d5ac4fe6178977e940ed5f8cfdc4#l106) out of the reorder-after <AFTER-Z> section, and reinsert the reorder-after blocks which were present in v2.27, placing `W` after `V` and `w` after `v`.
Comment 5 Carlos O'Donell 2018-10-02 13:32:56 UTC
(In reply to Rici Lake from comment #4)
> The simple solution would be to move lines 106 through 116 of
> localedata/locales/sv_SE
> (https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;
> h=aa28c23776408e593890883ebb4c8d70b971fe15;
> hb=159738548130d5ac4fe6178977e940ed5f8cfdc4#l106) out of the reorder-after
> <AFTER-Z> section, and reinsert the reorder-after blocks which were present
> in v2.27, placing `W` after `V` and `w` after `v`.

This is equivalent to supporting rational ranges:
"Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale."
https://sourceware.org/bugzilla/show_bug.cgi?id=23393

I'm marking this as a duplicate of that bug.

*** This bug has been marked as a duplicate of bug 23393 ***