This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.

[Bug locale/22889] New: strcoll/strxfrm broken for most characters in GB18030

From: "stephane.chazelas+sourceware at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Sat, 24 Feb 2018 20:52:18 +0000
Subject: [Bug locale/22889] New: strcoll/strxfrm broken for most characters in GB18030
Auto-submitted: auto-generated

https://sourceware.org/bugzilla/show_bug.cgi?id=22889

            Bug ID: 22889
           Summary: strcoll/strxfrm broken for most characters in GB18030
           Product: glibc
           Version: 2.26
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: stephane.chazelas+sourceware at gmail dot com
  Target Milestone: ---

In the en_GB.UTF-8 locale

$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | sort -u
𝐀

That's not fine (bug 18927), but it is expected: the order of those
𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 mathematical letter characters is not defined, so
they all sort the same.

Now, in a zh_CN.gb18030 locale:

$ LC_ALL=zh_CN.gb18030 locale charmap
GB18030
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' |
   (export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
𝐈
𝐉
𝐀
𝐁
𝐂
𝐃
𝐄
𝐅
𝐆
𝐇

(where sort is GNU sort which uses strcoll). If we look at the strxfrm() output
of the first few letters, we see:

$ export LC_ALL=zh_CN.gb18030
$ ./strxfrm $'\U1D400' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D401' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
[...]
$ ./strxfrm $'\U1D409' | od -An -vtx1
 05 03 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40a' | od -An -vtx1
 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40b' | od -An -vtx1
 05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6

where strxfrm.c is

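#include <locale.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  char buf[4096];
  size_t n;

  if (argc < 2)
    return 1;
  setlocale(LC_ALL, "");
  /* strxfrm() returns the length it needs; buf is only usable if it fits. */
  n = strxfrm(buf, argv[1], sizeof(buf));
  if (n >= sizeof(buf))
    return 1;
  printf("%s", buf);
  return 0;
}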

The strxfrm() output takes 10 different values and then loops back to the
first one.

If we look at those characters:

$ printf '\U1D400' | od -An -vtx1 -vtc
  94  33  8a  32
 224   3 212   2
$ printf '\U1D401' | od -An -vtx1 -vtc
  94  33  8a  33
 224   3 212   3
$ printf '\U1D40a' | od -An -vtx1 -vtc
  94  33  8b  32
 224   3 213   2

See how the last byte of both U+1D400 and U+1D40A is 0x32, the encoding of "2".

The strxfrm of "2" is:

$ ./strxfrm 2 | od -An -vtx1
 04 01 09 01 09

Its weights also appear in the strxfrm of U+1D400/U+1D40A:

 05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
    04       09       09

The strxfrm() of U+1D400 looks like the strxfrm() of a two-character string,
as if strxfrm() treated U+1D400 as the concatenation of some other character
and "2".

Note that mbtowc() is ok with those characters:

$ ./mbtowc $'\U1D400'
4 0X1D400

Where mbtowc.c is:

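#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char* argv[])
{
  wchar_t c;
  int i;

  setlocale(LC_ALL, "");
  for (i = 1; i < argc; i++) {
    int n;
    /* n is the number of bytes consumed, or -1 for an invalid sequence. */
    n = mbtowc(&c, argv[i], strlen(argv[i]));
    /* %X expects an unsigned int, so convert the wchar_t explicitly. */
    printf("%d %#X\n", n, (unsigned int)c);
  }
  return 0;
}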

This is not limited to those characters. It seems to affect many (over one
million) of the characters whose encoding ends in the encoding of a digit, but
not all of them: U+00C3 and a few thousand others are not affected. Characters
whose encoding does not end in the encoding of a digit are never affected. I
could not reproduce it with any other character encoding.
