This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug locale/22889] New: strcoll/strxfrm broken for most characters in GB18030
- From: "stephane.chazelas+sourceware at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Sat, 24 Feb 2018 20:52:18 +0000
- Subject: [Bug locale/22889] New: strcoll/strxfrm broken for most characters in GB18030
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=22889
Bug ID: 22889
Summary: strcoll/strxfrm broken for most characters in GB18030
Product: glibc
Version: 2.26
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: stephane.chazelas+sourceware at gmail dot com
Target Milestone: ---
In the en_GB.UTF-8 locale:
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' | sort -u
𝐀
That's not fine (bug 18927), but expected: the order of those
𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙 mathematical letter characters is not defined, so
they all sort the same.
Now, in a zh_CN.gb18030 locale:
$ LC_ALL=zh_CN.gb18030 locale charmap
GB18030
$ perl -C -le 'print chr$_ for 0x1D400..0x1D419' |
(export LC_ALL=zh_CN.gb18030; iconv -f utf-8 | sort -u | iconv -t utf-8)
𝐈
𝐉
𝐀
𝐁
𝐂
𝐃
𝐄
𝐅
𝐆
𝐇
(where sort is GNU sort, which uses strcoll()). If we look at the strxfrm()
output of the first few letters, we see:
$ export LC_ALL=zh_CN.gb18030
$ ./strxfrm $'\U1D400' | od -An -vtx1
05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D401' | od -An -vtx1
05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
[...]
$ ./strxfrm $'\U1D409' | od -An -vtx1
05 03 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40a' | od -An -vtx1
05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
$ ./strxfrm $'\U1D40b' | od -An -vtx1
05 05 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
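where strxfrm.c is:
#include <locale.h>
#include <string.h>
#include <stdio.h>

/* Print the strxfrm() transformation of argv[1] in the current locale. */
int main(int argc, char* argv[])
{
    char buf[4096];

    if (argc < 2)
        return 1;
    setlocale(LC_ALL, "");
    /* strxfrm() returns the length needed; a value >= sizeof(buf)
       means buf was too small and its contents are unspecified. */
    if (strxfrm(buf, argv[1], sizeof(buf)) >= sizeof(buf))
        return 1;
    printf("%s", buf);
    return 0;
}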
There are 10 different strxfrm() outcomes until it loops back to the beginning.
If we look at those characters:
$ printf '\U1D400' | od -An -vtx1 -vtc
94 33 8a 32
224 3 212 2
$ printf '\U1D401' | od -An -vtx1 -vtc
94 33 8a 33
224 3 212 3
$ printf '\U1D40a' | od -An -vtx1 -vtc
94 33 8b 32
224 3 213 2
See how the last byte of both U+1D400 and U+1D40A is 0x32, the encoding of "2".
The strxfrm of "2" is:
$ ./strxfrm 2 | od -An -vtx1
04 01 09 01 09
Those weights also appear in the strxfrm() output of U+1D400/U+1D40A:
05 04 01 09 09 01 09 09 01 01 c2 b6 02 c2 b6
   04    09       09
The strxfrm() of U+1D400 looks like the strxfrm() of a string of two
characters. It is as if strxfrm() considered U+1D400 to be the concatenation
of something and "2".
Note that mbtowc() is ok with those characters:
$ ./mbtowc $'\U1D400'
4 0X1D400
Where mbtowc.c is:
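#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

/* Decode the first multibyte character of each argument and print the
   number of bytes consumed and the resulting wide character. */
int main(int argc, char* argv[])
{
    wchar_t c;
    int i;

    setlocale(LC_ALL, "");
    for (i = 1; i < argc; i++) {
        int n;

        c = 0; /* keep the output defined if mbtowc() fails */
        n = mbtowc(&c, argv[i], strlen(argv[i]));
        /* wchar_t is not necessarily unsigned int, so cast for %X */
        printf("%d %#X\n", n, (unsigned int) c);
    }
    return 0;
}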
It's not limited to those characters. It seems to affect many (over one
million) of the characters whose encoding ends in the encoding of a digit, but
not all of them: U+00C3, for instance, and a few thousand others are not
affected. It only ever affects characters whose encoding ends in the encoding
of a digit. I could not reproduce it with any other character encoding.