This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug locale/21302] New: strcoll does not correctly follow locale-specified order in some cases
- From: "lautgesetz at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Fri, 24 Mar 2017 11:12:57 +0000
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=21302
Bug ID: 21302
Summary: strcoll does not correctly follow locale-specified
order in some cases
Product: glibc
Version: 2.23
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: lautgesetz at gmail dot com
Target Milestone: ---
Created attachment 9939
--> https://sourceware.org/bugzilla/attachment.cgi?id=9939&action=edit
test file
Consider the following file sorttest.txt, pre-sorted in Unicode codepoint
order:
!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥
𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼
If I run "LC_COLLATE=C sort sorttest.txt", using the hard-coded C locale, the
output is unchanged -- that is, it is sorted in codepoint order as expected.
However, if I run "LC_COLLATE=C.UTF-8 sort sorttest.txt" on Ubuntu, which uses
a locale file defining collation straightforwardly in the codepoint order, I
get the following unexpected result:
𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼
!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥
To get more detail on what's going on, one can run:
$ LC_ALL=C.UTF-8 sort sorttest.txt | perl -CSAD -ne 'chomp; printf "%s\tU+%05X\n", $_, ord'
𐀎 U+1000E
𐀘 U+10018
𐀛 U+1001B
𫛛 U+2B6DB
𫛞 U+2B6DE
𫛢 U+2B6E2
𫛭 U+2B6ED
𫛶 U+2B6F6
𫛸 U+2B6F8
𫟷 U+2B7F7
𫟼 U+2B7FC
! U+00021
ｽﾞざら U+0FF7D
ｾｰﾘﾝｸﾞﾎﾞｰﾄは U+0FF7E
ﾓｴ U+0FF93
￥ U+0FFE5
Another example:
$ perl -CSAD -E 'for my $b (0, 0xF000, 0x10000) { for my $c (0x00, 0x01, 0x21) { $_ = $b + $c; printf "%s\tU+%05X\n", chr, $_} }' | LC_COLLATE=C.UTF-8 sort
U+00000
𐀀 U+10000
𐀁 U+10001
𐀡 U+10021
U+00001
! U+00021
U+0F000
U+0F001
U+0F021
The issue appears to be that code points above U+FFFF sort before all others,
except that U+0000 somehow always comes first.
It's definitely not just the "sort" command that's broken. I first noticed this
issue in a PostgreSQL database that was using the C.UTF-8 locale's collation
order. Given the straightforwardness of the locale file in question
(/usr/share/i18n/locales/C on Ubuntu), it's hard to believe the fault lies
outside glibc.
The above commands were tested on Ubuntu 16.04 with glibc 2.23, but the same
issue has been reproduced on earlier and later versions of glibc (2.19, 2.24,
2.25).