This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.

[Bug locale/21302] New: strcoll does not correctly follow locale-specified order in some cases

From: "lautgesetz at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Fri, 24 Mar 2017 11:12:57 +0000
Subject: [Bug locale/21302] New: strcoll does not correctly follow locale-specified order in some cases
Auto-submitted: auto-generated

https://sourceware.org/bugzilla/show_bug.cgi?id=21302

            Bug ID: 21302
           Summary: strcoll does not correctly follow locale-specified
                    order in some cases
           Product: glibc
           Version: 2.23
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: lautgesetz at gmail dot com
  Target Milestone: ---

Created attachment 9939
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9939&action=edit
test file

Consider the following file sorttest.txt, pre-sorted in Unicode codepoint
order:

!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥
𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼

If I run "LC_COLLATE=C sort sorttest.txt", using the hard-coded C locale, the
output is unchanged; that is, it is sorted in codepoint order, as expected.
However, if I run "LC_COLLATE=C.UTF-8 sort sorttest.txt" on Ubuntu, whose
C.UTF-8 locale file defines collation in plain codepoint order, I get the
following unexpected result:

𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼
!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥

To get more detail on what's going on, one can run:

$ LC_ALL=C.UTF-8 sort sorttest.txt | perl -CSAD -ne 'chomp; printf "%s\tU+%05X\n", $_, ord'
𐀎       U+1000E
𐀘       U+10018
𐀛       U+1001B
𫛛       U+2B6DB
𫛞       U+2B6DE
𫛢       U+2B6E2
𫛭       U+2B6ED
𫛶       U+2B6F6
𫛸       U+2B6F8
𫟷       U+2B7F7
𫟼       U+2B7FC
!       U+00021
ｽﾞざら    U+0FF7D
ｾｰﾘﾝｸﾞﾎﾞｰﾄは     U+0FF7E
ﾓｴ      U+0FF93
￥       U+0FFE5

Another example:

$ perl -CSAD -E 'for my $b (0, 0xF000, 0x10000) { for my $c (0x00, 0x01, 0x21) { $_ = $b + $c; printf "%s\tU+%05X\n", chr, $_ } }' | LC_COLLATE=C.UTF-8 sort

        U+00000
𐀀       U+10000
𐀁       U+10001
𐀡       U+10021
        U+00001
!       U+00021
       U+0F000
       U+0F001
       U+0F021

It appears that codepoints above 0xFFFF sort before all others, except that
U+0000 somehow always comes first.
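
To make that observation concrete, here is a small Python sketch of a sort key
that reproduces the order seen in the two outputs above. The function name
observed_key and its three-way ranking are only my description of the
symptom (an assumption for illustration), not how glibc actually computes
collation weights.

```python
def observed_key(s):
    """Sort key modelling the order observed in the report (hypothetical).

    Rank 0: U+0000, rank 1: codepoints above 0xFFFF, rank 2: everything else.
    Within a rank, codepoints keep their numeric order.
    """
    def rank(cp):
        if cp == 0:
            return 0
        if cp > 0xFFFF:
            return 1
        return 2

    return [(rank(ord(ch)), ord(ch)) for ch in s]


# The nine codepoints generated by the second perl example above.
codepoints = [b + c for b in (0, 0xF000, 0x10000) for c in (0x00, 0x01, 0x21)]
modelled = sorted((chr(cp) for cp in codepoints), key=observed_key)

for s in modelled:
    print("U+%05X" % ord(s))
```

If the model is right, the printed list follows the same sequence as the
"sort" output shown above.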

It's definitely not just the "sort" command that's broken. I first noticed this
issue in a PostgreSQL database that was using the C.UTF-8 locale's collation
order. Given the straightforwardness of the locale file in question
(/usr/share/i18n/locales/C on Ubuntu), it's hard to believe the fault lies
outside glibc. 
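
For anyone who wants to take "sort" out of the picture, the following is a
minimal Python sketch that calls the C library's strcoll directly through
ctypes. It assumes a POSIX-like system where ctypes.CDLL(None) exposes libc;
the helper name strcoll_sorted and the SAMPLE list (built from a few of the
codepoints listed above) are mine, for illustration only.

```python
import ctypes
import functools
import locale

# Assumption: on a POSIX-like system, CDLL(None) gives access to libc symbols.
libc = ctypes.CDLL(None)
libc.strcoll.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.strcoll.restype = ctypes.c_int

# A few first codepoints from the test file, already in codepoint order.
SAMPLE = [chr(cp) for cp in (0x21, 0xFF7D, 0xFF7E, 0xFF93, 0xFFE5,
                             0x1000E, 0x2B6DB)]


def strcoll_sorted(strings, locale_name):
    """Sort strings with libc strcoll under the given LC_COLLATE locale.

    Returns None if the locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strcoll works on NUL-terminated byte strings, so encode as UTF-8.
        encoded = [s.encode("utf-8") for s in strings]
        encoded.sort(key=functools.cmp_to_key(libc.strcoll))
        return [b.decode("utf-8") for b in encoded]
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    for name in ("C", "C.UTF-8"):
        result = strcoll_sorted(SAMPLE, name)
        if result is None:
            print(name, "is not available here")
        else:
            print(name, ["U+%05X" % ord(s[0]) for s in result])
```

With "C" the list should come back in codepoint order. With "C.UTF-8", a
result that differs from codepoint order would point at strcoll rather than
at "sort". Strings containing U+0000 cannot be passed this way, since strcoll
stops at the first NUL byte.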

The above commands were tested on Ubuntu 16.04 with glibc 2.23, but the same
issue has been reproduced on earlier and later versions of glibc (2.19, 2.24,
2.25).

