This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
- From: "mh-sourceware at glandium dot org" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Sun, 29 Jan 2017 01:13:26 +0000
- Subject: [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=21091
Bug ID: 21091
Summary: Unexpected collation in ja_JP.UTF-8 probably due to
unsupported blocks
Product: glibc
Version: 2.24
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: localedata
Assignee: unassigned at sourceware dot org
Reporter: mh-sourceware at glandium dot org
CC: libc-locales at sourceware dot org
Target Milestone: ---
I was scripting around a subset of the data in
https://github.com/cjkvi/cjkvi-ids/blob/master/ids.txt and ended up running
things like sort | uniq -d, both of which use strcoll.
My system locale is ja_JP.UTF-8, and that led to surprising results. I realize
that my intent was not actually to follow collation rules, but that still left
me wondering whether collation was right in glibc.
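To illustrate why a collation that treats distinct characters as equal
surprises a sort | uniq -d pipeline, here is a minimal Python sketch. It only
approximates the pipeline: repeated_lines and codepoint_compare are
hypothetical helper names, not part of coreutils or glibc, and the locale
branch assumes ja_JP.UTF-8 is installed (it is skipped otherwise).

```python
import functools
import locale


def codepoint_compare(a, b):
    """Compare two strings by code point order (what LC_ALL=C would give for UTF-8)."""
    return (a > b) - (a < b)


def repeated_lines(lines, compare):
    """Roughly mimic `sort | uniq -d`: one line per run of lines that compare equal."""
    ordered = sorted(lines, key=functools.cmp_to_key(compare))
    repeated = []
    run_length = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if compare(prev, cur) == 0:
            run_length += 1
            if run_length == 2:
                # Report the run once, at the point it first repeats.
                repeated.append(prev)
        else:
            run_length = 1
    return repeated


if __name__ == "__main__":
    # Two distinct characters from CJK Unified Ideographs Extension A,
    # with the first one appearing twice.
    sample = ["\u3400", "\u3401", "\u3400"]
    print("code point order:", repeated_lines(sample, codepoint_compare))
    try:
        locale.setlocale(locale.LC_COLLATE, "ja_JP.UTF-8")
    except locale.Error:
        print("ja_JP.UTF-8 is not installed; skipping the strcoll comparison")
    else:
        print("locale collation:", repeated_lines(sample, locale.strcoll))
```

If the collation considers every line in the input equal, the whole input
collapses into a single "duplicate" run, which is the kind of result that
prompted this report.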
So I created an LD_PRELOAD library that redirects strcoll to ICU's ucol_strcoll
and compared the outputs. They were very different.
So I dug further, and found that:
All characters in CJK Unified Ideographs Extension A are considered equal.
All characters in CJK Unified Ideographs Extension B are considered equal.
All characters in CJK Unified Ideographs Extension C are considered equal.
All characters in CJK Unified Ideographs Extension D are considered equal.
All characters in CJK Unified Ideographs Extension E are considered equal.
All characters in CJK Radicals Supplement are considered equal.
All characters in Kangxi Radicals are considered equal.
All characters in CJK Strokes are considered equal.
All characters in Enclosed CJK Letters and Months are considered equal.
All characters in CJK Compatibility are considered equal.
All characters in CJK Compatibility Ideographs are considered equal.
All characters in CJK Compatibility Forms are considered equal.
All characters in Enclosed Ideographic Supplement are considered equal.
All characters in CJK Compatibility Ideographs Supplement are considered equal.
More than that, all the characters in the blocks above with code points below
0x10000 are considered equal to each other, and all the characters in the
blocks above with code points above 0x10000 are considered equal to each other.
All in all, it would seem that all unsupported characters in the BMP are equal,
and all unsupported characters in the other Unicode planes are equal.
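The observation above can be probed with a short Python sketch that calls
strcoll through the locale module. The helper names (all_collate_equal, probe)
and the sample code points are illustrative choices, not something taken from
glibc; the result depends on the installed glibc and locale data, so no
expected output is shown.

```python
import locale


def all_collate_equal(chars, compare):
    """Return True if every string in chars compares equal to the first one."""
    first = chars[0]
    return all(compare(first, c) == 0 for c in chars[1:])


# Sample code points below 0x10000: two from CJK Unified Ideographs
# Extension A (starts at U+3400) and one from Kangxi Radicals (starts at U+2F00).
BMP_SAMPLE = [chr(0x3400), chr(0x3401), chr(0x2F00)]

# Sample code points above 0x10000: two from CJK Unified Ideographs
# Extension B (starts at U+20000).
SUPPLEMENTARY_SAMPLE = [chr(0x20000), chr(0x20001)]


def probe(locale_name="ja_JP.UTF-8"):
    """Check both samples under locale_name; return None if it is not installed."""
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    return {
        "bmp": all_collate_equal(BMP_SAMPLE, locale.strcoll),
        "supplementary": all_collate_equal(SUPPLEMENTARY_SAMPLE, locale.strcoll),
    }


if __name__ == "__main__":
    print(probe())
```

A value of True for either key means the sampled characters are
indistinguishable to strcoll in that locale.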
With new Unicode versions adding new characters, it seems to me it would be
better if unsupported characters were considered different as a general rule.
Obviously, it would be better if the above blocks were supported.
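As an illustration of what "considered different" could mean in practice, the
following Python sketch wraps a comparison function so that ties are broken by
code point order. This is only a caller-side sketch under that assumption; it
is not how glibc or ICU implement collation, and with_codepoint_tiebreak is a
hypothetical name.

```python
import functools


def with_codepoint_tiebreak(compare):
    """Wrap compare so that strings it reports as equal fall back to code point order."""
    def wrapped(a, b):
        result = compare(a, b)
        if result != 0:
            return result
        # The collation could not tell the strings apart: order them by
        # code point so that distinct strings never compare equal.
        return (a > b) - (a < b)
    return wrapped


if __name__ == "__main__":
    # Stand-in for a collation that treats every string as equal.
    def everything_equal(a, b):
        return 0

    compare = with_codepoint_tiebreak(everything_equal)
    data = ["\u3401", "\u3400", "\U00020000"]
    print(sorted(data, key=functools.cmp_to_key(compare)))
```

With such a wrapper, only identical strings compare equal, so a step like
uniq -d would no longer merge distinct unsupported characters.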
(This is with libc 2.24-9 from Debian)