This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks


https://sourceware.org/bugzilla/show_bug.cgi?id=21091

            Bug ID: 21091
           Summary: Unexpected collation in ja_JP.UTF-8 probably due to
                    unsupported blocks
           Product: glibc
           Version: 2.24
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: mh-sourceware at glandium dot org
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

I was doing some scripting around a subset of the data in
https://github.com/cjkvi/cjkvi-ids/blob/master/ids.txt. I ended up running
pipelines such as sort | uniq -d, and both sort and uniq use strcoll.

My system locale is ja_JP.UTF-8, and that led to surprising results. I realize
that my intent was actually not to follow collation rules, but that still left
me wondering whether collation was right in glibc.

So I created an LD_PRELOAD library that redirects strcoll to ICU's ucol_strcoll
and compared the outputs. They were very different.

So I dug further, and found that:

All characters in CJK Unified Ideographs Extension A are considered equal.
All characters in CJK Unified Ideographs Extension B are considered equal.
All characters in CJK Unified Ideographs Extension C are considered equal.
All characters in CJK Unified Ideographs Extension D are considered equal.
All characters in CJK Unified Ideographs Extension E are considered equal.
All characters in CJK Radicals Supplement are considered equal.
All characters in Kangxi Radicals are considered equal.
All characters in CJK Strokes are considered equal.
All characters in Enclosed CJK Letters and Months are considered equal.
All characters in CJK Compatibility are considered equal.
All characters in CJK Compatibility Ideographs are considered equal.
All characters in CJK Compatibility Forms are considered equal.
All characters in Enclosed Ideographic Supplement are considered equal.
All characters in CJK Compatibility Ideographs Supplement are considered equal.

More than that, all the characters in the blocks above with code points below
0x10000 are considered equal to each other, and all the characters in the
blocks above with code points above 0x10000 are considered equal to each other.

All in all, it would seem that all unsupported characters in the BMP are equal,
and all unsupported characters in the other Unicode planes are equal.
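
A minimal sketch of how this observation could be checked, written in Python
because its locale.strcoll wraps the C library's collation routine. The helper
name tied_pairs and the sample code points are illustrative assumptions, not
part of the original investigation, and the ja_JP.UTF-8 locale must be
installed for the final call to work.

```python
import itertools
import locale


def tied_pairs(chars, locale_name):
    """Return the pairs drawn from chars that strcoll reports as equal."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return [(a, b) for a, b in itertools.combinations(chars, 2)
                if locale.strcoll(a, b) == 0]
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # Hypothetical sample: U+3400 and U+3401 (Extension A), U+2F00 (Kangxi
    # Radicals), U+20000 and U+20001 (Extension B).
    sample = ["\u3400", "\u3401", "\u2f00", "\U00020000", "\U00020001"]
    try:
        for a, b in tied_pairs(sample, "ja_JP.UTF-8"):
            print(f"U+{ord(a):04X} and U+{ord(b):04X} compare equal")
    except locale.Error:
        print("ja_JP.UTF-8 locale is not installed")
```

Which pairs are printed depends on the installed glibc version and locale
data, so no expected output is given here.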

With new Unicode versions adding new characters, it seems to me it would be
better if unsupported characters were considered different as a general rule.
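
As a rough illustration of that rule from the caller's side, the sketch below
falls back to a code point comparison whenever strcoll reports a tie, so that
distinct strings never compare equal. This is only a workaround in application
code under the assumption that the process locale has been set by the caller;
it is not how glibc behaves, and the function names are hypothetical.

```python
import functools
import locale


def strcoll_with_tiebreak(a, b):
    """Compare by the current locale's collation, then by code point on a tie."""
    result = locale.strcoll(a, b)
    if result != 0:
        return result
    # Python orders str values by code point, which gives a stable fallback.
    return (a > b) - (a < b)


def sort_distinct(lines):
    """Sort lines so that only identical strings end up comparing equal."""
    return sorted(lines, key=functools.cmp_to_key(strcoll_with_tiebreak))
```

With this comparison, two strings are reported equal only when they are
identical, whatever the locale data says about the individual characters.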

Obviously, it would be better if the above blocks were supported.

(This is with libc 2.24-9 from Debian)

