This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/21750] New: column width of characters incompatible with classical wcwidth


https://sourceware.org/bugzilla/show_bug.cgi?id=21750

            Bug ID: 21750
           Summary: column width of characters incompatible with classical
                    wcwidth
           Product: glibc
           Version: 2.26
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: tg at mirbsd dot de
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

I’ve compared the new autogenerated column widths from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences. I also found bugs on my side (MirBSD, which uses something based
on xterm’s data system-wide), which I have fixed.

1. U+00AD is forced to width 1 in xterm, autodetected as combining in glibc

The rationale for forcing it to 1 is likely that U+0000‥U+00FF correspond to
Latin-1, which, when displayed as 8-bit text on terminals, had no combining
characters at all.

Change Request to glibc: force U+00AD to width 1.

2. The UCD has three codepoints that have general category Me/Mn but not bidi
class NSM: U+0CBF U+0CC6 U+11C3F

This is likely a bug in the UCD, but glibc can work around it by treating Me/Mn
the same as Cf/NSM, which is what I do.

Change Request to glibc: handle Me/Mn category the same as NSM bidi class.
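
A rough way to look for such codepoints is sketched below. It uses Python's
standard unicodedata module, whose data follows whatever Unicode version the
interpreter ships (not necessarily 10.0.0), so the resulting list may differ
from the three codepoints named above. The function name marks_without_nsm is
made up for this illustration and is not part of utf8_gen.py.

```python
import unicodedata


def marks_without_nsm(limit=0x110000):
    """Return code points below `limit` whose general category is Me or Mn
    but whose bidi class is not NSM, according to Python's bundled UCD."""
    found = []
    for cp in range(limit):
        ch = chr(cp)
        if (unicodedata.category(ch) in ("Me", "Mn")
                and unicodedata.bidirectional(ch) != "NSM"):
            found.append(cp)
    return found


if __name__ == "__main__":
    # The Unicode version is printed because the result depends on it.
    print("Unicode", unicodedata.unidata_version)
    print(" ".join(f"U+{cp:04X}" for cp in marks_without_nsm()))
```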

3. Hangul Jamo medial vowels and final consonants are set to width 0 by xterm
so that they combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

4. During parsing, EastAsianWidth data overrides UCD data, more specifically
the NSM property.

This leads to U+302A‥U+302D and – see also
https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being
treated as width 2.

Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.
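
The effect of the parsing order can be sketched with two small tables. This is
only an illustration of "last writer wins", not the actual logic of
utf8_gen.py; the dictionaries are hand-written excerpts and the merge() helper
is hypothetical.

```python
# Hand-written excerpts for illustration; not parsed from the real data files.
EAW_WIDE = {0x302A: 2, 0x3099: 2, 0x309A: 2}  # East Asian Width says "wide"
NSM_ZERO = {0x302A: 0, 0x3099: 0, 0x309A: 0}  # NSM property says "combining"


def merge(*sources):
    """Apply the sources in order; a later source overrides an earlier one."""
    widths = {}
    for source in sources:
        widths.update(source)
    return widths


# EAW applied last: the combining marks end up with width 2.
eaw_last = merge(NSM_ZERO, EAW_WIDE)
# NSM applied last, as requested: the marks end up with width 0.
nsm_last = merge(EAW_WIDE, NSM_ZERO)

if __name__ == "__main__":
    for cp in sorted(EAW_WIDE):
        print(f"U+{cp:04X} EAW last: {eaw_last[cp]}  NSM last: {nsm_last[cp]}")
```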

5. Ambiguous circled numbers and neutral hexagrams changed width

xterm used to set those to width 2, likely because they are ideographs, not
unlike zodiac signs and emoji (which, I notice, are set to width 2 in the UCD
nowadays).

Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.
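
Taken together, the five change requests above amount to roughly the following
rule set. This is a minimal sketch using Python's unicodedata module, not a
patch to utf8_gen.py and not xterm's implementation. The names FORCED_WIDTHS
and sketch_width are invented here, and the sketch ignores control characters
and unassigned codepoints, which a real wcwidth() has to handle separately.

```python
import unicodedata

# Forced ranges from requests 1, 3 and 5: (first, last, width).
FORCED_WIDTHS = [
    (0x00AD, 0x00AD, 1),  # soft hyphen
    (0x1160, 0x11FF, 0),  # Hangul Jamo medial vowels and final consonants
    (0x3248, 0x324F, 2),  # circled numbers (East Asian Width: ambiguous)
    (0x4DC0, 0x4DFF, 2),  # hexagram symbols (East Asian Width: neutral)
]


def sketch_width(cp):
    """Return an illustrative column width for code point `cp`."""
    # Forced ranges win over everything else.
    for first, last, width in FORCED_WIDTHS:
        if first <= cp <= last:
            return width
    ch = chr(cp)
    # Requests 2 and 4: Me/Mn count like Cf/NSM, and this check runs
    # before East Asian Width so that wide combining marks stay at 0.
    if (unicodedata.category(ch) in ("Me", "Mn", "Cf")
            or unicodedata.bidirectional(ch) == "NSM"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


if __name__ == "__main__":
    for cp in (0x00AD, 0x1160, 0x3099, 0x3248, 0x4DC0, 0x4E00):
        print(f"U+{cp:04X} -> {sketch_width(cp)}")
```

Checking the forced ranges first keeps the exceptions in one small table,
separate from the property-based rules.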


Note: I initially reported the surprising change to Debian as
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but have redone the
research today against Unicode 10, using 2.24 in Debian and git master commit
2a91300176a5991d9825eba085e502196a3f47cd in glibc. I double-checked *all*
differences against the MirBSD code and fixed a few bugs there, after making it
possible to compare the results (considering that glibc only puts actually
assigned codepoints into the localedata/charmaps/UTF-8 file).
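
The comparison described in this note can be approximated as below, assuming
both width tables have already been loaded into dictionaries mapping codepoint
to width. Only codepoints present in both tables are compared, which mirrors
the restriction to assigned codepoints. The function name and the toy tables
are hypothetical; the values are made up for illustration and are not real
glibc or xterm output.

```python
def width_differences(reference, candidate):
    """Yield (codepoint, reference_width, candidate_width) for every code
    point that appears in both tables with differing widths."""
    for cp in sorted(reference.keys() & candidate.keys()):
        if reference[cp] != candidate[cp]:
            yield cp, reference[cp], candidate[cp]


# Toy tables with made-up values. U+0378 stands in for a codepoint that
# only one side lists, so it is skipped by the comparison.
xterm_like = {0x0041: 1, 0x00AD: 1, 0x0378: 1, 0x1160: 0}
glibc_like = {0x0041: 1, 0x00AD: 0, 0x1160: 1}

if __name__ == "__main__":
    for cp, ref, cand in width_differences(xterm_like, glibc_like):
        print(f"U+{cp:04X}: reference {ref}, candidate {cand}")
```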

I am requesting these changes in glibc so that all systems I have access to use
the same width data. That prevents display artifacts and glitches, which can go
as far as making an editor somewhat unusable on Unicode-heavy text (I have test
files containing the entire Unicode range). Thank you for listening.

If necessary, I will provide patches (to utf8_gen.py most likely) when asked.

