This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/21750] column width of characters incompatible with classical wcwidth


https://sourceware.org/bugzilla/show_bug.cgi?id=21750

--- Comment #1 from Troy Korjuslommi <tjk at tksoft dot com> ---
Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
i.e. zero columns wide? An app that breaks words at end-of-line can use
soft hyphens as hints for where a break is allowed, and then add a
visible hyphen at the end of the line. (If the app also reads back from
the terminal, it can e.g. ignore a visible hyphen that is preceded by a
soft hyphen, or use some other mechanism to mark that character as being
for terminal display only.)
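
To make the idea concrete, here is a minimal sketch of such a word breaker. It is only an illustration: the name break_word is hypothetical, and it assumes every character other than the soft hyphen occupies exactly one column, which real terminals do not guarantee.

```python
from typing import Tuple

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def break_word(word: str, room: int) -> Tuple[str, str]:
    """Split `word` at the last soft hyphen whose head, plus a visible
    "-", still fits into `room` columns.

    Returns (head_with_hyphen, rest). If no soft hyphen allows a break,
    returns ("", word) so the caller can move the whole word to the
    next line. Assumption: one column per non-SHY character.
    """
    parts = word.split(SHY)
    best = None
    for i in range(1, len(parts)):
        head = "".join(parts[:i])
        # +1 accounts for the visible hyphen added at the line end.
        if len(head) + 1 <= room:
            best = i
    if best is None:
        return "", word
    # Keep the remaining soft hyphens so the rest can be broken again.
    return "".join(parts[:best]) + "-", SHY.join(parts[best:])
```

With this approach the soft hyphen itself never reaches the screen, so its column width would only matter to programs that print it unchanged.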

I am not suggesting a change if xterm and the multitude of other apps
already handle soft hyphens in some other manner; I am just wondering.

Troy



On Tue, 2017-07-11 at 14:18 +0000, tg at mirbsd dot de wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> 
>             Bug ID: 21750
>            Summary: column width of characters incompatible with classical
>                     wcwidth
>            Product: glibc
>            Version: 2.26
>             Status: UNCONFIRMED
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: tg at mirbsd dot de
>                 CC: libc-locales at sourceware dot org
>   Target Milestone: ---
> 
> I’ve compared the new autogenerated column widths from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences, as well as bugs on my side (MirBSD, which uses something based on
> xterm’s data system-wide), which I have fixed.
> 
> 1. U+00AD is forced to width 1 in xterm, autodetected as combining in glibc
> 
> Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
> when displayed as 8bit on terminals, had no combining characters at all.
> 
> Change Request to glibc: force U+00AD to width 1.
> 
> 2. The UCD has three codepoints that are Me/Mn category but not NSM bidi class:
> U+0CBF, U+0CC6, U+11C3F
> 
> This is likely a bug in UCD but can be fixed by glibc treating Me/Mn the same
> as Cf/NSM, which I do.
> 
> Change Request to glibc: handle Me/Mn category the same as NSM bidi class.
> 
> 3. Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they
> combine on top of the preceding initial ones: U+1160‥U+11FF
> 
> Change Request to glibc: force U+1160‥U+11FF to width 0.
> 
> 4. During parsing, EastAsianWidth data overrides UCD data, more specifically
> the NSM property.
> 
> This leads to U+302A‥U+302D and – see also
> https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being
> treated as width 2.
> 
> Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.
> 
> 5. Ambiguous circled numbers and neutral hexagrams changed width
> 
> xterm used to set those to width 2, likely because they are ideographs and not
> unlike zodiac signs and emoji (which, I notice, have been set to width 2 in UCD
> nowadays)
> 
> Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.
> 
> 
> Note: I’ve initially reported the surprising change to Debian as
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but have redone the
> research today (against 2.24 in Debian and git master commit
> 2a91300176a5991d9825eba085e502196a3f47cd in glibc) against Unicode 10,
> double-checked *all* differences against MirBSD code and fixed a few bugs there
> after making it possible to compare the results (considering glibc only puts
> actually assigned codepoints into the localedata/charmaps/UTF-8 file).
> 
> Rationale for requesting the change in glibc is so that all systems I have
> access to use the same width data. This prevents display artifacts and glitches,
> which can go as far as making an editor somewhat unusable with Unicode-heavy
> text (I have test files containing the entire Unicode range). Thank you for
> listening.
> 
> If necessary, I will provide patches (to utf8_gen.py most likely) when asked.
>
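
For readers following the five change requests in the quoted report, the rules can be summarised in a rough sketch. This is neither glibc's utf8_gen.py nor xterm's wcwidth(); the function name sketch_width is hypothetical, and it relies on Python's unicodedata module, whose Unicode version depends on the Python build rather than on Unicode 10.0.0. Control characters and unassigned codepoints are not treated specially here.

```python
import unicodedata


def sketch_width(ch: str) -> int:
    """Column width of one character under the requested rules (sketch)."""
    cp = ord(ch)
    # Request 1: force U+00AD (soft hyphen) to width 1.
    if cp == 0x00AD:
        return 1
    # Request 3: Hangul Jamo medial vowels and final consonants combine.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # Request 5: circled numbers and hexagrams forced to width 2.
    if 0x3248 <= cp <= 0x324F or 0x4DC0 <= cp <= 0x4DFF:
        return 2
    # Requests 2 and 4: Me/Mn/Cf category or NSM bidi class means width 0,
    # and this check runs before East Asian Width so that it wins.
    if (unicodedata.category(ch) in ("Me", "Mn", "Cf")
            or unicodedata.bidirectional(ch) == "NSM"):
        return 0
    # Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1
```

The order of the checks is the point of the sketch: the forced ranges come first, then the combining test, and East Asian Width is consulted last, so U+3099 and U+309A come out as width 0 rather than 2.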

