This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?


https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #2 from Troy Korjuslommi <tjk at tksoft dot com> ---
I reached a totally different conclusion from reading those links and
thinking about what wcwidth() should return for SHY.

When writing a curses/terminfo (terminal) application, one determines
the width of text by iterating over the input characters and summing
their widths. If a word contains multiple U+00AD characters, whether or
not it falls at the end of a line, the total width of the word comes out
wrong if wcwidth(U+00AD) returns 1. Therefore wcwidth(U+00AD) should
return 0.
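
To make the arithmetic concrete, here is a minimal sketch of that
summing loop. It does not call the real wcwidth(); text_width and its
shy_width parameter are hypothetical names used only for illustration,
and every character other than U+00AD is assumed to occupy one column
(true for the plain Latin letters below, not in general).

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def text_width(text, shy_width):
    """Sum per-character column widths, the way a terminal application might.

    shy_width stands in for whatever wcwidth(U+00AD) returns.
    Assumption: every other character is exactly one column wide.
    """
    return sum(shy_width if ch == SHY else 1 for ch in text)


# "hyphenation" (11 letters) with two hyphenation hints inside it.
word = "hy" + SHY + "phen" + SHY + "ation"

width_if_zero = text_width(word, 0)  # hints take no columns
width_if_one = text_width(word, 1)   # each hint is counted as a column
```

With a width of 0 the word measures 11 columns, matching what is drawn
when no break occurs; with a width of 1 it measures 13, two columns more
than the text actually occupies.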

Also, using a SHY (U+00AD) character as a rendering hint seems to make
sense: if a word is broken up with SHY characters, a SHY-aware
application can determine where to break the word, adding a visible
hyphen only at that position. An application that is not SHY-aware can
just ignore the SHY.
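
A rough sketch of the two behaviours, under the same simplifying
assumption that every visible character is one column wide. The function
names break_at_shy and strip_shy are made up for this example and do not
come from any library.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_shy(word):
    """What a non-SHY-aware application can do: drop the hints entirely."""
    return word.replace(SHY, "")


def break_at_shy(word, columns):
    """Break word at the last hint where head plus a visible '-' fits.

    Returns (head, tail). head is "" when no break point fits, in which
    case tail is the untouched word. Assumes one column per character.
    """
    parts = word.split(SHY)
    best = 0
    for i in range(1, len(parts)):
        head_len = sum(len(p) for p in parts[:i])
        if head_len + 1 <= columns:  # +1 for the visible hyphen
            best = i
    if best == 0:
        return "", word
    head = "".join(parts[:best]) + "-"
    tail = SHY.join(parts[best:])  # keep remaining hints for later lines
    return head, tail


word = "hy" + SHY + "phen" + SHY + "ation"
```

For example, with 7 columns left on the line, break_at_shy(word, 7)
yields "hyphen-" and "ation"; the hyphen becomes visible only at the
chosen break, and the unused hint before it is dropped.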

The Korpela article shed light on the confusion standards writers have
had with the issue. It seems clear to me that their intention was to
add a character which can be used as a hint for breaking words
according to hyphenation rules. The imprecise wording used to describe
the solution has led to the current confusion. We should get past the
semantics of the standards' phrases and focus on the intent, which is
to allow authors to add hyphenation hints to text.


Troy




On Sun, 2017-09-03 at 20:42 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22073
> 
>             Bug ID: 22073
>            Summary: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
>            Product: glibc
>            Version: 2.26
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: vapier at gentoo dot org
>                 CC: egmont at gmail dot com, libc-locales at sourceware dot org,
>                     maiku.fabian at gmail dot com, tg at mirbsd dot de
>         Depends on: 21750
>   Target Milestone: ---
> 
> +++ This bug was initially created as a clone of Bug #21750 +++
> 
> I’ve compared the new autogenerated column width from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences (and bugs on my (MirBSD, which uses something based on xterm’s data
> system-wide) side, which I fixed).
> 
> U+00AD is forced to width 1 in xterm, autodetected as combining in glibc
> 
> Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
> when displayed as 8bit on terminals, had no combining characters at all.
> 
> Change Request to glibc: force U+00AD to width 1.
> 
> more background discussion with different standards can be found here:
>   https://www.cs.tut.fi/~jkorpela/shy.html
> 
> 
> Referenced Bugs:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth

-- 
You are receiving this mail because:
You are on the CC list for the bug.
