This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/21750] column width of characters incompatible with classical wcwidth


https://sourceware.org/bugzilla/show_bug.cgi?id=21750

--- Comment #2 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Troy Korjuslommi from comment #1)
> Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
> i.e. zero columns? If an app breaks up words at end-of-line, it can use
> the soft hyphens as helpers to detect the correct locations. The app can

Yes, in theory. This codepoint could be used in the *input data* to
mark soft break opportunities. However (see below), applications should
*not* output it to a terminal emulator (GUIs that handle this themselves
are likely fine).

> I am not suggesting a change, if xterm etc. multitude of apps are
> already handling soft hyphens in some other manner, just wondering.

However, as with U+0060 (the grave accent 「`」), terminal emulators have
been treating every (non-control) character of a single-byte character
set (SBCS) as having a constant width of 1, both in ASCII (for U+0060)
and in 8-bit codepages like ISO 8859-1 (for U+00AD), and xterm’s wcwidth()
code had special handling to force U+00AD to 1:

/*
 […]
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
 […]
 */
[…]
  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */

Source:
http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=1.1.103.1;content-type=text%2Fplain
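
The effect of the "-00AD" in the uniset invocation above can be sketched
as follows. This is a minimal illustration, not the xterm code: the name
sketch_width() and the three-entry table are assumptions made for the
example, the table is only a small subset of the zero-width ranges, and
double-width (East Asian) characters are ignored entirely. Because U+00AD
is left out of the zero-width table, it falls through to the default
width of 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range { uint32_t first, last; };

/* Illustrative subset only, NOT the full generated table.
 * U+00AD is deliberately absent, mirroring "-00AD" above. */
static const struct cp_range zero_width[] = {
    { 0x0300, 0x036F },  /* combining diacritical marks (Mn) */
    { 0x1160, 0x11FF },  /* Hangul Jamo medial vowels, final consonants */
    { 0x200B, 0x200F },  /* zero width space, ZWNJ, ZWJ, LRM, RLM */
};

/* Hypothetical helper: column width of one code point.
 * Returns -1 for control characters, 0 for NUL and zero-width
 * code points, 1 for everything else (wide characters are not
 * modelled in this sketch). */
int sketch_width(uint32_t cp)
{
    size_t i;

    assert(cp <= 0x10FFFF);  /* beyond Unicode is a caller bug */

    if (cp == 0)
        return 0;
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return -1;

    for (i = 0; i < sizeof zero_width / sizeof zero_width[0]; i++)
        if (cp >= zero_width[i].first && cp <= zero_width[i].last)
            return 0;

    /* Not in the table, so U+00AD (like U+0060) occupies one cell. */
    return 1;
}
```

With this sketch, sketch_width(0x00AD) and sketch_width(0x0060) both
yield 1, while a combining mark such as U+0301 yields 0.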


So you’d want to output U+0060 U+0008 U+0061 (` + backspace + a) to get à on a
(printed) terminal (or in code that uses such sequences to emulate one).
Similarly, strip soft hyphens from the output (or manifest them as regular
ones) before outputting soft-wrapped text: the terminal emulator will not
soft-wrap either, it will simply break at the end of the line. So when
preparing something for terminal output, you’d convert U+00AD to some kind
of hyphen (hyphen-minus or U+2010 perhaps) followed by a line break(⚠).
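
A minimal sketch of that preprocessing step, assuming the text is valid
UTF-8 (where U+00AD is always the byte pair 0xC2 0xAD). The function name
replace_soft_hyphens() and its interface are made up for this example.
It does not decide where to wrap; it only strips soft hyphens or
manifests them as whatever replacement the caller passes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the UTF-8 string `in` to `out`, replacing every soft hyphen
 * (U+00AD, bytes 0xC2 0xAD) with `repl`.  Pass "" to strip them or
 * "-" to show them as hyphen-minus.  Returns the number of bytes
 * written (excluding the NUL), or (size_t)-1 if `out` is too small.
 * Assumes valid UTF-8 input (hypothetical helper, not from xterm).
 */
size_t replace_soft_hyphens(const char *in, char *out, size_t outsz,
                            const char *repl)
{
    size_t rlen, n = 0;

    assert(in != NULL && out != NULL && repl != NULL);
    if (outsz == 0)
        return (size_t)-1;
    rlen = strlen(repl);

    while (*in != '\0') {
        const unsigned char *p = (const unsigned char *)in;

        if (p[0] == 0xC2 && p[1] == 0xAD) {
            /* soft hyphen: emit the replacement instead */
            if (n + rlen + 1 > outsz)
                return (size_t)-1;
            memcpy(out + n, repl, rlen);
            n += rlen;
            in += 2;
        } else {
            /* any other byte is copied through unchanged */
            if (n + 2 > outsz)
                return (size_t)-1;
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, with the input "co\xC2\xADop", an empty replacement gives
"coop" and a replacement of "-" gives "co-op". A wrapping routine would
apply the hyphen-plus-line-break form only at the soft hyphen where it
actually breaks the line, and strip the rest.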


I noticed the incompatibilities especially when the hexagrams, one of which
I’m using for UI purposes, changed width. I then tried to discover all of
them, in order to harmonise the width assumptions made by the various
programs I have access to, on all systems I use, with classical xterm
wcwidth.c as the base, since those widths are the domain of a fixed-cell
terminal emulator more than of anything else (which can use its own data,
if necessary).

I do volunteer to provide patches, here and elsewhere, so that, with the same
UCD version as input, we get consistent output (and I’ve sanity-checked the
output I got before opening this report).

