This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?


https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #9 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Mike Frysinger from comment #8)

> > • 0 is for combining characters and NUL only
> 
> that is incorrect.  you mishandle Prepended_Concatenation_Mark (see bug
> 22070), and ignore Format Character (Cf) characters which are all 0 (or
> you're incorrectly claiming that Cf's are not combining characters).  and

OK, sorry about that. But xterm handles even those as combining
characters: it basically overlays their glyph (which may be blank or just
the dotted square) on the preceding character, as they have no meaning
for a terminal.


> > • compatibility with previous/older/other wcwidth() implementations, most
> > importantly
> 
> appealing to historical wcwidth behavior isn't a great argument.  ones

But this is more important than you make it sound.

> written to older Unicode standards

Sure, which is why I updated it to use the current Unicode data
as its base, but a few cases were explicitly handled differently
right from the start, and, with the changes I described,
mfabian’s code in glibc and mine in MirBSD come to the same
result modulo implementation differences.

(I also handle Prepended_Concatenation_Mark in MirBSD now in the
way you requested in bz#22070, so compatibility goes both ways.
My focus was on updating mgk25’s code in a compatible way, so as
not to introduce any regressions; changes following later Unicode
revisions are welcome, as are fixes for initial oversights such as
this one (if it existed back then), but as I said, U+00AD was
special-cased right from the beginning.)
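
To make the rule under discussion concrete, here is a minimal sketch of a width function that returns 0 for NUL, combining marks and format (Cf) characters, but special-cases U+00AD to 1. It is an illustration only, not the code in glibc, MirBSD or mgk25's implementation; the name `sketch_wcwidth` is hypothetical, the category sets are a simplification, and Prepended_Concatenation_Mark is not handled at all.

```python
import unicodedata


def sketch_wcwidth(cp: int) -> int:
    """Hypothetical column width for code point cp (illustration only)."""
    if cp == 0:
        return 0                      # NUL occupies no column
    if cp == 0x00AD:
        return 1                      # soft hyphen: the special case argued for here
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me", "Cf"):
        return 0                      # combining marks and format characters
    if cat == "Cc":
        return -1                     # other control characters: not printable
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth characters
    return 1


if __name__ == "__main__":
    for cp in (0x0041, 0x00AD, 0x0301, 0x200B, 0x4E2D):
        print(f"U+{cp:04X} -> {sketch_wcwidth(cp)}")
```

Without the explicit U+00AD branch, the soft hyphen would fall into the Cf case and get width 0, which is exactly the point of disagreement in this bug.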

> > • The char should be avoided already *anyway*
> > • Terminal emulators never implement wrapping at a “possible soft hyphen”,
> > only at the end of the line
> 
> then by your own argument, having it follow the Unicode standard is a

There is no Unicode standard for wcwidth().

> non-issue

It is an issue, because with 0, applications displaying a simple charmap
for the first page (i.e. latin1) fail on X'AD'.
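
A rough sketch of that failure mode, assuming a charmap viewer that prints each latin1 character followed by one space and expects every cell to take two columns. The function and width tables here are hypothetical and only model the column arithmetic, not any real application.

```python
def row_columns(codepoints, width_of):
    """Columns used by one charmap row: each cell is the glyph plus one space."""
    return sum(width_of(cp) + 1 for cp in codepoints)


def width_all_one(cp):
    # Assumption: every latin1 printable character takes one column.
    return 1


def width_soft_hyphen_zero(cp):
    # Assumption: same as above, except U+00AD is reported as width 0.
    return 0 if cp == 0x00AD else 1


if __name__ == "__main__":
    row = range(0xA0, 0xB0)  # the charmap row containing X'AD'
    print(row_columns(row, width_all_one))
    print(row_columns(row, width_soft_hyphen_zero))
```

With width 1 everywhere the row spans 32 columns; with U+00AD at width 0 it spans 31, so every cell after X'AD' shifts left by one and the grid no longer lines up.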

> if your terminal and the target application disagree about encoding then
> you've already lost.  everything above 0x7F will be wrong (0x80 != U+0080 or
> 0xc2 0x80).

You did not understand what I wrote.

Tools like GNU screen and XFree86® luit can convert between the encodings,
so they’d convert an \xA0 from the program (meaning 0xA0 in latin1) to
a U+00A0 internally, then to a \xC2\xA0 in UTF-8 for the screen, and back.

The *definition* of these mappings maps 0xAD from latin1 to U+00AD, not to
U+002D. (Changing _this_ would also be unwise as there’d be no way to type
latin1 0xAD any more.)
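
The round trip can be sketched with Python's built-in codecs. This only illustrates the latin1 to Unicode to UTF-8 mapping; it says nothing about how GNU screen or luit implement it, and the helper names are made up for this example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Program output in latin1 -> code points -> UTF-8 for the terminal."""
    return data.decode("latin-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Terminal input in UTF-8 -> code points -> latin1 for the program."""
    return data.decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    for byte in (0xA0, 0xAD):
        raw = bytes([byte])
        cp = ord(raw.decode("latin-1"))
        utf8 = latin1_to_utf8(raw)
        # latin1 byte N maps to code point U+00NN, so 0xAD becomes U+00AD.
        print(f"0x{byte:02X} -> U+{cp:04X} -> {utf8.hex(' ')}")
        assert utf8_to_latin1(utf8) == raw  # the mapping is reversible
```

Because latin1 maps byte for byte onto U+0000..U+00FF, 0xAD lands on U+00AD and survives the trip back; mapping it to U+002D instead would lose that reversibility.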

Therefore, wcwidth(U+00AD) should stay at 1.

PS: Discussing this is really straining for me, and English is only my third
non-programming language, so please read anything weird as I mean it, not as I
formulated it.

