This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0


https://sourceware.org/bugzilla/show_bug.cgi?id=17588

--- Comment #4 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Here is another one where I have a little bit of doubt left:

changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L   name=LIMBU SUBJOINED LETTER YA

Why is this combining character listed with width 0 in the current UTF-8 file?

In our newly generated UTF-8 file it has width 1 (it is no longer listed in
that file, so it gets the default width).

The comment in the existing UTF-8 file in glibc says:

% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

This does *not* mention combining characters as needing width 0, and
these grep patterns do not match some combining characters.

The combining characters with category=Mn get width 0 because they
also have bidi=NSM, for example:

changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE

but the combining characters with category=Mc are not matched by
the above grep patterns, because they do *not* have bidi=NSM.
That seems correct, considering they have a positive advance width:

Mn     Nonspacing_Mark  a nonspacing combining mark (zero advance width)
Mc     Spacing_Mark     a spacing combining mark (positive advance width)
Me     Enclosing_Mark   an enclosing combining mark

(http://www.unicode.org/reports/tr44)
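
To make the difference between the two examples concrete, here is a minimal
sketch that applies the three UnicodeData.txt grep patterns from the comment
to UnicodeData.txt-style records. It is not the script used to generate the
UTF-8 file. The helper names make_line and width_from_patterns are
hypothetical, the combining class "0" and the remaining fields in the sample
records are placeholders (only code point, name, category and bidi class are
taken from the output quoted above), and the East Asian width rules are
ignored because both characters have eaw=N.

```python
import re

# The three UnicodeData.txt patterns quoted in the glibc comment,
# rewritten as Python regular expressions.
ZERO_WIDTH_PATTERNS = [
    re.compile(r"^[^;]*;[^;]*;[^;]*;[^;]*;NSM;"),  # bidi class is NSM
    re.compile(r"^[^;]*;[^;]*;Cf;"),               # general category is Cf
    re.compile(r"^[^;]*;ZERO WIDTH "),             # name starts with "ZERO WIDTH "
]


def make_line(code, name, category, bidi):
    """Hypothetical helper: build a UnicodeData.txt-style record.

    Only the first five fields matter for the patterns above; the
    combining class "0" and the trailing fields are placeholders.
    """
    return f"{code};{name};{category};0;{bidi};;;;;N;;;;;"


def width_from_patterns(line):
    """Return 0 if any zero-width pattern matches, else the default width 1."""
    return 0 if any(p.search(line) for p in ZERO_WIDTH_PATTERNS) else 1


limbu = make_line("1929", "LIMBU SUBJOINED LETTER YA", "Mc", "L")
buginese = make_line("1A1B", "BUGINESE VOWEL SIGN AE", "Mn", "NSM")

print(width_from_patterns(limbu))     # Mc with bidi=L: no pattern matches
print(width_from_patterns(buginese))  # Mn with bidi=NSM: first pattern matches
```

Under these assumptions the Mc character falls through to the default width,
while the Mn character is caught only because of its bidi class, not because
of its general category.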

But how did these get into the existing UTF-8 file in glibc?

Looks like the existing UTF-8 file in glibc was edited manually
and not just created using the grep patterns in the comment.


