This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
[Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0
- From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: libc-locales at sourceware dot org
- Date: Fri, 21 Nov 2014 12:36:35 +0000
- Subject: [Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0
- Auto-submitted: auto-generated
- References: <bug-17588-716 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=17588
--- Comment #4 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Here is another one where I have a little bit of doubt left:
changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L name=LIMBU SUBJOINED LETTER YA
Why is this combining character listed with width 0 in the current UTF-8 file?
In our newly generated UTF-8 file it has width 1 (because it is no longer
listed in the WIDTH section of that file, so the default width of 1 applies).
The comment in the existing UTF-8 file in glibc says:
% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
% "grep '^[^;]*;[WF]' EastAsianWidth.txt"
% and "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
% "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
This does *not* mention combining characters as needing width 0, and
these grep patterns do not match some combining characters.
The combining characters with category=Mn get width 0 because they
also have bidi=NSM, for example:
changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE
but the combining characters with category=Mc are not matched by
the above grep patterns, because they do *not* have bidi=NSM.
That seems correct, considering they have a positive advance width:
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
Mc Spacing_Mark a spacing combining mark (positive advance width)
Me Enclosing_Mark an enclosing combining mark
(http://www.unicode.org/reports/tr44)
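
To make the difference concrete, here is a rough sketch that applies the three
zero-width grep patterns from the comment to two UnicodeData.txt-style lines.
The sample lines are truncated after the bidi field and are reconstructed from
the category/bidi values quoted above; the combining-class field (0) is my
assumption. The names ZERO_WIDTH_PATTERNS and width_from_patterns are made up
for this illustration and are not part of any glibc script. East Asian double
width is ignored here.

```python
import re

# The zero-width grep patterns quoted in the UTF-8 file comment, applied
# to semicolon-separated UnicodeData.txt-style lines.
ZERO_WIDTH_PATTERNS = [
    re.compile(r'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;'),  # bidi class NSM
    re.compile(r'^[^;]*;[^;]*;Cf;'),               # format control characters
    re.compile(r'^[^;]*;ZERO WIDTH '),             # names starting "ZERO WIDTH "
]


def width_from_patterns(line):
    """Return 0 if any zero-width pattern matches, else the default width 1."""
    return 0 if any(p.search(line) for p in ZERO_WIDTH_PATTERNS) else 1


# Truncated sample lines: code;name;category;combining class;bidi;
# (combining class 0 is an assumption, the other fields are quoted above)
SAMPLE_LINES = [
    "1929;LIMBU SUBJOINED LETTER YA;Mc;0;L;",
    "1A1B;BUGINESE VOWEL SIGN AE;Mn;0;NSM;",
]

for line in SAMPLE_LINES:
    code, name = line.split(";")[:2]
    print(f"0x{code.lower()} {name}: width {width_from_patterns(line)}")
```

With these patterns alone, the Mc character falls through to the default
width, while the Mn character is caught only via its bidi=NSM field.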
But how did these get into the existing UTF-8 file in glibc?
Looks like the existing UTF-8 file in glibc was edited manually
and not just created using the grep patterns in the comment.