This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/19575] Status of GB18030 tables
- From: "carlos at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Mon, 08 Feb 2016 14:47:00 +0000
- Subject: [Bug localedata/19575] Status of GB18030 tables
- Auto-submitted: auto-generated
- References: <bug-19575-131 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=19575
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |carlos at redhat dot com
--- Comment #6 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Andreas Schwab from comment #5)
> ICU says that these characters are not roundtrip mappings.
>
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
>
> Most likely the characters did not exist yet in Unicode in 2012.
They are not roundtrip, I assume, because those are PUA code points.
The GB 18030-2005 standard still uses some PUA code points for some ideograms.
The glibc non-PUA code point usage (which differs from the published standard)
is correct for GB 18030-2005 compliance. In Unicode 4.1 or newer, these PUA
code points have non-PUA equivalents that can be used instead. It is highly
recommended that the Unicode 4.1 code points be used by anyone mapping
GB 18030-2005 to UTF-8; this is best practice as documented in "CJKV
Processing" by Dr. Ken Lunde.
The ICU implementation complies with the old GB 18030-2000 standard, and does
not use the newer Unicode 4.1 equivalent code points. My opinion is that this
is simply a bug in ICU and Emacs; both should be updated.
I feel like we need to install some explanatory patch like this in glibc:
diff --git a/localedata/charmaps/GB18030 b/localedata/charmaps/GB18030
index 863a123..c48276e 100644
--- a/localedata/charmaps/GB18030
+++ b/localedata/charmaps/GB18030
@@ -57234,6 +57234,12 @@ CHARMAP
<UE78A> /xa6/xbe <Private Use>
<UE78B> /xa6/xbf <Private Use>
<UE78C> /xa6/xc0 <Private Use>
+% The newest GB 18030-2005 standard still uses some private use area
+% code points. Any implementation which has Unicode 4.1 or newer
+% support should not use these PUA code points, and instead should
+% map these entries to their equivalent non-PUA code points which
+% in this case map from <UFE10> to <UFE19>. This recommendation is
+% based on "CJKV Processing" by Dr. Ken Lunde.
% <UE78D> /xa6/xd9 <Private Use>
% <UE78E> /xa6/xda <Private Use>
% <UE78F> /xa6/xdb <Private Use>
@@ -62997,6 +63003,10 @@ CHARMAP
<UFE0D> /x84/x31/x82/x33 VARIATION SELECTOR-14
<UFE0E> /x84/x31/x82/x34 VARIATION SELECTOR-15
<UFE0F> /x84/x31/x82/x35 VARIATION SELECTOR-16
+% The code points from <UFE10> to <UFE19> are an adjustment
+% of the GB 18030-2005 standard to account for the fact that
+% with Unicode 4.1 support we can now correctly represent those
+% entries, which in the standard, used PUA code points.
<UFE10> /xa6/xd9 PRESENTATION FORM FOR VERTICAL COMMA
<UFE11> /xa6/xdb PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
<UFE12> /xa6/xda PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
I've reached out to Dr. Ken Lunde to clarify if this is correct.
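For anyone holding text that was decoded with the older PUA mappings, the substitution described above could be sketched roughly as follows. This is only an illustration, not glibc or ICU code: the names PUA_TO_UNICODE_41 and replace_pua are hypothetical, and the table lists only the three pairs that can be read directly off the charmap entries quoted in this message (the same byte sequence appearing for both a PUA and a non-PUA code point). The remaining entries up to <UFE19> are deliberately left out and would need to be taken from the full charmap.

```python
# Hypothetical helper: swap GB 18030 PUA code points for their Unicode 4.1
# equivalents. Only the three pairs visible in the quoted charmap lines are
# listed here; the table is intentionally incomplete.
PUA_TO_UNICODE_41 = {
    0xE78D: 0xFE10,  # bytes A6 D9: PRESENTATION FORM FOR VERTICAL COMMA
    0xE78F: 0xFE11,  # bytes A6 DB: PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
    0xE78E: 0xFE12,  # bytes A6 DA: PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
}


def replace_pua(text: str) -> str:
    """Return text with the listed PUA code points replaced.

    Characters that are not in the table are passed through unchanged.
    """
    return text.translate(PUA_TO_UNICODE_41)


if __name__ == "__main__":
    sample = "\ue78d\ue78e\ue78f"
    print([f"U+{ord(ch):04X}" for ch in replace_pua(sample)])
```

Doing the substitution on already-decoded text sidesteps the question of which revision of the standard a given converter implements, which is the very point under discussion here.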