This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/19575] Status of GB18030 tables


https://sourceware.org/bugzilla/show_bug.cgi?id=19575

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #6 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Andreas Schwab from comment #5)
> ICU says that these characters are not roundtrip mappings.
> 
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
> 
> Most likely the characters did not exist yet in Unicode in 2012.

They are not roundtrip, I assume, because those are PUA code points.
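
To make the flags in the quoted listing concrete, here is a rough Python sketch that parses lines of that `<Uxxxx> \xNN\xNN |n` shape and reports whether the code point falls in the BMP Private Use Area (U+E000 to U+F8FF). The names `parse_ucm_line` and `is_bmp_pua` are hypothetical, and reading `|1` as a one-way fallback rather than a roundtrip is my understanding of the ICU mapping-file convention, not something taken from ICU's code.

```python
import re

# Matches lines such as: <UE78D> \xA6\xD9 |1
UCM_LINE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)


def parse_ucm_line(line):
    """Return (code_point, raw_bytes, flag) or None if the line does not match."""
    match = UCM_LINE.search(line)
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
    raw_bytes = bytes(int(pair, 16) for pair in hex_pairs)
    return code_point, raw_bytes, int(match.group(3))


def is_bmp_pua(code_point):
    """True for the BMP Private Use Area, U+E000 through U+F8FF."""
    return 0xE000 <= code_point <= 0xF8FF


if __name__ == "__main__":
    entry = parse_ucm_line(r"<UE78D> \xA6\xD9 |1")
    code_point, raw_bytes, flag = entry
    # Assumption: flag 0 means roundtrip, flag 1 means fallback only.
    print(hex(code_point), raw_bytes, flag, is_bmp_pua(code_point))
```

All ten entries in the quoted listing carry flag 1 and sit inside the PUA range, which is consistent with the explanation above.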

The GB 18030-2005 standard still uses PUA code points for some characters.
The glibc use of non-PUA code points (which differs from the published standard)
is correct for GB 18030-2005 compliance. In Unicode 4.1 or newer, these PUA code
points have non-PUA equivalents. Anyone mapping GB 18030-2005 to UTF-8 is
strongly advised to use the Unicode 4.1 code points; this is the best practice
documented in "CJKV Processing" by Dr. Ken Lunde.
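
As an illustration of that recommendation, the following Python sketch remaps PUA code points to their non-PUA equivalents after text has already been decoded. It only covers the three pairs that can be derived from the ICU listing and the glibc charmap entries quoted in this comment; the table and function names are hypothetical, and the remaining seven pairs are deliberately left out rather than guessed.

```python
# PUA code point -> Unicode 4.1 code point, derived by matching byte
# sequences between the quoted ICU listing and the glibc charmap lines.
# Assumption: only these three pairs are confirmed by the text above.
PUA_TO_UNICODE41 = {
    0xE78D: 0xFE10,  # bytes A6 D9, PRESENTATION FORM FOR VERTICAL COMMA
    0xE78E: 0xFE12,  # bytes A6 DA, ... VERTICAL IDEOGRAPHIC FULL STOP
    0xE78F: 0xFE11,  # bytes A6 DB, ... VERTICAL IDEOGRAPHIC COMMA
}


def remap_pua(text):
    """Replace known PUA code points with their non-PUA equivalents."""
    # str.translate accepts a dict mapping ordinals to ordinals.
    return text.translate(PUA_TO_UNICODE41)


if __name__ == "__main__":
    decoded = "a\ue78db\ue78ec\ue78f"
    print([hex(ord(ch)) for ch in remap_pua(decoded)])
```

Doing the remap on decoded text keeps the sketch independent of how any particular converter handles these byte sequences.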

The ICU implementation complies with the old GB 18030-2000 standard and does
not use the newer Unicode 4.1 equivalent code points. In my opinion this is
simply a bug in ICU and Emacs; both should be updated.

I feel like we need to install some explanatory patch like this in glibc:

diff --git a/localedata/charmaps/GB18030 b/localedata/charmaps/GB18030
index 863a123..c48276e 100644
--- a/localedata/charmaps/GB18030
+++ b/localedata/charmaps/GB18030
@@ -57234,6 +57234,12 @@ CHARMAP
 <UE78A>     /xa6/xbe         <Private Use>
 <UE78B>     /xa6/xbf         <Private Use>
 <UE78C>     /xa6/xc0         <Private Use>
+% The newest GB 18030-2005 standard still uses some private use area
+% code points.  Any implementation which has Unicode 4.1 or newer
+% support should not use these PUA code points, and instead should
+% map these entries to their equivalent non-PUA code points, which
+% in this case range from <UFE10> to <UFE19>.  This recommendation is
+% based on "CJKV Processing" by Dr. Ken Lunde.
 % <UE78D>     /xa6/xd9         <Private Use>
 % <UE78E>     /xa6/xda         <Private Use>
 % <UE78F>     /xa6/xdb         <Private Use>
@@ -62997,6 +63003,10 @@ CHARMAP
 <UFE0D>     /x84/x31/x82/x33 VARIATION SELECTOR-14
 <UFE0E>     /x84/x31/x82/x34 VARIATION SELECTOR-15
 <UFE0F>     /x84/x31/x82/x35 VARIATION SELECTOR-16
+% The code points from <UFE10> to <UFE19> are an adjustment
+% of the GB 18030-2005 standard to account for the fact that
+% with Unicode 4.1 support we can now correctly represent those
+% entries, which in the standard used PUA code points.
 <UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
 <UFE11>     /xa6/xdb         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
 <UFE12>     /xa6/xda         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
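
For anyone who wants to check a charmap file against the comments in the patch, a minimal Python sketch for reading entries of the form shown in the diff might look like this. It assumes `%` is the comment character and `/x` the byte escape, as in the diff lines; `parse_charmap_line` is a hypothetical helper, not part of glibc.

```python
import re

# Matches active entries such as:
# <UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
CHARMAP_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)\s+(.*)$"
)


def parse_charmap_line(line, comment_char="%"):
    """Return (code_point, raw_bytes, name), or None for comments and non-entries."""
    stripped = line.strip()
    if not stripped or stripped.startswith(comment_char):
        return None  # commented-out entries are treated as inactive
    match = CHARMAP_LINE.match(stripped)
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    raw_bytes = bytes.fromhex(match.group(2).replace("/x", ""))
    return code_point, raw_bytes, match.group(3).strip()


if __name__ == "__main__":
    lines = [
        "% <UE78D>     /xa6/xd9         <Private Use>",
        "<UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA",
    ]
    for line in lines:
        print(parse_charmap_line(line))
```

With the two sample lines, the commented-out PUA entry is skipped and only the `<UFE10>` entry is returned for the bytes `A6 D9`, which is the behaviour the patch comments describe.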

I've reached out to Dr. Ken Lunde to clarify if this is correct.

