This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Re: New GB18030 gconv module contributed by ThizLinux Laboratory

From: Yu Shao <yshao at redhat dot com>
To: Anthony Fok <anthony at thizlinux dot com>
Cc: Ulrich Drepper <drepper at redhat dot com>, libc-alpha at sources dot redhat dot com, kevin at thizlinux dot com, fai at thizlinux dot com, sunnygu at thizgroup dot com, suzhe at gnuchina dot org
Date: Thu, 17 Jan 2002 22:35:12 +1000
Subject: Re: New GB18030 gconv module contributed by ThizLinux Laboratory
References: <20020116074546.GA17279@sunrise> <m36661h44z.fsf@myware.mynet> <20020117100203.GB23149@sunrise>

 >Not sure if it is a problem with /usr/bin/iconv or GB18030.so:
 >when I tried your module, both old and new, on the Chinese sample
 >documents:
 >
 >	$ iconv -f gb18030 -t ucs2 four.txt
 >	iconv: illegal input sequence at position 32
 >	$ iconv -f gb18030 -t ucs2 wei.txt
 >	iconv: illegal input sequence at position 0
 >	$ iconv -f gb18030 -t ucs2 zang.txt
 >	iconv: illegal input sequence at position 0
 >	$ iconv -f gb18030 -t ucs2 wei.txt
 >	iconv: illegal input sequence at position 0
 >	$ iconv -f gb18030 -t ucs2 yi.txt
 >	iconv: illegal input sequence at position 0
 >
 >If the first line is trimmed, the illegal sequence appears at 27420 for
 >four.txt, etc.  It appears to me that your tables only cover the bare
 >minimum required by the Chinese Standards Committee, but this is not
 >quite right.  GB18030 is supposed to be like UTF-8: it is an encoding
 >that covers the entire repertoire of ISO-10646-1 while remaining
 >compatible with GB2312 and GBK.  It should be able to convert to and from
 >all Unicode codepoints, i.e. U+0000..U+D7FF, U+E000..U+FFFF,
 >U+10000..U+10FFFF.
 >
The character at position 32 of four.txt is 0x8139EE38, which maps to
Unicode 0x33FF. If you look at the Unicode table, 0x33FF is an
undefined, invalid value. Actually, the same is true of the other four
test files. Converting a GB code like 0x8139EE38 to a nonexistent
Unicode code point really means nothing.
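
A minimal sketch of how one might inspect the bytes at the offset that
iconv reports, assuming the usual GB18030 four-byte layout (bytes 1 and 3
in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39). The helper names
is_four_byte and linear_index are hypothetical, and the sample buffer
only stands in for the real four.txt:

```python
def is_four_byte(seq: bytes) -> bool:
    """True if seq has the shape of a GB18030 four-byte sequence."""
    return (
        len(seq) == 4
        and 0x81 <= seq[0] <= 0xFE
        and 0x30 <= seq[1] <= 0x39
        and 0x81 <= seq[2] <= 0xFE
        and 0x30 <= seq[3] <= 0x39
    )


def linear_index(seq: bytes) -> int:
    """Position of a four-byte sequence counted from 0x81308130."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Stand-in for the real file: 32 filler bytes, then the sequence in question.
# With the actual file this would be: data = open("four.txt", "rb").read()
data = b"A" * 32 + bytes.fromhex("8139EE38")

offset = 32                      # position reported by iconv
seq = data[offset:offset + 4]
print(seq.hex().upper(), is_four_byte(seq), linear_index(seq))
```

The linear index alone does not give the Unicode value; that still needs
the range table from the standard. It only confirms that the bytes at the
reported offset form a well-shaped four-byte sequence.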

The new GB18030-2000 standard only uses Unicode up to 0xFFFF. Do you have
the latest copy of the standard? I think it is better to base all
GB18030 work on the new standard.

 >
 >Ah, I see why now.  There are lots of 0x0000 and "\x00\x00" in your code
 >which you put in to keep the filesize of the resulting GB18030.so to a
 >minimum.  If you intend to use AbiWord as the editor for Chinese testing
 >agency to test GB18030 support, the problem of this approach will be
 >apparent.  When the current AbiWord sees a character it cannot find the
 >UCS2 from the system iconv, it simply throws it away without even
 >leaving an
 >empty space as a spaceholder.  Thus, the on-screen display and printout
 >would be different from what appears on Windows 2000/XP+GB18030 which
 >the agency use as a guideline.  Granted, this is partially AbiWord's
 >fault,
 >but nevertheless, I think glibc should be able to do better
 >than just the minimum compliance.
 >
For your AbiWord problem, I would suggest you update the test files, for
example by deleting all the invalid values, or simply make AbiWord more
robust.

Regards,

Yu Shao

Follow-Ups:
- Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
  - From: Anthony Fok

References:
- New GB18030 gconv module contributed by ThizLinux Laboratory
  - From: Anthony Fok
- Re: New GB18030 gconv module contributed by ThizLinux Laboratory
  - From: Ulrich Drepper
- Re: New GB18030 gconv module contributed by ThizLinux Laboratory
  - From: Anthony Fok
