This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- From: Yu Shao <yshao at redhat dot com>
- To: Anthony Fok <anthony at thizlinux dot com>
- Cc: Ulrich Drepper <drepper at redhat dot com>, libc-alpha at sources dot redhat dot com, kevin at thizlinux dot com, fai at thizlinux dot com, sunnygu at thizgroup dot com, suzhe at gnuchina dot org
- Date: Thu, 17 Jan 2002 22:35:12 +1000
- Subject: Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- References: <20020116074546.GA17279@sunrise> <m36661h44z.fsf@myware.mynet> <20020117100203.GB23149@sunrise>
>
>Not sure if it is a problem with /usr/bin/iconv or GB18030.so:
>when I tried your module, both old and new, on the Chinese sample
>documents:
>
> $ iconv -f gb18030 -t ucs2 four.txt
> iconv: illegal input sequence at position 32
> $ iconv -f gb18030 -t ucs2 wei.txt
> iconv: illegal input sequence at position 0
> $ iconv -f gb18030 -t ucs2 zang.txt
> iconv: illegal input sequence at position 0
> $ iconv -f gb18030 -t ucs2 wei.txt
> iconv: illegal input sequence at position 0
> $ iconv -f gb18030 -t ucs2 yi.txt
> iconv: illegal input sequence at position 0
>
>If the first line is trimmed, the illegal sequence appears at 27420 for
>four.txt, etc. It appears to me that your tables only cover the bare
>minimum required by the Chinese Standards Committee, but this is not
>quite right. GB18030 is supposed to be like UTF-8: it is an encoding
>that covers the entire repertoire of ISO-10646-1 while remaining
>compatible with GB2312 and GBK. It should be able to convert to and from
>all Unicode codepoints, i.e. U+0000..U+D7FF, U+E000..U+FFFF,
>U+10000..U+10FFFF.
>
The character at position 32 of four.txt is 0x8139EE38, whose Unicode
value is 0x33FF. If you look at the Unicode table, 0x33FF is an
undefined, invalid value. Actually, the same is true of the other four
test files. Converting a GB code like 0x8139EE38 to a nonexistent
Unicode character really means nothing.
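As a way to check a claim like this one locally, here is a minimal sketch that decodes a single GB18030 sequence with Python's built-in gb18030 codec and looks the result up in Python's bundled Unicode database. The helper name describe_gb18030 is hypothetical. The answer depends on the Unicode version shipped with the interpreter, so a code point that was unassigned in 2002 may have a name in a later version.

```python
import unicodedata


def describe_gb18030(raw):
    """Decode one GB18030 sequence; return (code point, Unicode name or a marker)."""
    text = raw.decode("gb18030")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    # unicodedata.name() returns the given default for code points
    # that have no name in this Unicode version (e.g. unassigned ones).
    name = unicodedata.name(text, "<no name in this Unicode version>")
    return ord(text), name


# The sequence reported at position 32 of four.txt.
cp, name = describe_gb18030(bytes.fromhex("8139EE38"))
print("U+%04X %s" % (cp, name))
```

This only reports what the local Python build knows; it does not say which code points GB18030-2000 itself requires.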
The new GB18030-2000 standard only uses Unicode up to 0xFFFF. Do you have
the latest standard book? I think it is better to base all the GB18030
work on the new standard.
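For reference, the four-byte arithmetic behind the U+10000..U+10FFFF range mentioned in the quoted message can be sketched as follows. This assumes the usual description of the four-byte form (bytes in 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39, with linear index 189000 corresponding to U+10000). The function names are hypothetical, and the BMP part is left out because it needs a range table.

```python
SUPPLEMENTARY_BASE_INDEX = 189000  # linear index of 0x90308130, i.e. U+10000


def four_byte_index(raw):
    """Return the linear index of a four-byte GB18030 sequence."""
    if len(raw) != 4:
        raise ValueError("expected four bytes")
    b1, b2, b3, b4 = raw
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def supplementary_code_point(raw):
    """Map a four-byte sequence in the supplementary range to its code point."""
    index = four_byte_index(raw)
    cp = 0x10000 + index - SUPPLEMENTARY_BASE_INDEX
    if index < SUPPLEMENTARY_BASE_INDEX or cp > 0x10FFFF:
        raise ValueError("sequence is outside U+10000..U+10FFFF")
    return cp


print(four_byte_index(bytes.fromhex("8139EE38")))                # 12438
print(hex(supplementary_code_point(bytes.fromhex("90308130"))))  # 0x10000
```

Because the supplementary planes are a purely linear mapping, covering them needs no extra table data in the module, only this arithmetic.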
>
>Ah, I see why now. There are lots of 0x0000 and "\x00\x00" in your code
>which you put in to keep the filesize of the resulting GB18030.so to a
>minimum. If you intend to use AbiWord as the editor for Chinese testing
>agency to test GB18030 support, the problem of this approach will be
>apparent. When the current AbiWord sees a character it cannot find the
>UCS2 from the system iconv, it simply throws it away without even
>leaving an empty space as a spaceholder. Thus, the on-screen display
>and printout would be different from what appears on Windows
>2000/XP+GB18030 which the agency use as a guideline. Granted, this is
>partially AbiWord's fault, but nevertheless, I think glibc should be
>able to do better than just the minimum compliance.
>
For your AbiWord problem, I would suggest that you either update the
test files, for example by deleting all the invalid values, or simply
make AbiWord more robust.
Regards,
Yu Shao