This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Dear all,

>That's great! Thank you! :-) BTW, I don't know if Andrew agrees with
>the naming of these two encodings, because he clarified the
>distinction between HKSCS-1999 and HKSCS-2001 to the mailing list after
>my posting. I guess calling it "BIG5-HKSCS-1999" is a misnomer.
>In big5-iso.txt, the two are called:
>
>    HKSCS-2001 in ISO/IEC 10646-1:2000
>and
>    HKSCS-2001 in ISO/IEC 10646-2:2001

I am not sure whether you guys are implementing "HKSCS-2001 in ISO/IEC
10646-1:2000" or "HKSCS (released in 1999) in ISO/IEC 10646-1:2000" under
the name "BIG5-HKSCS-1999". As I mentioned in an earlier mail, we would
definitely like to have "HKSCS-2001 in ISO/IEC 10646-1:2000" rather than
"HKSCS (released in 1999) in ISO/IEC 10646-1:2000".

If you guys actually implement "HKSCS-2001 in ISO/IEC 10646-1:2000" under
the name "BIG5-HKSCS-1999", then the name may mislead people into thinking
that "BIG5-HKSCS-1999" only supports the characters defined by the HKSCS
released in 1999.

The names "BIG5-HKSCS-1999" and "BIG5-HKSCS-2001" give no indication of
how HKSCS is mapped to UCS-4. Is it possible to reflect the way of mapping
in the name?

>So, practically, BIG5-HKSCS is CP950 + HKSCS, with the end result being
>(almost strictly): Big5-1984 < CP950 < Big5-ETen < Big5-HKSCS
>
>Nevertheless, it would be best if we can get a clarification from
>Andrew on this. (Many thanks, Andrew! :-)

I think this understanding is correct. And as Anthony and James have
suggested, implementing the mapping according to CP950 seems more
reasonable, as this would enhance data compatibility between Microsoft
platforms and Linux.

Rgds

From: Anthony Fok <anthony@thizlinux.com> on 2002/05/14 10:56 AM
To: Bruno Haible <bruno@clisp.org>
cc: libc-alpha@sources.redhat.com, James Su <suzhe@turbolinux.com.cn>, Roger So <roger.so@sw-linux.com>, Andrew TC Fung/ITSD/HKSARG@ITSD
Subject: Re: Unicode 3.2 support (6)

On Mon, May 13, 2002 at 01:23:36PM +0200, Bruno Haible wrote:
> Anthony Fok writes:
> > In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function maps
> > to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back to Big5
> > (i.e. quite a few characters have two-to-one mappings in the from_unicode
> > direction). It would be best if these two-to-one mappings in the
> > from_unicode direction were kept in both BIG5-HKSCS-1999 and BIG5-HKSCS-2001.
>
> I agree. For user's convenience it is best if the from_unicode
> direction of both BIG5-HKSCS-1999 and BIG5-HKSCS-2001 is identical.
> In other words, each of the two from_unicode converters will then
> accept Unicode text that has been converted by either one of two
> to_unicode converters.

That's great! Thank you! :-) BTW, I don't know if Andrew agrees with
the naming of these two encodings, because he clarified the
distinction between HKSCS-1999 and HKSCS-2001 to the mailing list after
my posting. I guess calling it "BIG5-HKSCS-1999" is a misnomer.
In big5-iso.txt, the two are called:

    HKSCS-2001 in ISO/IEC 10646-1:2000
and
    HKSCS-2001 in ISO/IEC 10646-2:2001

I wonder what we should call them. :-)

> > Andrew Fung of ITSD explained that to me. big5cmp.txt is mainly for
> > compatibility with old documents using the GCCS (1995) (Government Common
> > Character Set), which predates HKSCS.
> >
> > Andrew replied: ...
> >
> > So, based on Andrew's recommendation, and since GCCS is obsolete,
> > I think we should go with "Option 4" which would in effect normalize
> > documents with GCCS encoding to HKSCS encoding.
>
> Thanks for the explanations. I'm implementing this suggestion in both
> converters.

Great! Many thanks!
:-)

> > such big5hkscs.c tables have already been made by both James and me,
> > so you can use one or the other to save you some time.
>
> I cannot take your converters as-is, because
> 1) they contain redundant private area mappings, for example
>    B5+8140 -> U+EEB8, which is nowhere official.

They are not redundant, as there is only one mapping to and from that area.
They are defined in CP950. And since Big5-ETen and BIG5-HKSCS are both
supposedly supersets of CP950, they should also contain this area.
(i.e. it would be best if this area is added to glibc's BIG5 table too.
I should bring up a discussion on CLE.) :-)

This area (B5+8140 - B5+84FE) is part of UDA3 (B5+8140 - B5+8DFE) in
HKSCS-2001. True, 8140-84FE is reserved for end users and will not be
used by future extensions of HKSCS-2001, and thus it is important that
there is still a mapping from 8140-84FE to an area in Unicode's PUA.

The BIG5-HKSCS table only lists what it adds on top of "Big5". The ITSD
does not define "Big5" explicitly, but we can safely assume it to be
CP950, because the ITSD has provided the first implementations of HKSCS
on Windows, which contain the 8140-84FE mapping. Thus, it is important
for the HKSCS table in glibc to do the same.

So, practically, BIG5-HKSCS is CP950 + HKSCS, with the end result being
(almost strictly): Big5-1984 < CP950 < Big5-ETen < Big5-HKSCS

Nevertheless, it would be best if we can get a clarification from
Andrew on this. (Many thanks, Andrew! :-)

> 2) they are apparently based on CP950, not BIG5. For example they map
>    B5+A1C5 to U+02CD, but according to page 108 of e_hkscs.pdf the
>    character U+02CD is not part of BIG5-HKSCS.

Try this:

    printf '\241\305' | iconv -f big5 -t ucs2 | hexdump

This also gives U+02CD. This means it is not extraneous, as it is
in the Big5 encoding submitted by the CLE too. ;-)

> 3) they contain additional stuff, like B5+8C40 -> U+503B, which
>    is not found in page 109 of e_hkscs.pdf.

They are in the official big5-iso.txt (HKSCS-2001 version) provided on
the HKSCS-2001 web site, and it is on page 3-116 (or, in Acrobat Reader,
page 176 of 287) of e_hkscs.pdf. :-)

> > Or, for that matter, what is the CHARMAP for exactly? (I just know
> > that there is an ISO Technical Report final draft (14632 or
> > something like that?) about this and other locale stuff.
>
> The CHARMAP serves three purposes:
> 1) Association between Unicode values and byte sequences, used when a
>    locale is built by localedef.
> 2) Documentation for the end users (that's why we have these long
>    character names in every charmap).
> 3) Verification of the corresponding iconv converter. Deviations are
>    partially noted as %IRREVERSIBLE% in the charmap, partially in a file
>    named iconvdata/$CODESET.irreversible.

Thank you very much for your explanations! :-) BTW, does glibc's
CHARMAP strictly follow DTR 14652 (eventually TR 14652 and ISO 14652)?
Are there any glibc-specific extensions, etc.? :-)

    http://std.dkuug.dk/jtc1/sc22/wg20/docs/n897-14652w25.pdf

Thanks,

Best regards,

Anthony

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

(See attached file: att1.eml)
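
To illustrate the two-to-one from_unicode mappings and the "irreversible" deviations mentioned in the message above, here is a minimal sketch. It is not glibc's implementation and does not read real charmap files. The tables and the helper names (`TO_UNICODE`, `FROM_UNICODE`, `irreversible`, `round_trips`) are hypothetical, and the code points are placeholders that are not taken from big5-iso.txt.

```python
from collections import defaultdict

# Placeholder tables for illustration only. The values are NOT taken from
# big5-iso.txt or from any glibc table.
TO_UNICODE = {
    0x8840: 0xE000,   # hypothetical Big5 code -> hypothetical PUA code point
    0xA440: 0x4E00,   # hypothetical Big5 code -> hypothetical BMP code point
}

FROM_UNICODE = {
    0xE000: 0x8840,    # PUA form, exactly what TO_UNICODE produces
    0x20000: 0x8840,   # hypothetical CJK Ext. B form of the same character
    0x4E00: 0xA440,
}


def irreversible(to_unicode, from_unicode):
    """Map each Big5 code to the Unicode values that from_unicode accepts
    but that to_unicode would never produce for that code."""
    extra = defaultdict(list)
    for ucs, b5 in sorted(from_unicode.items()):
        if to_unicode.get(b5) != ucs:
            extra[b5].append(ucs)
    return dict(extra)


def round_trips(to_unicode, from_unicode):
    """True if every Big5 code survives to_unicode followed by from_unicode."""
    return all(from_unicode.get(ucs) == b5 for b5, ucs in to_unicode.items())


if __name__ == "__main__":
    for b5, values in irreversible(TO_UNICODE, FROM_UNICODE).items():
        joined = ", ".join(f"U+{u:04X}" for u in values)
        print(f"B5+{b5:04X} is also reachable from: {joined}")
    print("Big5 -> Unicode -> Big5 round trip holds:",
          round_trips(TO_UNICODE, FROM_UNICODE))
```

With tables shaped like this, every Big5 code still round-trips, while the extra Unicode inputs accepted only in the from_unicode direction are the kind of entries that would be recorded as irreversible.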
Attachment: att1.eml
Description: Binary data
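
The user-defined area discussed in the thread (B5+8140 through B5+84FE, with B5+8140 -> U+EEB8) can be sketched as a simple linear mapping into the PUA. This is a sketch under stated assumptions, not the glibc table: only the B5+8140 -> U+EEB8 pair comes from the message. The rest assumes the usual Big5 trail-byte layout (0x40-0x7E and 0xA1-0xFE, 157 cells per lead byte) and a contiguous PUA block, as in CP950. The function names are hypothetical.

```python
# Assumption: Big5 trail bytes are 0x40-0x7E (63 cells) and 0xA1-0xFE (94 cells).
CELLS_PER_ROW = 63 + 94          # 157 code positions per lead byte
UDA_FIRST_LEAD = 0x81            # B5+8140 is the first code in the range
UDA_LAST_LEAD = 0x84             # B5+84FE is the last code in the range
PUA_BASE = 0xEEB8                # from the thread: B5+8140 -> U+EEB8


def trail_index(trail):
    """Position of a trail byte within its row (0..156)."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    raise ValueError(f"invalid Big5 trail byte 0x{trail:02X}")


def uda_to_pua(code):
    """Map a code in B5+8140..B5+84FE to a PUA code point, assuming the
    block is laid out contiguously starting at U+EEB8."""
    lead, trail = code >> 8, code & 0xFF
    if not UDA_FIRST_LEAD <= lead <= UDA_LAST_LEAD:
        raise ValueError(f"B5+{code:04X} is outside B5+8140..B5+84FE")
    return PUA_BASE + (lead - UDA_FIRST_LEAD) * CELLS_PER_ROW + trail_index(trail)


if __name__ == "__main__":
    for code in (0x8140, 0x817E, 0x81A1, 0x84FE):
        print(f"B5+{code:04X} -> U+{uda_to_pua(code):04X}")
```

A table-driven converter would normally store such a block as a base offset rather than as individual entries, which is why a single pair like B5+8140 -> U+EEB8 is enough to describe the whole range under these assumptions.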