This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: Unicode 3.2 support (6)


Hi,
I wrote a big5hkscs.c a long time ago. Maybe you want to look at it. You 
can get it at http://www.turbolinux.com.cn/~suzhe/big5hkscs.c.gz

I also think it is better to use CP950 as the base of BIG5-HKSCS. Most 
users use CP950 rather than plain BIG5, because Microsoft Windows 
uses CP950. Since CP950 is a superset of BIG5, it should be fine to 
replace BIG5 with CP950.

Regards
James Su

Bruno Haible wrote:

>Anthony Fok writes:
>
>>In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function maps
>>to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back to Big5
>>(i.e. quite a few characters have two-to-one mappings in the from_unicode
>>direction).  It would be best if these two-to-one mappings in the
>>from_unicode direction were kept in both BIG5-HKSCS-1999 and BIG5-HKSCS-2001.
>>
>
>I agree. For the user's convenience it is best if the from_unicode
>direction of both BIG5-HKSCS-1999 and BIG5-HKSCS-2001 is identical.
>In other words, each of the two from_unicode converters will then
>accept Unicode text that has been converted by either one of the two
>to_unicode converters.
>
>>Andrew Fung of ITSD explained that to me.  big5cmp.txt is mainly for
>>compatibility with old documents using the GCCS (Government Common
>>Character Set, 1995), which predates HKSCS.
>>
>>Here was my question:
>>
>>    Andrew, I was reading the HKSCS Standard in more detail, and I was
>>    wondering how the ITSD would like vendors to handle Annex I, i.e.
>>    support for compatibility code points in GCCS but not in HKSCS,
>>    especially the "unified" ones.  How important is it to handle these
>>    unified characters in the BIG5-HKSCS <-> Unicode tables?  (Mandatory?
>>    Recommended?  Suggested?)
>>
>>    For example, B5+ADC5 and the B5+FA5F variant, where a small portion
>>    is written slightly differently, but ISO 10646 classifies these two as
>>    "different glyphs but the same character":
>>
>>           ,-------------------------+-----------------.
>>           |   Big5         ^ ADC5   |  EUDC  ^ FA5F   |
>>           `----------------|--------+--------|--------'
>>                      HKSCS |            GCCS |
>>                            |                 |
>>           ,----------------|--------+--------|--------.
>>           |   Unicode CJK  v U+5029 |  PUA   v U+E01F |
>>           `-------------------------+-----------------'
>>
>>    And which of the following would be the preferred behaviour in a
>>    BIG5-HKSCS <-> Unicode table?
>>
>>     1. "Do nothing".  Keep the FA5F <-> U+E01F mapping in both directions.
>>        (For GCCS, at least there won't be data loss during conversion, but
>>        the GCCS document won't be changed to a HKSCS one either.)
>>
>>     2. FA5F -> U+5029 (unidirectional).
>>
>>     3. U+E01F -> ADC5 (unidirectional).
>>
>>     4. Both 2 and 3.  (B5+FA5F -> U+5029, U+E01F -> B5+ADC5)
>>
>>Andrew replied: ...
>>
>>So, based on Andrew's recommendation, and since GCCS is obsolete,
>>I think we should go with "Option 4" which would in effect normalize
>>documents with GCCS encoding to HKSCS encoding.
>>
>
>Thanks for the explanations. I'm implementing this suggestion in both
>converters.
>
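
A minimal sketch of what "Option 4" amounts to, using only the two code points from the example above. The table names and the normalize() helper are hypothetical and chosen for illustration; they are not taken from any of the big5hkscs.c files discussed in this thread.

```python
# Hypothetical miniature tables covering only the example from this thread.
# A real converter would hold the full BIG5-HKSCS repertoire.

TO_UNICODE = {
    0xADC5: 0x5029,  # HKSCS character -> CJK unified ideograph
    0xFA5F: 0x5029,  # GCCS compatibility point, unified with ADC5 (option 2)
}

FROM_UNICODE = {
    0x5029: 0xADC5,  # the regular reverse mapping
    0xE01F: 0xADC5,  # old PUA assignment of the GCCS point (option 3)
}


def to_unicode(b5):
    """Map a two-byte Big5 code (as an integer) to a Unicode code point."""
    return TO_UNICODE[b5]


def from_unicode(ucs):
    """Map a Unicode code point back to a two-byte Big5 code."""
    return FROM_UNICODE[ucs]


def normalize(b5):
    """Round-trip a Big5 code through Unicode; GCCS codes come back as HKSCS."""
    return from_unicode(to_unicode(b5))


if __name__ == "__main__":
    for code in (0xADC5, 0xFA5F):
        print(f"B5+{code:04X} -> U+{to_unicode(code):04X} -> B5+{normalize(code):04X}")
```

With both unidirectional entries in place, a GCCS document converted to Unicode and back ends up in HKSCS encoding, and Unicode text that still carries the old PUA value U+E01F is accepted as well.
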
>>Such big5hkscs.c tables have already been made by both James and me,
>>so you can use one or the other to save you some time.
>>
>
>I cannot take your converters as-is, because
>1) they contain redundant private area mappings, for example
>   B5+8140 -> U+EEB8, which is not official anywhere.
>2) they are apparently based on CP950, not BIG5. For example they map
>   B5+A1C5 to U+02CD, but according to page 108 of e_hkscs.pdf the
>   character U+02CD is not part of BIG5-HKSCS.
>3) they contain additional stuff, like B5+8C40 -> U+503B, which
>   is not found on page 109 of e_hkscs.pdf.
>
>>Or, for that matter, what is the CHARMAP for exactly?  (I just know
>>that there is an ISO Technical Report final draft (14632 or
>>something like that?) about this and other locale stuff.)
>>
>
>The CHARMAP serves three purposes:
>1) Association between Unicode values and byte sequences, used when a
>locale is built by localedef.
>2) Documentation for the end users (that's why we have these long
>character names in every charmap).
>3) Verification of the corresponding iconv converter. Deviations are
>partially noted as %IRREVERSIBLE% in the charmap, partially in a file
>named iconvdata/$CODESET.irreversible.
>
>Bruno
>
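
The third purpose (verification) can be illustrated with a small sketch. It assumes charmap lines of the form `<Uxxxx>  /xNN/xNN  NAME`, optionally prefixed with %IRREVERSIBLE%, and it assumes that a marked line only needs to hold in the bytes-to-Unicode direction. The sample lines, parse() and check() are hypothetical and are not the actual glibc test scripts.

```python
import re

# Assumed line shape: optional %IRREVERSIBLE% marker, <Uxxxx>, then /xNN bytes.
LINE = re.compile(r"^(%IRREVERSIBLE%)?<U([0-9A-F]{4,8})>\s+((?:/x[0-9a-f]{2})+)")

# Hypothetical sample built from the code points discussed in this thread.
SAMPLE = """\
<U5029>     /xad/xc5     <CJK>
%IRREVERSIBLE%<U5029>     /xfa/x5f     <CJK>
"""


def parse(text):
    """Return (code point, byte sequence, irreversible flag) for each mapping line."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # comments, headers, and anything else are skipped
        ucs = int(m.group(2), 16)
        raw = bytes(int(h, 16) for h in m.group(3).split("/x")[1:])
        entries.append((ucs, raw, m.group(1) is not None))
    return entries


def check(entries, decode, encode):
    """Compare a converter (decode/encode callables) against charmap entries."""
    problems = []
    for ucs, raw, irreversible in entries:
        if decode(raw) != ucs:
            problems.append(("decode", ucs, raw))
        elif not irreversible and encode(ucs) != raw:
            # Only unmarked entries are expected to round-trip.
            problems.append(("encode", ucs, raw))
    return problems
```

Here decode and encode stand in for the two directions of an iconv converter; an empty result from check() means the converter agrees with the charmap, with marked lines exempt from the reverse check.
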



