This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: Unicode 3.2 support (6)


Anthony Fok writes:

> In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function maps
> to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back to Big5
> (i.e. quite a few characters have two-to-one mappings in the from_unicode
> direction).  It would be best if these two-to-one mappings in the
> from_unicode direction were kept in both BIG5-HKSCS-1999 and
> BIG5-HKSCS-2001.

I agree. For the user's convenience it is best if the from_unicode
direction of BIG5-HKSCS-1999 and that of BIG5-HKSCS-2001 are identical.
In other words, each of the two from_unicode converters will then
accept Unicode text that has been produced by either one of the two
to_unicode converters.

> Andrew Fung of ITSD explained that to me.  big5cmp.txt is mainly for
> compatibility with old documents using the GCCS (1995) (Government Common
> Character Set), which predates HKSCS.
> 
> Here was my question:
> 
>     Andrew, I was reading the HKSCS Standard in more detail, and I was
>     wondering how the ITSD would like vendors to handle Annex I, i.e.
>     support for compatibility code points in GCCS but not in HKSCS,
>     especially the "unified" ones.  How important is it to handle these
>     unified characters in the BIG5-HKSCS <-> Unicode tables?  (Mandatory?
>     Recommended?  Suggested?)
> 
>     For example, take B5+ADC5 and the B5+FA5F variant, in which a small
>     portion is written slightly differently; ISO 10646 classifies these
>     two as "different glyphs but the same character":
> 
>            ,-------------------------+-----------------.
>            |   Big5         ^ ADC5   |  EUDC  ^ FA5F   |
>            `----------------|--------+--------|--------'
>                       HKSCS |            GCCS |
>                             |                 |
>            ,----------------|--------+--------|--------.
>            |   Unicode CJK  v U+5029 |  PUA   v U+E01F |
>            `-------------------------+-----------------'
> 
>     And which of the following would be the preferred behaviour in a
>     BIG5-HKSCS <-> Unicode table?
> 
>      1. "Do nothing".  Keep the FA5F <-> U+E01F mapping in both directions.
>         (For GCCS, at least there won't be data loss during conversion, but
>         the GCCS document won't be changed to a HKSCS one either.)
> 
>      2. FA5F -> U+5029 (unidirectional).
> 
>      3. U+E01F -> ADC5 (unidirectional).
> 
>      4. Both 2 and 3.  (B5+FA5F -> U+5029, U+E01F -> B5+ADC5)
> 
> Andrew replied: ...
> 
> So, based on Andrew's recommendation, and since GCCS is obsolete,
> I think we should go with "Option 4" which would in effect normalize
> documents with GCCS encoding to HKSCS encoding.

Thanks for the explanations. I'm implementing this suggestion in both
converters.
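
To make "Option 4" concrete, here is a minimal sketch in Python (not the
actual big5hkscs.c code). The two dictionaries are hypothetical stand-ins
for the converter tables and contain only the single example pair from
this thread; the function names are illustrative as well.

```python
# Illustrative tables only: they hold just the example pair discussed
# above, not the real BIG5-HKSCS mapping data.
TO_UNICODE = {
    0xADC5: 0x5029,  # HKSCS code point -> unified CJK ideograph
    0xFA5F: 0x5029,  # GCCS compatibility code point (option 2)
}

FROM_UNICODE = {
    0x5029: 0xADC5,  # regular reverse mapping
    0xE01F: 0xADC5,  # old private-use value still accepted (option 3)
}


def to_unicode(b5):
    """Map a two-byte Big5 code (as an integer) to a Unicode code point."""
    return TO_UNICODE[b5]


def from_unicode(ucs):
    """Map a Unicode code point back to a two-byte Big5 code."""
    return FROM_UNICODE[ucs]


def normalize(b5):
    """Round-trip a Big5 code through Unicode."""
    return from_unicode(to_unicode(b5))
```

With such tables, a round trip turns the GCCS code B5+FA5F into the HKSCS
code B5+ADC5, and Unicode text that still contains U+E01F converts to
B5+ADC5 as well. That is the normalization effect described above.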

> Such big5hkscs.c tables have already been made by both James and me,
> so you can use one or the other to save you some time.

I cannot take your converters as-is, because
1) they contain redundant private area mappings, for example
   B5+8140 -> U+EEB8, which is not official anywhere.
2) they are apparently based on CP950, not BIG5. For example they map
   B5+A1C5 to U+02CD, but according to page 108 of e_hkscs.pdf the
   character U+02CD is not part of BIG5-HKSCS.
3) they contain additional mappings, like B5+8C40 -> U+503B, which
   are not found on page 109 of e_hkscs.pdf.

> Or, for that matter, what is the CHARMAP for exactly?  (I just know
> that there is an ISO Technical Report final draft (14632 or
> something like that?) about this and other locale stuff.)

The CHARMAP serves three purposes:
1) Association between Unicode values and byte sequences, used when a
locale is built by localedef.
2) Documentation for the end users (that's why we have these long
character names in every charmap).
3) Verification of the corresponding iconv converter. Deviations are
partially noted as %IRREVERSIBLE% in the charmap, partially in a file
named iconvdata/$CODESET.irreversible.
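
As a rough illustration of purpose 3, the check can be sketched in Python
as follows. This is not the actual glibc test machinery. It assumes
charmap lines of the form "<Uxxxx> /xNN/xNN comment", and it assumes that
a %IRREVERSIBLE% prefix marks a mapping that exists only in the
byte-to-Unicode direction. The sample charmap excerpt and the converter
tables are made up for this example.

```python
import re

# One charmap entry: optional %IRREVERSIBLE% marker, <Uxxxx>, byte sequence.
CHARMAP_LINE = re.compile(
    r"^(%IRREVERSIBLE%)?<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)"
)


def parse_charmap(text):
    """Return (reversible, irreversible) dicts mapping bytes -> code point."""
    reversible, irreversible = {}, {}
    for line in text.splitlines():
        m = CHARMAP_LINE.match(line.strip())
        if not m:
            continue  # skip comments, headers, and blank lines
        ucs = int(m.group(2), 16)
        seq = bytes(int(h, 16) for h in m.group(3).split("/x")[1:])
        (irreversible if m.group(1) else reversible)[seq] = ucs
    return reversible, irreversible


def check(charmap_text, to_unicode, from_unicode):
    """List the (bytes, code point) entries the converter tables disagree with."""
    reversible, irreversible = parse_charmap(charmap_text)
    problems = []
    for seq, ucs in reversible.items():
        # Reversible entries must hold in both directions.
        if to_unicode.get(seq) != ucs or from_unicode.get(ucs) != seq:
            problems.append((seq, ucs))
    for seq, ucs in irreversible.items():
        # Irreversible entries are only checked towards Unicode.
        if to_unicode.get(seq) != ucs:
            problems.append((seq, ucs))
    return problems


# Hypothetical excerpt and converter tables, for illustration only.
SAMPLE = """
<U5029>     /xad/xc5         <CJK>
%IRREVERSIBLE%<U5029>     /xfa/x5f         <CJK>
"""
TO_UNICODE = {b"\xad\xc5": 0x5029, b"\xfa\x5f": 0x5029}
FROM_UNICODE = {0x5029: b"\xad\xc5", 0xE01F: b"\xad\xc5"}
```

Calling check(SAMPLE, TO_UNICODE, FROM_UNICODE) returns an empty list when
the tables agree with the charmap; any entry it returns is a deviation
that would have to be fixed or recorded as irreversible.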

Bruno

