This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Re[2]: Unicode 3.2 support (6)

From: Andrew TC Fung <atcfung at itsd dot gov dot hk>
To: Anthony Fok <anthony at thizlinux dot com>, Bruno Haible <haible at ilog dot fr>
Cc: libc-alpha at sources dot redhat dot com, James Su <suzhe at turbolinux dot com dot cn>, Roger So <roger dot so at sw-linux dot com>
Date: Sat, 20 Apr 2002 12:31:22 +0800
Subject: Re[2]: Unicode 3.2 support (6)

Guys,

>> > or perhaps make two versions of "BIG5-HKSCS" in glibc:
>> > say "BIG5-HKSCS-1999" which maps BIG5-HKSCS to ISO 10646-1:2000+PUA,

I would say ITSD wants:
"Big5-HKSCS-2001" <--> ISO 10646-1:2000 + PUA
  (as an interim, before the rest of the system is ready for
  ISO 10646-2:2001); and
"Big5-HKSCS-2001" <--> ISO 10646-1:2000 + ISO 10646-2:2001 + PUA
  (when the rest of the system supports ISO 10646-2:2001.  Note: about 35
  HKSCS-2001 characters are still put in the PUA, as they are included in
  neither ISO 10646-1:2000 nor ISO 10646-2:2001.)

Yours,
Andrew Fung, APII ITSD

From: Anthony Fok <anthony@thizlinux.com> on 2002/04/19 09:52 AM
To: Bruno Haible <haible@ilog.fr>
cc: libc-alpha@sources.redhat.com, James Su <suzhe@turbolinux.com.cn>,
      Roger So <roger.so@sw-linux.com>, Andrew TC Fung/ITSD/HKSARG@ITSD
Subject: Re: Unicode 3.2 support (6)

On Thu, Apr 18, 2002 at 07:48:44PM +0200, Bruno Haible wrote:
> > So, in the interim, please consider using the following scheme for the
> > default BIG5-HKSCS charmap/converter:
> >
> >     BIG5-HKSCS --> ISO 10646-1:2000 + PUA
> >
> >     PUA + ISO 10646-1:2000 \___\  BIG5-HKSCS
> >           ISO 10646-2:2001 /   /
>
> This is not a migration plan. Real migration would be to convert like
> this:
>
>     BIG5-HKSCS --> ISO 10646-1:2000 + ISO 10646-2:2001
>
>     PUA + ISO 10646-1:2000 \___\  BIG5-HKSCS
>           ISO 10646-2:2001 /   /

Yes, you're right, of course.  :-)  The "ISO 10646-1:2000 + PUA" mapping is
the old semantics.  In 2003 or 2004, we can probably safely switch to
"ISO 10646-1:2000 + ISO 10646-2:2001".  I hope the other components of a
GNU/Linux system will be ready by then.  :-)

> > or perhaps make two versions of "BIG5-HKSCS" in glibc:
> > say "BIG5-HKSCS-1999" which maps BIG5-HKSCS to ISO 10646-1:2000+PUA,
>
> That sounds reasonable. I will provide a patch that adds
> BIG5-HKSCS-1999 with the old semantics, for use by people who have not
> upgraded their fonts to use the non-BMP planes.

Thanks for your help.  :-)  BTW, such big5hkscs.c tables have already been
made by both James and me, so you can use one or the other to save yourself
some time.  The CHARMAP stuff will probably need help from you, though.  :-)
We are not yet sure how glibc handles <Unassigned> and %IRREVERSIBLE entries
in CHARMAP files.  :-)

In our big5hkscs.c tables ("BIG5-HKSCS-1999"), the to_unicode function maps
to BMP+PUA, whereas from_unicode maps from BMP+PUA+CJK_ExtB back to Big5
(i.e. quite a few characters have two-to-one mappings in the from_unicode
direction).  It would be best if these two-to-one mappings in the
from_unicode direction were kept in BIG5-HKSCS-2001 as well.
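
A minimal sketch of the two-to-one arrangement described above, written in
Python rather than as glibc converter code.  The three code values are
placeholders picked only so that they fall in the right ranges (an HKSCS-area
Big5 code, a PUA code point, a CJK Extension B code point); they are
assumptions for illustration, not entries from James's or my tables.

```python
# Placeholder values, NOT taken from any real HKSCS mapping table.
BIG5_CODE = 0x8840   # hypothetical Big5-HKSCS code
PUA_CP = 0xF000      # hypothetical Private Use Area code point (old semantics)
EXTB_CP = 0x20000    # hypothetical CJK Extension B code point (new semantics)

# to_unicode keeps the old semantics: the Big5 code decodes to BMP+PUA.
TO_UNICODE = {BIG5_CODE: PUA_CP}

# from_unicode accepts both the PUA and the Extension B code point,
# so two Unicode code points map back to one Big5 code.
FROM_UNICODE = {PUA_CP: BIG5_CODE, EXTB_CP: BIG5_CODE}


def to_unicode(big5_code):
    """Map a Big5-HKSCS code to a Unicode code point (BMP+PUA only)."""
    return TO_UNICODE[big5_code]


def from_unicode(code_point):
    """Map a Unicode code point (BMP, PUA or Extension B) back to Big5-HKSCS."""
    return FROM_UNICODE[code_point]
```

With tables shaped like this, text that already uses the Extension B code
point can still be written out as Big5-HKSCS, while text decoded from
Big5-HKSCS only ever contains code points that HKSCS-1999 fonts cover.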

(About HKSCS-2001 fonts, well, some major font vendors are still in the
process of making them, so most fonts on the market only conform to
HKSCS-1999 so far.  :-)

> > There is another intricacy with BIG5-HKSCS with unified characters,
> > in big5cmp.txt.  If you like, please take a look at:
> >
> >  http://www.thizlinux.com/~anthony/hkscs/
>
> I don't understand what this big5cmp.txt means for the converters. Can
> you explain in more detail, please?

Andrew Fung of ITSD explained that to me.  big5cmp.txt is mainly for
compatibility with old documents using GCCS (Government Common Character
Set, 1995), which predates HKSCS.

Here was my question:

    Andrew, I was reading the HKSCS Standard in more detail, and I was
    wondering how the ITSD would like vendors to handle Annex I, i.e.
    support for compatibility code points in GCCS but not in HKSCS,
    especially the "unified" ones.  How important is it to handle these
    unified characters in the BIG5-HKSCS <-> Unicode tables?  (Mandatory?
    Recommended?  Suggested?)

    For example, take B5+ADC5 and its variant B5+FA5F, where a small portion
    is written slightly differently, but which ISO 10646 classifies as
    "different glyphs but the same character":

           ,-------------------------+-----------------.
           |   Big5         ^ ADC5   |  EUDC  ^ FA5F   |
           `----------------|--------+--------|--------'
                      HKSCS |            GCCS |
                            |                 |
           ,----------------|--------+--------|--------.
           |   Unicode CJK  v U+5029 |  PUA   v U+E01F |
           `-------------------------+-----------------'

    And which of the following would be the preferred behaviour in a
    BIG5-HKSCS <-> Unicode table?

     1. "Do nothing".  Keep the FA5F <-> U+E01F mapping in both directions.
        (For GCCS, at least there won't be data loss during conversion, but
        the GCCS document won't be changed to a HKSCS one either.)

     2. FA5F -> U+5029 (unidirectional).

     3. U+E01F -> ADC5 (unidirectional).

     4. Both 2 and 3.  (B5+FA5F -> U+5029, U+E01F -> B5+ADC5)

Andrew replied:

   According to our HKSCS Document, compatibility points (CPs) are code
   points reserved for backward compatibility.  In other words, we simply
   require users/vendors not to use these CPs to define characters.  Also,
   fonts should contain the glyphs for the CPs, for displaying old documents
   that may contain CPs.

   In the HKSCS Document, we described CPs by explaining several cases
   where CPs exist.  However, we do not impose an explicit requirement on
   how CPs in Big-5 should be mapped to ISO 10646 or vice versa.

   Nevertheless, based on the descriptions of CPs in the HKSCS Document,
   vendors should be able to decide how to implement the mapping between
   Big-5 and ISO 10646.  For example, a code converter between Big-5 and
   ISO 10646 should map CPs in Big-5 to the "correct" ISO 10646 code point
   and vice versa.  In other words, option 4 in your mail should be
   implemented.  However, if round-trip conversion is important for the
   application, then option 1 in your mail should be implemented.
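
The trade-off between option 1 and option 4 can be sketched with the
ADC5/FA5F example from the question.  The code values are the ones quoted in
this thread; the table layout and the round_trips helper are assumptions made
for illustration and do not reflect how big5hkscs.c is organised.

```python
# Regular mapping shared by both options: B5+ADC5 <-> U+5029.
BASE_TO_UNICODE = {0xADC5: 0x5029}
BASE_FROM_UNICODE = {0x5029: 0xADC5}

# Option 1: keep B5+FA5F <-> U+E01F in both directions.
OPTION_1 = (
    {**BASE_TO_UNICODE, 0xFA5F: 0xE01F},    # to_unicode
    {**BASE_FROM_UNICODE, 0xE01F: 0xFA5F},  # from_unicode
)

# Option 4: B5+FA5F -> U+5029 and U+E01F -> B5+ADC5, both unidirectional.
OPTION_4 = (
    {**BASE_TO_UNICODE, 0xFA5F: 0x5029},    # to_unicode
    {**BASE_FROM_UNICODE, 0xE01F: 0xADC5},  # from_unicode
)


def round_trips(to_unicode, from_unicode, big5_code):
    """Return True if Big5 -> Unicode -> Big5 gives back the same code."""
    return from_unicode[to_unicode[big5_code]] == big5_code
```

Under option 1 the compatibility point B5+FA5F survives a round trip
unchanged; under option 4 it comes back as B5+ADC5, which is the
normalisation from GCCS to HKSCS, at the cost of round-trip fidelity for that
one code.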

So, based on Andrew's recommendation, and since GCCS is obsolete,
I think we should go with "Option 4", which would in effect normalize
documents with GCCS encoding to HKSCS encoding.  This is already
implemented in both James's and my big5hkscs.c.  A question that we both
had was: how do we reflect that in the CHARMAP?   :-)  Or, for that matter,
what is the CHARMAP for, exactly?  (I just know that there is an ISO
Technical Report final draft (14632 or something like that?) about this and
other locale stuff.)

Cheers,

Anthony

--
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>
http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/
