This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.

[Bug libc/20538] New: Update EUC-KR?

From: "jehan.marmottard at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Tue, 30 Aug 2016 16:16:26 +0000
Subject: [Bug libc/20538] New: Update EUC-KR?
Auto-submitted: auto-generated

https://sourceware.org/bugzilla/show_bug.cgi?id=20538

            Bug ID: 20538
           Summary: Update EUC-KR?
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: jehan.marmottard at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Hi,

I have a bunch of files in EUC-KR on which iconv fails with "illegal input
sequence" (also tested with glibc master).
They convert fine as CP949, so my first guess is that the files are actually
Microsoft code page 949 rather than EUC-KR (Unified Hangeul Code/CodePage 949
is said to be a superset of EUC-KR, which would explain why the conversion as
a whole still works).
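
For anyone wanting to reproduce this without the command-line tool, here is a
minimal sketch using the standard iconv(3) API. The helper name to_utf8() and
its return convention are made up for illustration; they are not part of glibc.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `inlen` bytes of `in` from charset `from`
   to UTF-8.  Returns 0 on success, -1 if iconv_open() fails, otherwise
   the errno value set by iconv() (for example EILSEQ).  The number of
   bytes written to `out` is stored in *outlen.  */
static int to_utf8(const char *from, const char *in, size_t inlen,
                   char *out, size_t outcap, size_t *outlen)
{
    assert(from && in && out && outlen);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t) -1)
        return -1;

    char *inptr = (char *) in;  /* iconv() takes a non-const char ** */
    char *outptr = out;
    size_t inleft = inlen, outleft = outcap;
    int err = 0;

    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
        err = errno;

    iconv_close(cd);
    *outlen = outcap - outleft;
    return err;
}
```

With the behaviour described above, I would expect
to_utf8("EUC-KR", "\x89\xc2", 2, ...) to return EILSEQ and the same call with
"CP949" to return 0, but the result depends on the iconv implementation and
on which converters are installed.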

But I can also find various literature suggesting that glibc's iconv
implementation may not be up to date.
As an example, iconv stopped at the character '됀' (Unicode U+B400), encoded
as 0x89c2. I can see that euckr_from_ucs4() would just let the first byte pass
through, so obviously it breaks right after:

>    if (ch <= 0x9f)
>      ++inptr;

And clearly the rest of the code does not handle these two bytes either. But
according to some references, EUC-KR should actually be able to encode this
character.
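
To make the byte ranges concrete, here is a rough sketch of how the pair
0x89 0xc2 falls outside the 94x94 KS X 1001 plane (both bytes in 0xa1..0xfe)
that strict EUC-KR is usually described as using, and inside the extra region
that CP949 adds. The function names are hypothetical and the CP949 ranges are
the commonly documented ones, not something I took from the glibc sources; the
checks only look at ranges, not at whether a code point is actually assigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if both bytes lie in 0xa1..0xfe, i.e. in the 94x94 KS X 1001
   plane used for two-byte characters in strict EUC-KR.  */
static bool in_ksx1001_plane(uint8_t b1, uint8_t b2)
{
    return b1 >= 0xa1 && b1 <= 0xfe && b2 >= 0xa1 && b2 <= 0xfe;
}

/* True if b2 is in one of the trail-byte ranges CP949 (UHC) uses:
   0x41..0x5a, 0x61..0x7a or 0x81..0xfe.  */
static bool is_uhc_trail(uint8_t b2)
{
    return (b2 >= 0x41 && b2 <= 0x5a)
        || (b2 >= 0x61 && b2 <= 0x7a)
        || (b2 >= 0x81 && b2 <= 0xfe);
}

/* True if the pair lies in the CP949 extension region: lead byte in
   0x81..0xc6, a valid UHC trail byte, and not in the KS X 1001 plane.  */
static bool in_uhc_extension(uint8_t b1, uint8_t b2)
{
    return b1 >= 0x81 && b1 <= 0xc6
        && is_uhc_trail(b2)
        && !in_ksx1001_plane(b1, b2);
}

/* The byte pair from this report: outside KS X 1001, inside the
   CP949 extension region.  */
void check_reported_bytes(void)
{
    assert(!in_ksx1001_plane(0x89, 0xc2));
    assert(in_uhc_extension(0x89, 0xc2));
}
```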

* The WHATWG describes an EUC-KR decoding algorithm quite different from
glibc's iconv implementation: https://encoding.spec.whatwg.org/#euc-kr
And this character is in the list:
https://encoding.spec.whatwg.org/index-euc-kr.txt

* I also found this Unicode mapping, apparently a 1992 revision of KSC5601:
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
It also lists this character, with the same coding as the WHATWG.
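
As I read the WHATWG decoder, a lead byte in 0x81..0xfe followed by a byte in
0x41..0xfe is turned into an index pointer, which is then looked up in
index-euc-kr.txt. A small sketch of just the pointer computation (the function
name is mine, and the index table itself is not reproduced here):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the pointer computation from the WHATWG EUC-KR decoder.
   Returns the index pointer, or -1 when the pair is outside the byte
   ranges the decoder accepts.  Looking the pointer up in
   index-euc-kr.txt is left out.  */
static int euckr_index_pointer(uint8_t lead, uint8_t trail)
{
    if (lead < 0x81 || lead > 0xfe)
        return -1;
    if (trail < 0x41 || trail > 0xfe)
        return -1;
    return (lead - 0x81) * 190 + (trail - 0x41);
}
```

For the bytes in this report, euckr_index_pointer(0x89, 0xc2) gives
8 * 190 + 129 = 1649, so the pair is accepted by that algorithm instead of
being rejected at the first byte.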

Now I am a little lost, since I cannot find a single official reference
specification for EUC-KR. All official listings cite RFC 1557
(https://tools.ietf.org/html/rfc1557), which gives no real details
about the EUC-KR encoding. So I cannot tell for sure whether EUC-KR (de)coding
in glibc is right, and all the texts I can find about this encoding are
extremely messy and incomplete.

Could you shed some light on this issue please?
If it turns out that the EUC-KR algorithm in glibc should be updated, I would
be happy to write the patch. I'd appreciate a pointer to the right
specification to follow, though. :-)
Thanks.

