This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/22371] U+FFE2 and U+FFE4, iconv does not convert to HALFWIDTH(EUC-JISX0213)
- From: "carlos at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Thu, 02 Nov 2017 15:06:25 +0000
- Subject: [Bug localedata/22371] U+FFE2 and U+FFE4, iconv does not convert to HALFWIDTH(EUC-JISX0213)
- Auto-submitted: auto-generated
- References: <bug-22371-131@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=22371
--- Comment #6 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Akira Nakajima from comment #5)
> \xa1\xef is mapped to U+00A5 (HALFWIDTH YEN) in EUC-JISX0213 and EUC-JIS-2004
> according to the following URLs.
> The same applies to \xa1\xb1.
>
> "However, with Unicode 3.2.0 the mappings differ in 3 codepoints."
> http://search.cpan.org/~dankogai/Encode-JIS2K-0.03/JIS2K.pm#what_is_JIS_X_0213_anyway?
>
> =============================================
> http://charset.uic.jp/show/eucjisx0213/
> http://x0213.org/codetable/euc-jis-2004-with-char.txt
> char JIS Unicode
>  ̄ 0xA1B1 U+203E # OVERLINE Windows: U+FFE3
> ― 0xA1BD U+2014 # EM DASH Windows: U+2015
> ¥ 0xA1EF U+00A5 # YEN SIGN Windows: U+FFE5
> =============================================
This is not an authoritative source, and I do not believe it is correct. Notice
the comment that on Windows the mapping is U+FFE5, which is the FULLWIDTH YEN.
That is, in my opinion, correct. Windows and Linux therefore have the same
representation, i.e. FULLWIDTH YEN. I suggest discussing this with the author of
the document.
> =============================================
> perl 5.24.3
>
> # perl -e 'use Encode; use Encode::JISX0213; print encode("euc-jisx0213",
> "\x{00a5}");' | od -tx1
> 0000000 a1 ef
> # perl -e 'use Encode; use Encode::JISX0213; print encode("euc-jisx0213",
> "\x{ffe5}");' | od -tx1
> 0000000 a1 ef
> =============================================
This behaviour is expected. You are encoding a Unicode code point into
EUC-JISX0213. There is no representation of YEN, so the output is FULLWIDTH YEN
in both cases, i.e. /xa1/xef.
> But Python and "/usr/local/share/i18n/charmaps/EUC-JISX0213.gz"
> have mapping to U+FFE5.
> I don't know which one is correct.
The mapping to U+FFE5 is correct, as it is on Windows.
> =============================================
> Python 3.6.2
>
> # python3 -c "print(u'\u00a5'.encode('euc-jisx0213'))"
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> UnicodeEncodeError: 'euc_jisx0213' codec can't encode character '\xa5' in
> position 0: illegal multibyte sequence
Correct. There is no YEN in EUC-JISX0213. This Python behaviour is based on
glibc's character map.
> # python3 -c "print(u'\uffe5'.encode('euc-jisx0213'))"
> b'\xa1\xef'
> =============================================
>
> =============================================
> /usr/local/share/i18n/charmaps/EUC-JISX0213.gz (Fedora 26)
>
> <UFFE3> /xa1/xb1 FULLWIDTH MACRON
> <UFFE5> /xa1/xef FULLWIDTH YEN SIGN
> =============================================
These are IMO correct.
In my previous post I referenced *official* Japanese ISO-IR documents, and I
will reference them again.
See page 3 of the PDF, and note that the 8th bit is always set:
https://www.itscj.ipsj.or.jp/iso-ir/228.pdf
You can see that the macron is a FULLWIDTH MACRON, and that for the yen sign the
FULLWIDTH YEN is selected because it provides compatibility with Windows.
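
For anyone who wants to re-check the mappings discussed above in one go, here
is a minimal sketch in Python. It only repeats the quoted python3 tests for the
four code points from the table and adds a decode of \xa1\xef. The helper names
try_encode and describe are made up for this illustration, and the results
depend on the codec tables shipped with your Python, so treat it as a quick
check rather than a reference.

```python
import unicodedata

CODEC = "euc_jisx0213"  # Python's codec name for EUC-JISX0213


def try_encode(ch):
    """Return the encoded bytes, or None if the codec rejects the character."""
    try:
        return ch.encode(CODEC)
    except UnicodeEncodeError:
        return None


def describe(ch):
    """Format one line: code point, Unicode name, and encoded bytes (hex)."""
    encoded = try_encode(ch)
    shown = "not encodable" if encoded is None else encoded.hex()
    return "U+%04X %s -> %s" % (ord(ch), unicodedata.name(ch), shown)


if __name__ == "__main__":
    # YEN SIGN, FULLWIDTH YEN SIGN, OVERLINE, FULLWIDTH MACRON
    for ch in ("\u00a5", "\uffe5", "\u203e", "\uffe3"):
        print(describe(ch))

    # Reverse direction: which code point does the codec pick for \xa1\xef?
    decoded = b"\xa1\xef".decode(CODEC)
    print("a1ef decodes to U+%04X %s" % (ord(decoded), unicodedata.name(decoded)))
```

Returning None instead of letting UnicodeEncodeError propagate keeps the output
to one line per code point, which makes it easier to compare against the
charmap entries quoted above.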