This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/22371] U+FFE2 and U+FFE4, iconv does not convert to HALFWIDTH(EUC-JISX0213)


https://sourceware.org/bugzilla/show_bug.cgi?id=22371

--- Comment #6 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Akira Nakajima from comment #5)
> \xa1\xef is mapped to U+00A5(HALFWIDTH YEN) in EUC-JISX0213 and EUC-JIS-2004
> by following URL.
> \xa1\xb1 is as same.
> 
> "However, with Unicode 3.2.0 the mappings differ in 3 codepoints."
> http://search.cpan.org/~dankogai/Encode-JIS2K-0.03/JIS2K.pm#what_is_JIS_X_0213_anyway?
> 
> =============================================
> http://charset.uic.jp/show/eucjisx0213/
> http://x0213.org/codetable/euc-jis-2004-with-char.txt
> char    JIS     Unicode
>  ̄	0xA1B1	U+203E	# OVERLINE	Windows: U+FFE3
> ―	0xA1BD	U+2014	# EM DASH	Windows: U+2015
> ¥	0xA1EF	U+00A5	# YEN SIGN	Windows: U+FFE5
> =============================================

This is not an authoritative source, and I do not believe it is correct. Notice
the comment that on Windows the mapping is U+FFE5, which is the FULLWIDTH YEN
SIGN. That is, in my opinion, the correct mapping. Windows and Linux therefore
use the same representation, i.e. FULLWIDTH YEN SIGN. I suggest discussing this
with the author of the document.

> =============================================
> perl 5.24.3
> 
> # perl -e 'use Encode; use Encode::JISX0213; print encode("euc-jisx0213", "\x{00a5}");' | od -tx1
> 0000000 a1 ef
> # perl -e 'use Encode; use Encode::JISX0213; print encode("euc-jisx0213", "\x{ffe5}");' | od -tx1
> 0000000 a1 ef
> =============================================

This behaviour is expected. You are encoding a Unicode code point into
EUC-JISX0213. There is no representation of the (halfwidth) YEN SIGN, so the
output is the FULLWIDTH YEN SIGN in both cases, i.e. /xa1/xef.

> But Python and "/usr/local/share/i18n/charmaps/EUC-JISX0213.gz"
>  have mapping to U+FFE5.
> I don't know which one is correct.

This is correct, and it matches the Windows mapping.

> =============================================
> Python 3.6.2
> 
> # python3 -c "print(u'\u00a5'.encode('euc-jisx0213'))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'euc_jisx0213' codec can't encode character '\xa5' in
> position 0: illegal multibyte sequence

Correct. There is no (halfwidth) YEN SIGN in EUC-JISX0213. This Python
behaviour is based on glibc's character map.
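
For anyone who wants to repeat the Python check in a script, here is a minimal
sketch. The helper name try_encode is hypothetical, introduced only for this
illustration; it wraps the standard str.encode call and reports a failed
conversion as None instead of raising. It reflects the Python codec behaviour
quoted in this thread and says nothing about what iconv itself does.

```python
from typing import Optional


def try_encode(text: str, codec: str = "euc_jisx0213") -> Optional[bytes]:
    """Return the encoded bytes, or None if the codec cannot encode the text.

    try_encode is a hypothetical helper for illustration only.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # U+00A5 YEN SIGN and U+FFE5 FULLWIDTH YEN SIGN
    for ch in ("\u00a5", "\uffe5"):
        print("U+%04X -> %r" % (ord(ch), try_encode(ch)))
```

With the Python version quoted in this thread, the first conversion fails and
the second yields the bytes a1 ef.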

> # python3 -c "print(u'\uffe5'.encode('euc-jisx0213'))"
> b'\xa1\xef'
> =============================================
> 
> =============================================
> /usr/local/share/i18n/charmaps/EUC-JISX0213.gz (Fedora 26)
> 
> <UFFE3>     /xa1/xb1     FULLWIDTH MACRON
> <UFFE5>     /xa1/xef     FULLWIDTH YEN SIGN
> =============================================

These are IMO correct.
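
To compare charmap entries like the two above programmatically, a small parser
can help. This is only a sketch: the function name parse_charmap_line and the
regular expression are assumptions made for this example, and they cover only
the simple "<Uxxxx> /x../x.. NAME" line shape shown here, not the full charmap
file syntax.

```python
import re
from typing import Optional, Tuple

# Matches lines such as: <UFFE5>     /xa1/xef     FULLWIDTH YEN SIGN
# (assumed shape; ranges, comments and headers in real charmaps are ignored)
_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)\s+(.*\S)\s*$"
)


def parse_charmap_line(line: str) -> Optional[Tuple[int, bytes, str]]:
    """Return (code point, encoded bytes, name), or None if the line does not match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    # "/xa1/xef" -> "a1ef" -> b"\xa1\xef"
    encoded = bytes.fromhex(m.group(2).replace("/x", ""))
    return code_point, encoded, m.group(3)


if __name__ == "__main__":
    for entry in (
        "<UFFE3>     /xa1/xb1     FULLWIDTH MACRON",
        "<UFFE5>     /xa1/xef     FULLWIDTH YEN SIGN",
    ):
        print(parse_charmap_line(entry))
```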

In my previous post I referenced *official* Japanese ISO-IR documents, and I
will reference them again:

See page 3 of the PDF, and note that the 8th bit is always set:
https://www.itscj.ipsj.or.jp/iso-ir/228.pdf

You can see that the macron is a FULLWIDTH MACRON, and for the yen sign the
FULLWIDTH YEN SIGN is selected because it provides compatibility with Windows.
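
To locate these two characters in a row/cell code table, the EUC bytes can be
converted by clearing the 8th bit and subtracting 0x20 from each byte. The
sketch below assumes a plain two-byte sequence with both bytes in the range
0xA1..0xFE; the function name euc_to_row_cell is hypothetical and is not part
of glibc or Python.

```python
from typing import Tuple


def euc_to_row_cell(euc: bytes) -> Tuple[int, int]:
    """Convert a two-byte EUC sequence to a (row, cell) pair.

    euc_to_row_cell is a hypothetical helper for illustration only.
    """
    if len(euc) != 2 or not all(0xA1 <= b <= 0xFE for b in euc):
        raise ValueError("expected two bytes in the range 0xA1..0xFE")
    # Clearing the 8th bit gives the 7-bit code (0x21..0x7E);
    # subtracting 0x20 gives the 1-based row and cell numbers.
    return (euc[0] & 0x7F) - 0x20, (euc[1] & 0x7F) - 0x20


if __name__ == "__main__":
    print(euc_to_row_cell(b"\xa1\xb1"))  # macron position
    print(euc_to_row_cell(b"\xa1\xef"))  # yen sign position
```

For the bytes discussed here, a1 b1 corresponds to row 1, cell 17, and a1 ef
to row 1, cell 79.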

