This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

iconv and combining characters

From: Chris Heath <chris at heathens dot co dot nz>
To: libc-alpha at gnu dot org
Date: Sun, 18 Jan 2004 13:42:44 -0500
Subject: iconv and combining characters

Hi,

I noticed that iconv cannot convert UTF-8 text containing combining
characters into Latin1, even when the combined character exists in Latin1.
I really think that iconv should be able to do this.

> printf 'A\xCC\x80' | iconv -f UTF-8 -t L1
Aiconv: illegal input sequence at position 1

(\xCC\x80 is UTF-8 for U+0300 COMBINING GRAVE ACCENT.)

The same is true for most other 8-bit encodings.  The CP1255 converter, on
the other hand, accepts both the precomposed and the decomposed form:

> printf '\xEF\xAC\x9D' | iconv -f UTF-8 -t CP1255 | od -tx1
0000000 e9 c4
0000002
> printf '\xD7\x99\xD6\xB4' | iconv -f UTF-8 -t CP1255 | od -tx1
0000000 e9 c4
0000002

(\xEF\xAC\x9D is UTF-8 for U+FB1D HEBREW LETTER YOD WITH HIRIQ.)
(\xD7\x99 is UTF-8 for U+05D9 HEBREW LETTER YOD.)
(\xD6\xB4 is UTF-8 for U+05B4 HEBREW POINT HIRIQ.)

So, my main question here is: do you agree that we should make the L1 converter
do the same kind of thing?

Now for a side issue.  When I did the same conversions from UTF-8 to CP1255
in a C program, I noticed that iconv() returned 0 in both instances.  Shouldn't
the second one return a non-zero value, since that conversion is not reversible?
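
For anyone who wants to reproduce the return-value observation, here is a
minimal sketch of how it could be checked from C.  It is not the program
used above.  The helper names convert_from_utf8() and report() are
hypothetical, the 16-byte output buffer is an arbitrary choice, and the
availability of a "CP1255" converter depends on the installed iconv
implementation.  Only iconv_open(), iconv() and iconv_close() are real
library calls.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): convert `inlen` bytes of UTF-8
   to the charset `tocode`, writing at most `outcap` bytes into `out` and
   storing the number of bytes written in *outlen.
   Returns iconv()'s own result: the number of non-reversible conversions
   on success, or (size_t)-1 with errno set on failure. */
size_t convert_from_utf8(const char *tocode,
                         const char *in, size_t inlen,
                         char *out, size_t outcap, size_t *outlen)
{
    assert(tocode != NULL && in != NULL && out != NULL && outlen != NULL);
    *outlen = 0;

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* errno was set by iconv_open() */

    /* iconv() takes char ** for the input, hence the cast. */
    char *inp = (char *)in;
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* On success, also flush any state the converter may have buffered. */
    if (rc != (size_t)-1 &&
        iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        rc = (size_t)-1;

    int saved_errno = errno;        /* iconv_close() may change errno */
    *outlen = (size_t)(outp - out);
    iconv_close(cd);
    errno = saved_errno;
    return rc;
}

/* Hypothetical helper: print what iconv() reported for one input. */
void report(const char *label, const char *tocode,
            const char *in, size_t inlen)
{
    char out[16];                   /* arbitrary size, enough for the samples */
    size_t outlen;
    size_t rc = convert_from_utf8(tocode, in, inlen, out, sizeof out, &outlen);

    if (rc == (size_t)-1)
        printf("%s: failed: %s\n", label, strerror(errno));
    else
        printf("%s: %zu non-reversible conversion(s), %zu byte(s) written\n",
               label, rc, outlen);
}
```

Calling report("precomposed", "CP1255", "\xEF\xAC\x9D", 3) and
report("decomposed", "CP1255", "\xD7\x99\xD6\xB4", 4) from main() prints the
value iconv() returns for each of the two inputs, so the two cases can be
compared directly.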

Chris
