This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Re: iconv and combining characters

From: Bruno Haible <bruno at clisp dot org>
To: Chris Heath <chris at heathens dot co dot nz>
Cc: libc-alpha at sources dot redhat dot com
Date: Tue, 20 Jan 2004 20:40:37 +0100
Subject: Re: iconv and combining characters

Hi Chris,

> I noticed that iconv isn't able to convert UTF-8 containing combining
> characters into Latin1.  I really think that iconv should be able to do this.

Why? The preferred form for exchanging Unicode strings is Normalization
Form C (NFC), see [1], [2].

> do you agree that we should make the L1 converter
> do the same kind of thing?

No. It's better if you avoid generating Unicode strings which are not in NFC.
This way, you'll not only avoid problems with iconv, you'll also avoid
problems with XML and HTML parsers and lots of other software.
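
A minimal sketch of this advice, assuming the caller is free to normalize the
text before converting it. It uses Python's standard unicodedata module and
built-in codecs rather than iconv, and the helper name to_latin1 is made up
for illustration:

```python
import unicodedata


def to_latin1(text: str) -> bytes:
    """Compose the text to NFC, then encode it as Latin-1 (ISO-8859-1)."""
    # NFC merges a base letter and a following combining mark into the
    # precomposed character wherever Unicode defines one.
    composed = unicodedata.normalize("NFC", text)
    # Encoding raises UnicodeEncodeError if a character has no Latin-1 form.
    return composed.encode("latin-1")


decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

print(to_latin1(decomposed))   # b'\xe9'
print(to_latin1(precomposed))  # b'\xe9'
```

Both spellings end up as the same single Latin-1 byte, because the combining
sequence is composed before the converter ever sees it.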

> But on the other hand, the CP1255 converter handles it either way:

Interesting. Probably the authors thought, like you do now, that handling
combining characters on input is better than not handling them.

> When I did the same conversions from UTF-8 to CP1255
> in a C program, I noticed that iconv returned 0 in both instances.  Shouldn't
> the second one return a non-zero value since it is irreversible?

Good question as well. Actually the term in POSIX is "non-identical"
conversions, not "irreversible" conversions. If you consider the combined
and decomposed forms as the same, then the return value should be 0. If
you consider them different, then the return value should be 1. I don't see
convincing arguments for either choice.
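
To make the two readings concrete, here is a rough sketch that round-trips a
string through Latin-1 and compares the result with the input in both ways.
It does not call iconv; the function name compare_round_trip and the choice
of Latin-1 as the target are assumptions made only for illustration:

```python
import unicodedata


def compare_round_trip(text: str) -> tuple[bool, bool]:
    """Return (code-point identical, canonically equivalent) after a round trip."""
    # Compose first so that combining sequences can be encoded at all.
    encoded = unicodedata.normalize("NFC", text).encode("latin-1")
    restored = encoded.decode("latin-1")
    # Strict view: the restored string must match code point for code point.
    identical = restored == text
    # Lenient view: both strings must agree once each is normalized to NFC.
    equivalent = (unicodedata.normalize("NFC", restored)
                  == unicodedata.normalize("NFC", text))
    return identical, equivalent


print(compare_round_trip("\u00e9"))   # (True, True)
print(compare_round_trip("e\u0301"))  # (False, True)
```

Under the strict view the decomposed input counts as one non-identical
conversion; under the lenient view it counts as none.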

Bruno


[1] http://www.unicode.org/reports/tr15/
[2] http://www.w3.org/TR/charmod/#sec-ChoiceNFC
