This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


Re: iconv and combining characters


Hi Chris,

> I noticed that iconv isn't able to convert UTF-8 containing combining
> characters into Latin1.  I really think that iconv should be able to do this.

Why? The preferred way to exchange Unicode strings is in Normalization
Form C (NFC), see [1], [2].

> do you agree that we should make the L1 converter
> do the same kind of thing?

No. It's better if you avoid generating Unicode strings which are not in NFC.
That way you will not only avoid problems with iconv, but also with XML and
HTML parsers and lots of other software.
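
To illustrate the point about producing NFC in the first place, here is a
minimal sketch in Python (not part of glibc or iconv). It composes the text
with the standard unicodedata module before encoding it as Latin1. The helper
name to_latin1 is made up for this example.

```python
import unicodedata


def to_latin1(text: str) -> bytes:
    """Compose text to NFC, then encode it as ISO-8859-1 (Latin1).

    Hypothetical helper for illustration only. Raises UnicodeEncodeError
    if a character is still not representable after composition.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("latin-1")


# "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "cafe\u0301"
print(to_latin1(decomposed))  # b'caf\xe9'
```

Without the normalize() call, encoding the decomposed string fails, because
U+0301 on its own has no Latin1 equivalent.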

> But on the other hand, the CP1255 converter handles it either way:

Interesting. Probably the authors thought, like you do now, that handling of
combining characters on input is better than not handling them.

> When I did the same conversions from UTF-8 to CP1255
> in a C program, I noticed that iconv returned 0 in both instances.  Shouldn't
> the second one return a non-zero value since it is irreversible?

Good question as well. Actually the term in POSIX is "non-identical"
conversions, not "irreversible" conversions. If you consider the combined
and decomposed forms to be the same, then the return value should be 0. If
you consider them different, then the return value should be 1. I don't see
convincing arguments for either choice.
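
The two viewpoints can be shown with a small Python sketch. This is only an
illustration of the distinction, not of how any iconv converter counts
conversions, and both function names are invented for the example.

```python
import unicodedata


def same_code_points(a: str, b: str) -> bool:
    """Strict view: two strings match only if their code points match."""
    return a == b


def canonically_equivalent(a: str, b: str) -> bool:
    """Unicode view: two strings match if their NFC forms are identical."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


combined = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

print(same_code_points(combined, decomposed))        # False
print(canonically_equivalent(combined, decomposed))  # True
```

Under the strict view the conversion changed the text, so it would count as
non-identical. Under the canonical-equivalence view nothing changed.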

Bruno


[1] http://www.unicode.org/reports/tr15/
[2] http://www.w3.org/TR/charmod/#sec-ChoiceNFC

