This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: iconv and combining characters


Bruno,

> Hi Chris,
> 
> > I noticed that iconv isn't able to convert UTF-8 containing combining
> > characters into Latin1.  I really think that iconv should be able to do this.
> 
> Why? The preferred way of exchange of Unicode strings is in normalization
> form C, see [1], [2].

I agree with that.  But will everyone follow that rule?  Sooner or later,
you will come across a file or string that is not in NFC, and I think it
would be very useful if iconv could handle it.

I guess I like to live by the motto "be lenient in what you accept,
conservative with what you generate". In other words, I would want iconv
to handle any unnormalized Unicode, but would expect it to generate only
NFC.

Since non-NFC Unicode is rather uncommon, another less intrusive
approach may be better: use a separate codeset name for Unicode that may
be non-NFC.  Something like:
   iconv -f UTF-8-UNNORMALIZED -t L1
This has the advantage of not having any speed/memory penalty for those
who know their data is NFC.  Also the normalization could be programmed
just once in an INTERNAL-UNNORMALIZED -> INTERNAL transcoder.
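
To make the idea concrete, here is a minimal sketch in Python rather than as a gconv module. The function name to_latin1_lenient is hypothetical and only illustrates the "normalize once, then transcode" pipeline. It is not how glibc implements or would implement it.

```python
import unicodedata

def to_latin1_lenient(data: bytes) -> bytes:
    """Decode UTF-8, normalize to NFC, then encode as Latin-1.

    Hypothetical helper: it mimics an UNNORMALIZED -> NFC step
    placed in front of an ordinary Latin-1 converter.
    """
    text = data.decode("utf-8")
    # Normalization step: composes e.g. U+0065 U+0301 into U+00E9.
    composed = unicodedata.normalize("NFC", text)
    # Raises UnicodeEncodeError if a character still has no Latin-1 form.
    return composed.encode("latin-1")

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT (not NFC)
decomposed = "cafe\u0301".encode("utf-8")
print(to_latin1_lenient(decomposed))  # b'caf\xe9'
```

Input that is already NFC passes through the normalization step unchanged, so only callers who ask for the lenient path pay for it.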

If this is something you think would be appropriate to add to the gconv
converter collection, I would be happy to work on it.

> > do you agree that we should make the L1 converter
> > do the same kind of thing?
> 
> No. It's better if you avoid generating Unicode strings which are not in NFC.
> This way, you'll not only get no problems with iconv, you'll also avoid
> problems with XML and HTML parsers and lots of other software.

Agreed.  But I'm talking about reading non-NFC Unicode, not generating
it.

> > But on the other hand, the CP1255 converter handles it either way:
> 
> Interesting. Probably the authors thought, like you do now, that handling of
> combining characters on input is better than not handling them.

Moreover, I just noticed that U+FB1D HEBREW LETTER YOD WITH HIRIQ is in
the composition exclusion list, which means iconv is not producing NFC
in this case.

$ printf '\xE9\xC4' | iconv -f CP1255 -t UTF-8 | od -tx1
0000000 ef ac 9d
0000003
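
The exclusion can be checked independently of iconv. The following is a small sketch using Python's unicodedata module, which is an assumption of this illustration and not part of glibc. It shows that NFC of U+FB1D is the two-character sequence, and that the bytes ef ac 9d above are the UTF-8 encoding of U+FB1D.

```python
import unicodedata

precomposed = "\uFB1D"  # HEBREW LETTER YOD WITH HIRIQ

# U+FB1D is a composition exclusion, so NFC leaves it decomposed.
nfc = unicodedata.normalize("NFC", precomposed)
print([f"U+{ord(c):04X}" for c in nfc])                # ['U+05D9', 'U+05B4']
print(unicodedata.is_normalized("NFC", precomposed))   # False

# The od output above is the UTF-8 encoding of the precomposed character.
print(precomposed.encode("utf-8").hex(" "))            # ef ac 9d

# Python's cp1255 codec maps byte by byte and does no composition.
print(b"\xE9\xC4".decode("cp1255") == nfc)             # True
```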

> > When I did the same conversions from UTF-8 to CP1255
> > in a C program, I noticed that iconv returned 0 in both instances.  Shouldn't
> > the second one return a non-zero value since it is irreversible?
> 
> Good question as well. Actually the term in POSIX is "non-identical"
> conversions, not "irreversible" conversions. If you consider the combined
> and decomposed forms as the same, then the return value should be 0. If
> you consider it different, then the return value should be 1. I don't see
> convincing arguments for either choice.

OK, yes, I could go either way with this, too.

Chris


> Bruno
> 
> 
> [1] http://www.unicode.org/reports/tr15/
> [2] http://www.w3.org/TR/charmod/#sec-ChoiceNFC


