This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
On Sep 27 11:22, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > It never occurred to me that wcrtomb could return 0 and the calling
> > functions like wcsnrtombs would simply proceed.  I'll have a look
> > to change __utf8_wctomb accordingly.
>
> Two further thoughts on allowing lone surrogates:
> - __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
> high surrogate followed by a three-byte char.
In newlib (and thus Cygwin) __mb_cur_max is already 6 for UTF-8.
> - Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
> roundtrip to a single-byte xx. Changing the code to something else
> than DCxx wouldn't help.
I don't understand this one. That's not what I observe after I have
changed the __utf8_wctomb and __utf8_mbtowc functions accordingly.
A single byte 0x80 gets encoded to U+DC80. The round trip results
in \xed\xb2\x80.
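
To make the byte arithmetic behind that observation concrete, here is a
minimal sketch. It is not the newlib code: byte_to_dcxx and encode_3byte
are hypothetical helpers, and the sketch assumes the generic three-byte
UTF-8 pattern is applied to the surrogate value without rejecting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a newlib function): map a byte that is not
   valid UTF-8 on its own (0x80..0xFF) to the lone low surrogate U+DCxx. */
static uint16_t byte_to_dcxx(unsigned char b)
{
    assert(b >= 0x80);
    return (uint16_t)(0xDC00 | b);
}

/* Hypothetical helper (not a newlib function): apply the generic
   three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx to a 16-bit value
   in the range 0x0800..0xFFFF. Assumption: surrogates are deliberately
   not rejected here. Returns the number of bytes written. */
static size_t encode_3byte(uint16_t wc, unsigned char out[3])
{
    assert(wc >= 0x0800);
    out[0] = (unsigned char)(0xE0 | (wc >> 12));         /* top 4 bits    */
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F)); /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));        /* low 6 bits    */
    return 3;
}
```

Calling byte_to_dcxx(0x80) gives 0xDC80, and encode_3byte on that value
writes the bytes ed b2 80, which matches the sequence quoted above.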
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat