This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


On Sep 27 11:22, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > It never occurred to me that wcrtomb could return 0 and the calling
> > functions like wcsnrtombs would simply proceed.  I'll have a look
> > to change __utf8_wctomb accordingly.
> 
> Two further thoughts on allowing lone surrogates:
> - __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
> high surrogate followed by a three-byte char.

In newlib (and thus Cygwin) __mb_cur_max is already 6 for UTF-8.

> - Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
> roundtrip to a single-byte xx. Changing the code to something else
> than DCxx wouldn't help.

I don't understand this one.  That's not what I observe after I have
changed the __utf8_wctomb and __utf8_mbtowc functions accordingly.
A single byte 0x80 gets encoded to U+DC80.  The round trip results
in \xed\xb2\x80.
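
The byte values above can be checked by hand.  What follows is only a
minimal sketch of the arithmetic, not the actual newlib code: both helper
names are hypothetical, and it assumes the DCxx scheme simply ORs the
undecodable byte into 0xDC00 and that the encoder applies the ordinary
three-byte UTF-8 layout without rejecting surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a newlib function): map an undecodable
   input byte to a lone low surrogate U+DCxx, as in the DCxx scheme. */
static unsigned int byte_to_lone_surrogate(unsigned char b)
{
    return 0xDC00u | b;
}

/* Hypothetical helper (not a newlib function): ordinary three-byte
   UTF-8 layout for U+0800..U+FFFF, with no special-casing of
   surrogates.  Returns the number of bytes written, or 0 if wc does
   not fit the three-byte form. */
static size_t encode_utf8_3byte(unsigned int wc, unsigned char out[3])
{
    assert(out != NULL);
    if (wc < 0x800u || wc > 0xFFFFu)
        return 0;
    out[0] = (unsigned char)(0xE0u | (wc >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80u | ((wc >> 6) & 0x3Fu)); /* 10xxxxxx */
    out[2] = (unsigned char)(0x80u | (wc & 0x3Fu));        /* 10xxxxxx */
    return 3;
}
```

Passing byte_to_lone_surrogate(0x80) to encode_utf8_3byte() fills the
buffer with the three bytes 0xed 0xb2 0x80, which is distinct from the
original single byte 0x80.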


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

