This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: GB18030 (was: Re: charset changes)

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 27 Mar 2010 19:15:19 +0000
Subject: Re: GB18030 (was: Re: charset changes)
References: <416096c61003271002t330ee9ecned88f73ef3b4face@mail.gmail.com> <20100327180157.GC18364@calimero.vinschen.de>

On 27 March 2010 18:01, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
>> On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
>> you get back the codepage's UnicodeDefaultChar followed by the digit
>> '3'. XP did something else, but I can't remember exactly what.
>
> Heh, ok. It never occurred to me to test the content of the target
> buffer if MultiByteToWideChar failed anyway.

It only fails if the MB_ERR_INVALID_CHARS flag is set.


>> How about implementing __gb18030_mbtowc/wctomb in newlib, which would
>> handle all the mbstate stuff, with the actual encoding and decoding
>> factored out into functions like this:
>>
>> size_t __gb18030_encode(char *dst, const wchar_t *src, size_t
>> src_len): Pass in one codepoint, consisting of one or two wchars
>> (always one in case of a 32-bit wchar_t). Return the length of the
>> resulting multibyte sequence.
>>
>> size_t __gb18030_decode(wchar_t *dst, const char *src, size_t
>> src_len): Pass in a valid multibyte sequence. Return the number of
>> wchars needed to represent it.
>>
>> On Cygwin, these would be straightforward wrappers around
>> WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
>> implemented in winsup. For other newlib targets, we could take a
>> similar approach as with doublebyte charsets, where multibyte
>> sequences are mapped to a non-Unicode wchar_t representation by simply
>> packing the bytes into the wchar_t.
>
> Yet another function call for every single character:
> http://sourceware.org/ml/newlib/2009/msg01033.html

That call would only be necessary for non-ASCII characters, and I
don't think it would be terribly significant compared to the magic
that WideCharToMultiByte and MultiByteToWideChar need to do.
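
For the non-Cygwin fallback mentioned in the quoted proposal, a rough
sketch of the byte-packing idea might look like the following. The
names gb18030_seq_len and gb18030_pack are made up for illustration and
are not existing newlib functions. It uses uint32_t rather than wchar_t
on the assumption that a four-byte sequence needs a 32-bit container,
and it only checks the byte ranges of the sequence, not whether the
code is actually assigned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the GB18030 sequence that
   starts at src, or 0 if the sequence is malformed or truncated. */
static size_t gb18030_seq_len(const unsigned char *src, size_t src_len)
{
    if (src_len == 0)
        return 0;
    if (src[0] <= 0x7F)                     /* single byte (ASCII range) */
        return 1;
    if (src[0] == 0x80 || src[0] == 0xFF)   /* not valid lead bytes */
        return 0;
    if (src_len < 2)
        return 0;
    /* Two-byte form: trail byte 0x40..0x7E or 0x80..0xFE. */
    if (src[1] >= 0x40 && src[1] <= 0xFE && src[1] != 0x7F)
        return 2;
    /* Four-byte form: digit, 0x81..0xFE, digit. */
    if (src[1] >= 0x30 && src[1] <= 0x39
        && src_len >= 4
        && src[2] >= 0x81 && src[2] <= 0xFE
        && src[3] >= 0x30 && src[3] <= 0x39)
        return 4;
    return 0;
}

/* Hypothetical helper: pack the len bytes of a sequence, first byte
   in the most significant position, into one 32-bit value. */
static uint32_t gb18030_pack(const unsigned char *src, size_t len)
{
    uint32_t packed = 0;
    for (size_t i = 0; i < len; i++)
        packed = (packed << 8) | src[i];
    return packed;
}
```

With a 16-bit wchar_t this packing would only cover the one- and
two-byte forms, so a target like that would need some other scheme for
the four-byte sequences.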

Andy


ps: Btw, speaking of performance issues, the 8-bit charsets are rather
inefficient because for every single non-ASCII character they parse
the charset name to obtain a charset table index. Storing that index
alongside the name might make quite a big difference.
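
To illustrate what I mean by storing the index, here is a minimal
sketch. The struct, its fields and both functions are hypothetical and
do not reflect how newlib actually stores the charset; the parser only
understands "ISO-8859-N" names to keep the example short. The point is
just that the name is parsed once when the charset is set, and the
per-character conversion code reads the cached integer instead.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical charset state: the name plus the table index derived
   from it, so the name need not be re-parsed per character. */
struct charset_state {
    char name[32];
    int table_index;   /* -1 if the charset has no 8-bit table */
};

/* Hypothetical parser: the expensive step that should run only once. */
static int parse_table_index(const char *name)
{
    if (strncmp(name, "ISO-8859-", 9) == 0)
        return atoi(name + 9);
    return -1;
}

/* Set the charset and cache its table index alongside the name. */
static void charset_set(struct charset_state *cs, const char *name)
{
    strncpy(cs->name, name, sizeof cs->name - 1);
    cs->name[sizeof cs->name - 1] = '\0';
    cs->table_index = parse_table_index(cs->name);
}
```

The conversion functions would then use cs->table_index directly for
every non-ASCII character.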

Follow-Ups:
- Re: GB18030 (was: Re: charset changes)
  - From: Corinna Vinschen

References:
- GB18030 (was: Re: charset changes)
  - From: Andy Koppe
- Re: GB18030 (was: Re: charset changes)
  - From: Corinna Vinschen
