This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: GB18030 (was: Re: charset changes)


On 27 March 2010 18:01, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
>> On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
>> you get back the codepage's UnicodeDefaultChar followed by the digit
>> '3'. XP did something else, but I can't remember exactly what.
>
> Heh, ok. It never occurred to me to test the content of the target
> buffer if MultiByteToWideChar failed anyway.

It only fails if the MB_ERR_INVALID_CHARS flag is set.


>> How about implementing __gb18030_mbtowc/wctomb in newlib, which would
>> handle all the mbstate stuff, with the actual encoding and decoding
>> factored out into functions like this:
>>
>> size_t __gb18030_encode(char *dst, const wchar_t *src, size_t
>> src_len): Pass in one codepoint, consisting of one or two wchars
>> (always one in case of a 32-bit wchar_t). Return the length of the
>> resulting multibyte sequence.
>>
>> size_t __gb18030_decode(wchar_t *dst, const char *src, size_t
>> src_len): Pass in a valid multibyte sequence. Return the number of
>> wchars needed to represent it.
>>
>> On Cygwin, these would be straightforward wrappers around
>> WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
>> implemented in winsup. For other newlib targets, we could take a
>> similar approach as with doublebyte charsets, where multibyte
>> sequences are mapped to a non-Unicode wchar_t representation by simply
>> packing the bytes into the wchar_t.
>
> Yet another function call for every single character:
> http://sourceware.org/ml/newlib/2009/msg01033.html

That call would only be necessary for non-ASCII characters, and I
don't think it would be terribly significant compared to the magic
that WideCharToMultiByte and MultiByteToWideChar need to do.
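
For the non-Cygwin fallback mentioned in the quoted proposal (packing the multibyte sequence straight into the wide character), a rough sketch might look like the following. All function names here are hypothetical and not existing newlib code; it assumes a 32-bit wide character (uint32_t stands in for wchar_t) and only checks the standard GB18030 byte ranges for one-, two- and four-byte sequences. The lone byte 0x80 is simply rejected in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the length (1, 2 or 4) of the GB18030
   sequence starting at src, or 0 if the bytes are not well-formed
   or the buffer is too short. */
size_t gb18030_seq_len(const unsigned char *src, size_t src_len)
{
    if (src_len == 0)
        return 0;
    if (src[0] < 0x80)
        return 1;                       /* ASCII */
    if (src[0] == 0x80 || src[0] == 0xFF)
        return 0;                       /* not a valid lead byte here */
    if (src_len < 2)
        return 0;
    if ((src[1] >= 0x40 && src[1] <= 0x7E) ||
        (src[1] >= 0x80 && src[1] <= 0xFE))
        return 2;                       /* two-byte sequence */
    if (src[1] >= 0x30 && src[1] <= 0x39 && src_len >= 4 &&
        src[2] >= 0x81 && src[2] <= 0xFE &&
        src[3] >= 0x30 && src[3] <= 0x39)
        return 4;                       /* four-byte sequence */
    return 0;
}

/* Hypothetical decode step: pack the bytes big-endian into one
   32-bit value. Returns the number of bytes consumed, 0 on error. */
size_t gb18030_pack(uint32_t *dst, const unsigned char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    size_t len = gb18030_seq_len(src, src_len);
    uint32_t packed = 0;
    for (size_t i = 0; i < len; i++)
        packed = (packed << 8) | src[i];
    if (len != 0)
        *dst = packed;
    return len;
}

/* Hypothetical encode step: recover the bytes from a packed value.
   The length follows from the magnitude, since lead bytes are >= 0x81.
   dst must have room for 4 bytes. Returns the number of bytes written. */
size_t gb18030_unpack(unsigned char *dst, uint32_t packed)
{
    assert(dst != NULL);
    size_t len = packed <= 0xFF ? 1 : packed <= 0xFFFF ? 2 : 4;
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)(packed >> (8 * (len - 1 - i)));
    return len;
}
```

The packed value is not a Unicode code point, so this only suits targets where wchar_t is treated as an opaque representation, as with the doublebyte charsets.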

Andy


ps: Btw, speaking of performance issues, the 8-bit charsets are rather
inefficient because for every single non-ASCII character they parse
the charset name to obtain a charset table index. Storing that index
alongside the name might make quite a big difference.
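
To illustrate the idea of storing the index alongside the name, here is a minimal sketch. The struct, the function names and the list of charsets are all made up for illustration and do not reflect how newlib actually stores or parses charset names; the point is only that the name is resolved once, when the charset is set, and the per-character path reads a cached integer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of 8-bit charsets; the position is the table index. */
static const char *const charset_names[] = { "ISO-8859-2", "CP1252", "KOI8-R" };
#define CHARSET_COUNT (sizeof charset_names / sizeof charset_names[0])

/* Hypothetical per-locale state: the name plus its cached table index. */
struct charset_state {
    char name[32];
    int table_index;    /* -1 if the name is unknown */
};

/* The slow step: resolve a charset name to its table index. */
int charset_lookup_index(const char *name)
{
    for (size_t i = 0; i < CHARSET_COUNT; i++)
        if (strcmp(charset_names[i], name) == 0)
            return (int)i;
    return -1;
}

/* Called once when the charset changes: store the name and cache the index. */
void charset_state_set(struct charset_state *st, const char *name)
{
    assert(st != NULL && name != NULL);
    strncpy(st->name, name, sizeof st->name - 1);
    st->name[sizeof st->name - 1] = '\0';
    st->table_index = charset_lookup_index(st->name);
}

/* Called per character: no name parsing, just the cached value. */
int charset_state_index(const struct charset_state *st)
{
    return st->table_index;
}
```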

