codeset problems in wprintf and wcsftime
Jeff Johnston
jjohnstn@redhat.com
Wed Feb 24 21:17:00 GMT 2010
On 24/02/10 04:17 AM, Corinna Vinschen wrote:
> On Feb 23 16:14, Jeff Johnston wrote:
>> On 20/02/10 10:59 AM, Corinna Vinschen wrote:
>>> AFAICS, there are two possible approaches to fix this problem:
>>>
>>> - Store the charset not only for LC_CTYPE, but for each localization
>>> category, and provide a function to request the charset.
>>> This also requires to store the associated multibyte to widechar
>>> conversion functions, obviously, and to call the correct functions
>>>      from wprintf and wcsftime.
>>>
>>> - Redefine the locale data structs so that they contain multibyte and
>>> widechar representations of all strings. Use the multibyte strings
>>> in the multibyte functions, the widechar strings in the widechar
>>> functions.
>>>
>>
>> This assumes that widechar output from separate mbtowc converters
>> can be concatenated and then converted back by a single wctomb
>> converter. Without that ability, the concatenated widechar string
>> is of no use to anybody who does not know where the charset
>> changes occur.
>>
>> IMO, this is "undefined behaviour".
>
> I don't understand. The wide char representation is Unicode. Why
> should it be a problem to use Unicode strings together, just because
> they are from different sources? Even if wchar_t is UTF-16, as on
> Cygwin, the strings are complete. There's no such thing as just one
> half of a surrogate.
>
So, you are saying that if I take the mbtowc output for EUC-JP in
current newlib, concatenate it with UTF-16 widechar output, and append
the mbtowc output for SJIS, a user can simply call wctomb() in newlib
and have it pull it all apart again? This obviously won't work for the
old eucjp and sjis versions of mbtowc/wctomb that Cygwin doesn't
currently use, but even so, I still see three versions of wctomb (utf8,
iso, and cp) that apply to Cygwin inside wctomb_r. Am I missing
something? How can one of these functions handle all types of wchar
input?
If the concatenated string cannot be passed to a single internal
version of wctomb() (i.e. the user has to call three different versions
of wctomb for the different charsets), then the user has to know where
each section begins in the full string, which makes the end result of
little use and thus not worth supporting.
> The advantage of having the strings available in wchar_t representation
> would be that the wcsftime and wprintf functions don't have to worry
> about charsets at all. The current solution, in contrast, requires a
> conversion from multibyte, which means you have to *know* which
> source charset was being used when creating these strings. Right
> now they only have information about one charset, which is the LC_CTYPE
> charset.
>
> In Glibc, as well as on Windows, the localization strings are originally
> stored in Unicode on disk, and Glibc stores the strings internally in multibyte
> and wchar_t representation. When Cygwin fetches the strings from Windows
> it has to convert them to multibyte since there is no wchar_t slot for
> the data, and following POSIX, it has to store them in the charset given
> for the locale category, LC_TIME, LC_MESSAGES, etc.
>
>> I think one could optionally flag an error either in the setlocale
>> routine or the wprintf routines themselves.
>
> Well, if the conversion doesn't work, vfwprintf just falls back to the
> defaults for the C locale and switches off grouping. That's probably
> the sanest thing to do.
> If wcsftime fails to convert the format string it returns 0, which is
> the defined error behaviour. In case of the new era and alt_digits
> strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will
> fail to store the era and alt_digits information and fall back to the
> default behaviour: %EC -> %C, %EY -> %Y, %OH -> %H, etc.
>
> That's probably ok, given the POSIX.1-2008 quote from Andy in
> http://sourceware.org/ml/newlib/2010/msg00146.html
> I just hoped we could do better.
>
>
> Corinna
>