codeset problems in wprintf and wcsftime
Jeff Johnston
jjohnstn@redhat.com
Wed Feb 24 21:17:00 GMT 2010
On 24/02/10 04:17 AM, Corinna Vinschen wrote:
> On Feb 23 16:14, Jeff Johnston wrote:
>> On 20/02/10 10:59 AM, Corinna Vinschen wrote:
>>> AFAICS, there are two possible approaches to fix this problem:
>>>
>>> - Store the charset not only for LC_CTYPE, but for each localization
>>> category, and provide a function to request the charset.
>>> This also requires to store the associated multibyte to widechar
>>> conversion functions, obviously, and to call the correct functions
>>>      from wprintf and wcsftime.
>>>
>>> - Redefine the locale data structs so that they contain multibyte and
>>> widechar representations of all strings. Use the multibyte strings
>>> in the multibyte functions, the widechar strings in the widechar
>>> functions.
>>>
>>
>> This assumes that widechar output from separate mbtowc converters
>> can be concatenated and then converted back by a single wctomb
>> converter. Without that ability, the concatenated widechar string
>> is of no use to anybody who does not know where the charset
>> changes occur.
>>
>> IMO, this is "undefined behaviour".
>
> I don't understand. The wide char representation is Unicode. Why
> should it be a problem to use Unicode strings together, just because
> they are from different sources? Even if wchar_t is UTF-16, as on
> Cygwin, the strings are complete. There's no such thing as just one
> half of a surrogate.
>
So, you are saying that if I take the mbtowc output for EUC-JP in
current newlib, concatenate it with UTF-16 widechar output, and append
the mbtowc output for SJIS, a user can simply call wctomb() in newlib
and have it pull it all apart again? This obviously won't work for the
old eucjp and sjis versions of mbtowc/wctomb that Cygwin doesn't
currently use, but even so, I still see three versions of wctomb (utf8,
iso, and cp) that apply to Cygwin inside wctomb_r. Am I missing
something? How can one of these functions handle all types of wchar
input?
If the concatenated string cannot be passed to a single internal
version of wctomb() (i.e. the user has to call three different versions
of wctomb for the different charsets), then the user has to know where
each section begins in the full string, which makes the end result of
little use and thus not worth supporting.
> The advantage of having the strings available in wchar_t representation
> would be that the wcsftime and wprintf functions don't have to worry
> about charsets at all. The current solution, in contrast, requires a
> conversion from multibyte, which means you have to *know* which
> source charset was being used when creating these strings. Right
> now they only have information about one charset, which is the LC_CTYPE
> charset.
>
> In Glibc, as well as on Windows, the localization strings are originally
> stored in Unicode on disk, and Glibc stores the strings internally in multibyte
> and wchar_t representation. When Cygwin fetches the strings from Windows
> it has to convert them to multibyte since there is no wchar_t slot for
> the data, and following POSIX, it has to store them in the charset given
> for the locale category, LC_TIME, LC_MESSAGES, etc.
>
>> I think one could optionally flag an error either in the setlocale
>> routine or the wprintf routines themselves.
>
> Well, if the conversion doesn't work, vfwprintf just falls back to the
> defaults for the C locale and switches off grouping. That's probably
> the sanest thing to do.
> If wcsftime fails to convert the format string it returns 0, which is
> the defined error behaviour. In case of the new era and alt_digits
> strings (http://sourceware.org/ml/newlib/2010/msg00153.html), it will
> fail to store the era and alt_digits information and fall back to the
> default behaviour: %EC -> %C, %EY -> %Y, %OH -> %H, etc.
>
> That's probably ok, given the POSIX.1-2008 quote from Andy in
> http://sourceware.org/ml/newlib/2010/msg00146.html
> I just hoped we could do better.
>
>
> Corinna
>