This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


codeset problems in wprintf and wcsftime

From: Corinna Vinschen <vinschen at redhat dot com>
To: newlib at sourceware dot org
Date: Sat, 20 Feb 2010 16:59:35 +0100
Subject: codeset problems in wprintf and wcsftime
Reply-to: newlib at sourceware dot org

Hi,

while working on finalizing locale support for Cygwin it suddenly
occurred to me that we have a problem in wprintf and wcsftime.

Let's assume a funny combination of localization variables in the user's
environment:

  LANG=de_DE.utf8
  LC_TIME=ja_JP.eucjp
  LC_NUMERIC=en_US.iso88591

Yes, it's pretty unlikely, but nevertheless possible and valid.

So, at setlocale time we read and store the localized strings in the
codeset specified by the localization variable:

  - __locale_charset()             returns UTF-8
  - __get_current_time_locale()    returns data stored in EUC-JP
  - __get_current_numeric_locale() returns data stored in ISO-8859-1
  - localeconv()                   returns with decimal_point and
                                   thousands_sep stored in ISO-8859-1,
                                   and all other strings from the
                                   LC_MONETARY category in UTF-8.
  - nl_langinfo()                  CODESET is UTF-8,
                                   strings from the LC_TIME category are
                                   returned in EUC-JP,
                                   strings from LC_MESSAGES are returned
                                   in UTF-8,
                                   RADIXCHAR and THOUSEP are returned in
                                   ISO-8859-1.

This is no problem at all as long as you call the multibyte variants
printf and strftime.  The user gets what she asked for, and who are we
to question the reason behind this choice?

However, it is a problem in the wprintf and wcsftime functions.  The
problem is that we have decimal_point, thousands_sep and all the LC_TIME
variables stored in some arbitrary multibyte codeset.  Since we need the
widechar representation, wprintf and wcsftime have to convert the
strings using some mbtowc function.  But the mbtowc functions always
assume the multibyte charset defined by __locale_charset().

Consequently, the conversion results in invalid strings.
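
To make the failure concrete, here is a minimal sketch in C.  The two
decoders, latin1_to_wc and utf8_to_wc, are hypothetical stand-ins for
the charset-specific mbtowc functions, not newlib's actual code.  The
byte 0xA0 (NO-BREAK SPACE in ISO-8859-1) is only an illustrative
separator character, it is not taken from any real locale data.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Decode one ISO-8859-1 character.  Every byte maps directly to the
   code point of the same value, so exactly one byte is consumed.
   Returns 1, or -1 if there is no input.  */
int
latin1_to_wc (wchar_t *pwc, const unsigned char *s, size_t n)
{
  assert (pwc != NULL && s != NULL);
  if (n == 0)
    return -1;
  *pwc = (wchar_t) s[0];
  return 1;
}

/* Decode one UTF-8 character.  To keep the sketch short only 1- and
   2-byte sequences are handled; anything else is reported as -1.
   Returns the number of bytes consumed, or -1.  */
int
utf8_to_wc (wchar_t *pwc, const unsigned char *s, size_t n)
{
  assert (pwc != NULL && s != NULL);
  if (n == 0)
    return -1;
  if (s[0] < 0x80)
    {
      *pwc = (wchar_t) s[0];
      return 1;
    }
  if ((s[0] & 0xe0) == 0xc0 && n >= 2 && (s[1] & 0xc0) == 0x80)
    {
      wchar_t wc = (wchar_t) (((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
      if (wc < 0x80)
        return -1;              /* reject overlong encodings */
      *pwc = wc;
      return 2;
    }
  return -1;                    /* lone continuation byte or unsupported */
}
```

Feeding the single byte 0xA0 to latin1_to_wc yields U+00A0, while
utf8_to_wc rejects the same byte, because 0xA0 on its own is a
continuation byte in UTF-8.  That is what happens when a string stored
in the LC_NUMERIC codeset is converted with the LC_CTYPE conversion
function.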

AFAICS, there are two possible approaches to fix this problem:

- Store the charset not only for LC_CTYPE, but for each localization
  category, and provide a function to request the charset.
  This also requires storing the associated multibyte to widechar
  conversion functions, obviously, and calling the correct functions
  from wprintf and wcsftime.

- Redefine the locale data structs so that they contain multibyte and
  widechar representations of all strings.  Use the multibyte strings
  in the multibyte functions, the widechar strings in the widechar
  functions.
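
A rough sketch of what the first approach could look like follows.
All names here (decode_fn, sketch_category, category_charset_name,
category_mbstowcs) are made up for illustration and do not exist in
newlib, and the ASCII decoder merely stands in for a real UTF-8 or
EUC-JP conversion function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical per-character decoder type, modelled on mbtowc.  */
typedef int (*decode_fn) (wchar_t *pwc, const unsigned char *s, size_t n);

/* ISO-8859-1: one byte, same value as the code point.  */
int
decode_latin1 (wchar_t *pwc, const unsigned char *s, size_t n)
{
  if (n == 0)
    return -1;
  *pwc = (wchar_t) s[0];
  return 1;
}

/* ASCII only: placeholder for a real multibyte decoder.  */
int
decode_ascii (wchar_t *pwc, const unsigned char *s, size_t n)
{
  if (n == 0 || s[0] >= 0x80)
    return -1;
  *pwc = (wchar_t) s[0];
  return 1;
}

enum sketch_category { SK_CTYPE, SK_NUMERIC, SK_TIME, SK_NCATS };

/* One charset name and one conversion function per category.  */
struct category_charset
{
  const char *name;
  decode_fn decode;
};

static struct category_charset cat_charset[SK_NCATS] = {
  { "ASCII", decode_ascii },            /* SK_CTYPE */
  { "ISO-8859-1", decode_latin1 },      /* SK_NUMERIC */
  { "ASCII", decode_ascii },            /* SK_TIME */
};

/* Hypothetical accessor to request the charset of a category.  */
const char *
category_charset_name (enum sketch_category cat)
{
  return cat_charset[cat].name;
}

/* Convert SRC to wide characters using the decoder of category CAT.
   Returns the number of wide characters written, not counting the
   terminator, or (size_t) -1 on invalid input or a too small buffer.  */
size_t
category_mbstowcs (enum sketch_category cat, wchar_t *dst, size_t dstlen,
                   const char *src)
{
  const unsigned char *s = (const unsigned char *) src;
  size_t left = strlen (src);
  size_t out = 0;

  assert (dst != NULL && dstlen > 0);
  while (left > 0)
    {
      wchar_t wc;
      int used = cat_charset[cat].decode (&wc, s, left);

      if (used < 0 || out + 1 >= dstlen)
        return (size_t) -1;
      dst[out++] = wc;
      s += used;
      left -= (size_t) used;
    }
  dst[out] = L'\0';
  return out;
}
```

With such a table, wprintf would call category_mbstowcs (SK_NUMERIC,
...) for decimal_point and thousands_sep, and wcsftime would use
SK_TIME, but the conversion still happens on every call.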

Personally I'd prefer the second approach.  The requirement to convert
the strings at runtime is rather unfortunate.
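
For comparison, the second approach might be sketched like this.  The
struct layout and the names numeric_locale_sketch, widen_latin1 and
load_numeric_sketch are assumptions for illustration only, not the
actual newlib locale structs, and the loader only handles ISO-8859-1
input to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical cut-down LC_NUMERIC record carrying both forms.  */
struct numeric_locale_sketch
{
  const char *decimal_point;            /* used by printf */
  const char *thousands_sep;
  const wchar_t *wdecimal_point;        /* used by wprintf */
  const wchar_t *wthousands_sep;
};

/* Widen an ISO-8859-1 string into DST.  Returns 0 on success, -1 if
   DST is too small to hold the result plus terminator.  */
int
widen_latin1 (wchar_t *dst, size_t dstlen, const char *src)
{
  size_t i;

  for (i = 0; src[i] != '\0'; i++)
    {
      if (i + 1 >= dstlen)
        return -1;
      dst[i] = (wchar_t) (unsigned char) src[i];
    }
  if (i >= dstlen)
    return -1;
  dst[i] = L'\0';
  return 0;
}

/* Static storage for the widened strings of the current locale.  */
static wchar_t wdp_buf[8];
static wchar_t wts_buf[8];

/* Fill LOC once, at setlocale time.  The conversion is done here with
   the charset of the category, so no conversion is needed later.  */
int
load_numeric_sketch (struct numeric_locale_sketch *loc,
                     const char *dp, const char *ts)
{
  assert (loc != NULL && dp != NULL && ts != NULL);
  if (widen_latin1 (wdp_buf, 8, dp) < 0
      || widen_latin1 (wts_buf, 8, ts) < 0)
    return -1;
  loc->decimal_point = dp;
  loc->thousands_sep = ts;
  loc->wdecimal_point = wdp_buf;
  loc->wthousands_sep = wts_buf;
  return 0;
}
```

After loading, the multibyte functions read decimal_point and the
widechar functions read wdecimal_point directly, at the cost of some
extra memory per locale category.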

What do you think?

Btw., would it be ok to add more possible arguments to the nl_langinfo()
function, for internal use only?  This approach is used on BSD and
Linux, for instance, to access locale data for which no official POSIX
API exists.  The groundwork already exists in langinfo.h, it's just not
used so far.


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

Follow-Ups:
- Re: codeset problems in wprintf and wcsftime
  - From: Andy Koppe
- Re: codeset problems in wprintf and wcsftime
  - From: Jeff Johnston
