This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.


Re: Output of `locale -a` could be in mixed encodings?

From: Joseph Myers <joseph at codesourcery dot com>
To: Carlos O'Donell <carlos at redhat dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>, <libc-locales at sourceware dot org>
Date: Wed, 21 Jan 2015 02:18:10 +0000
Subject: Re: Output of `locale -a` could be in mixed encodings?
Authentication-results: sourceware.org; auth=none
References: <54BF0329 dot 5050604 at redhat dot com>

On Tue, 20 Jan 2015, Carlos O'Donell wrote:

> The problem then is that if you took that UTF8 converted name of
> `bokmål` and tried to call setlocale with that, it would fail.
> It fails because the name in UTF8 doesn't match the name in
> ISO-8859-1 that's stored as the alias or official locale name.

This could be a bug in setlocale.

POSIX says the locale name is a "character string", which is defined as a 
sequence of multibyte characters.  So arguably it should be interpreted in 
the current locale's character set (and so work if the LC_CTYPE before 
setlocale is that of a UTF-8 locale, fail if it's ASCII or ISO-8859-1).  
Except that the statement about being a character string is not CX-shaded, 
so should not be taken as intending any semantics beyond those in ISO C, 
and I don't see ISO C requiring any such thing.  (That said, I think 
interpreting the locale name in the current locale makes sense anyway, and 
is at least consistent with ISO C, even if not required.)
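
To make the mismatch concrete, here is a minimal C sketch. It assumes the
alias lookup is a plain byte-wise comparison; `name_matches_bytewise` is a
hypothetical stand-in for illustration, not glibc's actual lookup code. The
two arrays spell the same name, once in ISO-8859-1 and once in UTF-8.

```c
#include <string.h>

/* Hypothetical model of a byte-wise alias lookup; not glibc's real code. */
static int name_matches_bytewise(const char *requested, const char *stored)
{
    return strcmp(requested, stored) == 0;
}

/* The alias as stored in ISO-8859-1: U+00E5 is the single byte 0xE5. */
static const char bokmal_latin1[] = "bokm\xe5l";

/* The same name in UTF-8: U+00E5 becomes the two bytes 0xC3 0xA5. */
static const char bokmal_utf8[] = "bokm\xc3\xa5l";
```

Under that assumption, `name_matches_bytewise(bokmal_utf8, bokmal_latin1)`
returns 0, because the strings differ in both content and length (7 bytes
against 6). Interpreting the name in the current locale's character set
would mean transcoding one side before comparing.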

Now, we should also probably say that all non-ASCII locale names are 
deprecated (so this would just be a matter of adding a few more aliases 
for this locale using different encodings).  And then we could say that 
the locale utility doesn't output any non-ASCII locale names - as long as 
each locale has a valid ASCII name, I think that's conforming to POSIX.  
In fact, these aliases are already deprecated (locale.alias says "This 
file is obsolete ... Nobody should rely on the names defined here").
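
A rough sketch of what "doesn't output any non-ASCII locale names" could
look like, in C. Both `is_ascii_name` and `filter_ascii_names` are
hypothetical helpers written for this illustration; the locale utility is
not known to contain anything with these names.

```c
#include <stddef.h>

/* Returns 1 if every byte of NAME is 7-bit ASCII, 0 otherwise. */
static int is_ascii_name(const char *name)
{
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++)
        if (*p >= 0x80)
            return 0;
    return 1;
}

/* Copies the ASCII-only names from IN (N entries) into OUT, preserving
   order, and returns how many were kept.  OUT must have room for N
   pointers. */
static size_t filter_ascii_names(const char *const *in, size_t n,
                                 const char **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (is_ascii_name(in[i]))
            out[kept++] = in[i];
    return kept;
}
```

Given a list containing both `bokmal` and the ISO-8859-1 spelling with byte
0xE5, only the first survives the filter, so the output stays valid in any
ASCII-compatible character set.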

There is also an existing weak deprecation of non-UTF-8 locales.  Every 
locale with a non-UTF-8 character set is supposed to have a corresponding 
locale with a UTF-8 character set; if one doesn't, that's a bug, unless 
the locale should be deprecated for some other reason regardless of 
character set.  And the threshold for adding new non-UTF-8 locales should 
be higher than for adding new UTF-8 locales.

>  language | Norwegian, Bokm<E5>l

That part of the output, however, should clearly be output in the user's 
locale character set - not in the character set of the locale in question.
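
One way to do that conversion is with the standard `iconv` interface. The
C sketch below is illustrative only: `convert_field` is a hypothetical
helper, and nothing here is taken from the locale utility's source. It
converts a NUL-terminated string between two named character sets.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Converts the NUL-terminated string IN from FROM_CHARSET to TO_CHARSET,
   storing a NUL-terminated result in OUT (capacity OUTSZ bytes).
   Returns 0 on success, -1 on any failure (unknown charset, invalid
   input, or output buffer too small). */
static int convert_field(const char *from_charset, const char *to_charset,
                         const char *in, char *out, size_t outsz)
{
    if (outsz == 0)
        return -1;

    iconv_t cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;       /* iconv's prototype wants char ** */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;   /* reserve one byte for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

In a real tool, the target would presumably be `nl_langinfo(CODESET)` after
`setlocale(LC_CTYPE, "")`, and the source would be the character set of the
locale being described. For the example in this thread that means
converting the field from ISO-8859-1 to the user's UTF-8.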

-- 
Joseph S. Myers
joseph@codesourcery.com

References:
- Output of `locale -a` could be in mixed encodings?
  - From: Carlos O'Donell
