This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
Re: Output of `locale -a` could be in mixed encodings?
- From: Martin Sebor <msebor at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, libc-locales at sourceware dot org
- Date: Wed, 21 Jan 2015 09:19:31 -0700
- Subject: Re: Output of `locale -a` could be in mixed encodings?
- References: <54BF0329 dot 5050604 at redhat dot com>
On 01/20/2015 06:38 PM, Carlos O'Donell wrote:
I'm going to ramble a bit here because the problem is rambling.
The output of `locale -a` can't be easily grepped.
[carlos@athas intl]$ locale -a | grep bok
Binary file (standard input) matches
The names of various localizations are written in their respective
encodings, e.g. ISO-8859-1.
Thus the Bokmal name is output in ISO-8859-1 along with an ASCII
version. This makes it difficult to use grep to parse `locale -a`
output in anything but ISO-8859-1.
e.g.
[carlos@athas intl]$ export LANG=C
[carlos@athas intl]$ locale -a | grep bok
bokmal
bokmål
Alternatively, the GNU grep -a option works:
$ LC_ALL=$(locale -a | grep -a bokm | tail -n1) locale | grep -a LC_ALL
LC_ALL=bokmål
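The two workarounds above can be combined so the search does not depend on the caller's locale. This is only a minimal sketch: the helper name find_locale is made up for illustration, and -a is a GNU grep extension, so other grep implementations may not accept it.

```shell
# Hypothetical helper: search locale names byte-wise, whatever their encoding.
# LC_ALL=C makes grep treat the input as single bytes, and -a (GNU grep)
# stops it from printing "Binary file (standard input) matches".
find_locale() {
    LC_ALL=C grep -a -- "$1"
}

# Intended use:
#   locale -a | find_locale bok
```

The matching lines come out with their original bytes untouched, so they can still be passed to setlocale() or assigned to LC_ALL.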
A naive fix is for `locale` to examine the present locale and
use iconv to convert the names to the target locale. So for example
if the user is using en_US.UTF8 then the above would get converted
to:
I'm not sure if POSIX intends to allow that when the -a option
is used or whether what the implementation does in that case
is unspecified:
The application shall ensure that the LANG, LC_*, and [XSI]
NLSPATH environment variables specify the current locale
environment to be written out; they shall be used if the -a
option is not specified.
Perhaps that is because (as you noted) converting the string to
some other encoding would make it unusable as the name of the
same locale.
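To make the point concrete, here is a small sketch of what conversion does to the name. It assumes an iconv utility that knows the ISO-8859-1 and UTF-8 encodings; the variable names and the hex helper are only for illustration.

```shell
# The ISO-8859-1 spelling of the alias: byte 0xE5 (octal 345) is a-ring.
latin1_name=$(printf 'bokm\345l')

# The same name re-encoded as UTF-8, as a converting `locale -a` might print it.
utf8_name=$(printf '%s' "$latin1_name" | iconv -f ISO-8859-1 -t UTF-8)

# Hypothetical helper: dump a string as one run of hex digits.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

hex "$latin1_name"; echo    # 626f6b6de56c
hex "$utf8_name"; echo      # 626f6b6dc3a56c
```

The single byte e5 becomes the two bytes c3 a5, so the converted string is no longer the byte sequence under which the alias is known.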
In summary:
The output of `locale -a` could be in mixed encodings.
For a locale name to work with setlocale(), it must match what
`locale -a` prints exactly, byte for byte.
You can't easily use grep to process the output of `locale -a`.
We should stop using aliases that are anything but ASCII to avoid
future problems.
This seems like the same problem as with file names that
contain non-ASCII characters. The only robust solution is
to avoid using such characters. Both by the implementation
and by applications (e.g., in locale names created by users
via the localedef utility).
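A quick way to see which installed names would be affected is sketched below. It is a rough check, not a definitive tool: the helper name non_ascii_names is invented here, it relies on GNU grep's -a option, and it treats anything outside the printable ASCII range as a problem.

```shell
# Hypothetical helper: print only the lines that contain a byte outside
# printable ASCII. In the C locale the bracket range ' -~' covers the
# bytes 0x20 (space) through 0x7E (tilde).
non_ascii_names() {
    LC_ALL=C grep -a '[^ -~]'
}

# Intended use:
#   locale -a | non_ascii_names
```

No output means every name reported by `locale -a` is plain ASCII.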
Martin