This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
Output of `locale -a` could be in mixed encodings?
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: GNU C Library <libc-alpha at sourceware dot org>, libc-locales at sourceware dot org
- Date: Tue, 20 Jan 2015 20:38:49 -0500
- Subject: Output of `locale -a` could be in mixed encodings?
I'm going to ramble a bit here because the problem is rambling.
The output of `locale -a` can't be easily grepped.
[carlos@athas intl]$ locale -a | grep bok
Binary file (standard input) matches
The names of various localizations are written in their respective
encodings, e.g. ISO-8859-1.
Thus the Bokmål name is output in ISO-8859-1 along with an ASCII
version. This makes it difficult to use grep to parse `locale -a`
output in any locale whose encoding is not ISO-8859-1.
e.g.
[carlos@athas intl]$ export LANG=C
[carlos@athas intl]$ locale -a | grep bok
bokmal
bokmål
A naive fix is for `locale` to examine the present locale and
use iconv to convert the names to the target locale. So for example
if the user is using en_US.UTF8 then the above would get converted
to:
bokmal
bokmål
There is also one more ISO-8859-1 name in locale.alias with a
diacritic:
français
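To make the naive conversion concrete, here is a minimal sketch of what
turning an ISO-8859-1 name into UTF-8 involves. It does the mapping by
hand rather than calling iconv, which works only because every
ISO-8859-1 byte has the same value as its Unicode code point. The
function name latin1_to_utf8 is made up for this example; it is not
part of glibc or of `locale`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a glibc API.  Convert the NUL-terminated
   ISO-8859-1 string IN to UTF-8 in OUT, which has room for OUTSIZE
   bytes.  Return the number of bytes written, not counting the
   terminating NUL, or (size_t) -1 if OUT is too small.  */
size_t
latin1_to_utf8 (const char *in, char *out, size_t outsize)
{
  size_t n = 0;

  assert (in != NULL && out != NULL);
  for (const unsigned char *p = (const unsigned char *) in; *p != '\0'; p++)
    {
      if (*p < 0x80)
        {
          /* ASCII bytes are identical in both encodings.  */
          if (n + 1 >= outsize)
            return (size_t) -1;
          out[n++] = (char) *p;
        }
      else
        {
          /* U+0080..U+00FF become a two-byte UTF-8 sequence.  */
          if (n + 2 >= outsize)
            return (size_t) -1;
          out[n++] = (char) (0xC0 | (*p >> 6));
          out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
  if (n >= outsize)
    return (size_t) -1;
  out[n] = '\0';
  return n;
}
```

Passing the six bytes "bokm\xe5l" yields the seven bytes
"bokm\xc3\xa5l": the single byte 0xE5 becomes the pair 0xC3 0xA5, so
the converted name is no longer the same byte string.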
The problem then is that if you took that UTF8 converted name of
`bokmål` and tried to call setlocale with that, it would fail.
It fails because the name in UTF8 doesn't match the name in
ISO-8859-1 that's stored as the alias or official locale name.
That is to say that you could have two apparently identical source
files, one of which works (encoded in ISO-8859-1) and one of which
doesn't (encoded in UTF-8). This is because setlocale takes a
`const char *` as input for the name of the locale, so the bytes of
that string are what get matched.
e.g.
cat > setlocale_iso88591.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

/* Save this file as ISO-8859-1: the "å" must be the single byte 0xE5.  */
char bokmal_iso88591[] = "bokmål";

int
main (void)
{
  char *result;

  result = setlocale (LC_ALL, bokmal_iso88591);
  if (result == NULL)
    {
      perror ("setlocale");
      exit (1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF
cat > setlocale_utf8.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

/* Save this file as UTF-8: the "å" must be the two bytes 0xC3 0xA5.  */
char bokmal_utf8[] = "bokmål";

int
main (void)
{
  char *result;

  result = setlocale (LC_ALL, bokmal_utf8);
  if (result == NULL)
    {
      perror ("setlocale");
      exit (1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF
cat > build.sh <<EOF
gcc -Wall -pedantic -O0 -g3 -o setlocale_fail setlocale_utf8.c
gcc -Wall -pedantic -O0 -g3 -o setlocale_pass setlocale_iso88591.c
EOF
chmod u+x build.sh
./build.sh
[carlos@athas setlocale]$ ./setlocale_fail
setlocale: No such file or directory
[carlos@athas setlocale]$ ./setlocale_pass
setlocale() = bokmål
The literal bytes passed to setlocale for the name of the locale
must be in ISO-8859-1 in order to be identified as the
nb_NO.ISO-8859-1 locale that is eventually loaded.
This means that changing the output encoding from `locale -a`
would break programs trying to use that output to set a
locale.
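To see the mismatch directly, a small sketch like the following dumps
the bytes of a name as hex and compares two names byte for byte. Both
helper names (name_to_hex, same_bytes) are invented for this
illustration, and the plain strcmp is only a stand-in for "the bytes
must match"; it is not a description of how glibc actually looks up
aliases.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of NAME into OUT as
   space-separated two-digit hex values.  Return 0 on success or -1
   if OUT (OUTSIZE bytes) is too small.  */
int
name_to_hex (const char *name, char *out, size_t outsize)
{
  size_t len, used = 0;

  assert (name != NULL && out != NULL);
  if (outsize == 0)
    return -1;
  out[0] = '\0';
  len = strlen (name);
  for (size_t i = 0; i < len; i++)
    {
      int w = snprintf (out + used, outsize - used, "%s%02x",
                        i == 0 ? "" : " ",
                        (unsigned int) (unsigned char) name[i]);
      if (w < 0 || (size_t) w >= outsize - used)
        return -1;
      used += (size_t) w;
    }
  return 0;
}

/* Hypothetical helper: nonzero if A and B are the same byte string.  */
int
same_bytes (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
```

For the ISO-8859-1 name "bokm\xe5l" this gives `62 6f 6b 6d e5 6c`,
and for the UTF-8 name "bokm\xc3\xa5l" it gives
`62 6f 6b 6d c3 a5 6c`, so same_bytes reports that the two differ even
though both display as the same word in their own encodings.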
Using `locale -a -v` you can see that it's an ISO-8859-1 locale,
and surmise the name of the locale is encoded in ISO-8859-1, and
that you need to convert it to display it in UTF8 correctly.
e.g.
locale: bokm<E5>l archive: /usr/lib/locale/locale-archive
-------------------------------------------------------------------------------
title | Norwegian (Bokmal) locale for Norway
source | Norsk Standardiseringsforbund
address | University Library, Drammensveien 41, N-9242 Oslo, Norge
email | bug-glibc-locales@gnu.org
language | Norwegian, Bokm<E5>l
territory | Norway
revision | 1.0
date | 2000-06-29
codeset | ISO-8859-1
In summary:
The output of `locale -a` could be in mixed encodings.
For a locale name to work with setlocale(), it must be passed byte
for byte exactly as `locale -a` prints it.
You can't easily use grep to process the output of `locale -a`.
We should stop using aliases that are anything but ASCII to avoid
future problems.
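As a rough idea of how such a rule could be checked, the sketch below
tests whether a name consists solely of 7-bit ASCII bytes. The function
name is_ascii_name is hypothetical; nothing like it is claimed to exist
in glibc or its locale tooling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check, not part of glibc: return 1 if NAME contains
   only 7-bit ASCII bytes, 0 otherwise.  */
int
is_ascii_name (const char *name)
{
  assert (name != NULL);
  for (const unsigned char *p = (const unsigned char *) name; *p != '\0'; p++)
    if (*p > 0x7F)
      return 0;
  return 1;
}
```

With this, "bokmal" passes, while the ISO-8859-1 spellings of the
Bokmål and français aliases are rejected because each contains a byte
above 0x7F.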
Questions:
Can we make this any better?
Cheers,
Carlos.