This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.

Output of `locale -a` could be in mixed encodings?

From: "Carlos O'Donell" <carlos at redhat dot com>
To: GNU C Library <libc-alpha at sourceware dot org>, libc-locales at sourceware dot org
Date: Tue, 20 Jan 2015 20:38:49 -0500
Subject: Output of `locale -a` could be in mixed encodings?
Authentication-results: sourceware.org; auth=none

I'm going to ramble a bit here because the problem is rambling.

The output of `locale -a` can't be easily grepped.

[carlos@athas intl]$ locale -a | grep bok
Binary file (standard input) matches

The names of various localizations are written in their respective
encodings, e.g. ISO-8859-1.

Thus the Bokmal name is output in ISO-8859-1 along with an ASCII
version. This makes it difficult to use grep to parse `locale -a`
output in anything but ISO-8859-1.

e.g.
[carlos@athas intl]$ export LANG=C
[carlos@athas intl]$ locale -a | grep bok
bokmal
bokmål

A naive fix is for `locale` to examine the current locale and
use iconv to convert the names to that locale's encoding. So for
example if the user is using en_US.UTF8 then the above would get
converted to:

bokmal
bokmål

There is also one more ISO-8859-1 name in locale.alias with a
diacritic:

français

The problem then is that if you took that UTF8-converted name of
`bokmål` and tried to call setlocale with that, it would fail.
It fails because the name in UTF8 doesn't match the name in
ISO-8859-1 that's stored as the alias or official locale name.

That is to say that you could have two apparently identical source
files, one works (encoded in ISO-8859-1) and one doesn't (encoded
in UTF-8). This is because setlocale takes the name of the locale
as a `const char *` and matches it byte for byte, with no knowledge
of the encoding.

e.g.

cat >> setlocale_iso88591.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
/* Save this file as ISO-8859-1 so that å is the single byte 0xE5.  */
char bokmal_iso88591[] = "bokmål";
int
main (void)
{
  char *result;
  result = setlocale(LC_ALL, bokmal_iso88591);
  if (result == NULL)
    {
      perror ("setlocale");
      exit(1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF

cat >> setlocale_utf8.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
/* Save this file as UTF-8 so that å is the two bytes 0xC3 0xA5.  */
char bokmal_utf8[] = "bokmål";
int
main (void)
{
  char *result;
  result = setlocale(LC_ALL, bokmal_utf8);
  if (result == NULL)
    {
      perror ("setlocale");
      exit(1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF

cat >> build.sh <<EOF
gcc -Wall -pedantic -O0 -g3 -o setlocale_fail setlocale_utf8.c
gcc -Wall -pedantic -O0 -g3 -o setlocale_pass setlocale_iso88591.c
EOF

chmod u+x build.sh
./build.sh

[carlos@athas setlocale]$ ./setlocale_fail 
setlocale: No such file or directory
[carlos@athas setlocale]$ ./setlocale_pass 
setlocale() = bokmål

The literal bytes passed to setlocale for the name of the locale
must be in ISO-8859-1 in order to be identified as the
nb_NO.ISO-8859-1 locale that is eventually loaded.

This means that changing the output encoding from `locale -a`
would break programs trying to use that output to set a 
locale.

Using `locale -a -v` you can see that it's an ISO-8859-1 locale,
and surmise the name of the locale is encoded in ISO-8859-1, and
that you need to convert it to display it in UTF8 correctly.

e.g.
locale: bokm<E5>l          archive: /usr/lib/locale/locale-archive
-------------------------------------------------------------------------------
    title | Norwegian (Bokmal) locale for Norway
   source | Norsk Standardiseringsforbund
  address | University Library, Drammensveien 41, N-9242 Oslo, Norge
    email | bug-glibc-locales@gnu.org
 language | Norwegian, Bokm<E5>l
territory | Norway
 revision | 1.0
     date | 2000-06-29
  codeset | ISO-8859-1

In summary:

The output of `locale -a` could be in mixed encodings.

A locale name works with setlocale() only if it is passed exactly
as `locale -a` prints it, byte for byte.

You can't easily use grep to process the output of `locale -a`.

We should stop using aliases that are anything but ASCII to avoid
future problems.

Questions:

Can we make this any better?

Cheers,
Carlos.

Follow-Ups:
- Re: Output of `locale -a` could be in mixed encodings?
  - From: Paul Eggert
- Re: Output of `locale -a` could be in mixed encodings?
  - From: Joseph Myers
- Re: Output of `locale -a` could be in mixed encodings?
  - From: Martin Sebor
