This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



Re: Output of `locale -a` could be in mixed encodings?


On 01/20/2015 06:38 PM, Carlos O'Donell wrote:
I'm going to ramble a bit here because the problem is rambling.

The output of `locale -a` can't be easily grepped.

[carlos@athas intl]$ locale -a | grep bok
Binary file (standard input) matches

The names of various localizations are written in their respective
encodings, e.g. ISO-8859-1.

Thus the Bokmal name is output in ISO-8859-1 along with an ASCII
version. This makes it difficult to use grep to parse `locale -a`
output when the current locale uses anything but ISO-8859-1.

e.g.
[carlos@athas intl]$ export LANG=C
[carlos@athas intl]$ locale -a | grep bok
bokmal
bokmïl

Alternatively, the GNU grep -a option works:

$ LC_ALL=$(locale -a | grep -a bokm | tail -n1) locale | grep -a LC_ALL
LC_ALL=bokmïl
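
The "Binary file" message appears because grep sees bytes that are not
valid in the current encoding. A minimal sketch of a byte-wise filter
follows, assuming GNU grep (for the -a option); the function name
match_locale_names is hypothetical and not part of any existing tool.

```shell
# Hypothetical helper: filter locale names on stdin by a pattern,
# matching bytes rather than characters.
# LC_ALL=C makes grep work byte by byte regardless of the caller's locale;
# -a (a GNU grep option) makes grep treat the input as text even when it
# contains bytes that are invalid in the current encoding.
match_locale_names() {
    LC_ALL=C grep -a -e "$1"
}
```

Typical use would be `locale -a | match_locale_names bokm`. The names
come out with their original bytes, so they may still display oddly in
a terminal that expects a different encoding.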


A naive fix is for `locale` to examine the present locale and
use iconv to convert the names to the target locale. So for example
if the user is using en_US.UTF8 then the above would get converted
to:

I'm not sure if POSIX intends to allow that when the -a option
is used or whether what the implementation does in that case
is unspecified:

  The application shall ensure that the LANG, LC_* , and [XSI]
  [Option Start] NLSPATH [Option End]  environment variables
  specify the current locale environment to be written out;
  they shall be used if the -a option is not specified.

Perhaps that is because (as you noted) converting the string to
some other encoding would make it unusable as the name of the
same locale.

In summary:

The output of `locale -a` could be in mixed encodings.

The locale name passed to setlocale() must match what `locale -a`
prints, byte for byte, or it will not work.

You can't easily use grep to process the output of `locale -a`.

We should stop using aliases that are anything but ASCII to avoid
future problems.
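
The byte-for-byte requirement in the summary above can be checked with
a rough sketch like the one below. It assumes a `locale` utility that,
like the glibc one, writes a diagnostic to stderr when the requested
locale cannot be set, and a shell whose `read -r` passes bytes through
unchanged. Both helper names are hypothetical.

```shell
# Hypothetical helper: succeed if the C library accepts NAME as a locale.
# Assumption: `locale` prints a diagnostic on stderr when the name in
# LC_ALL cannot be set (glibc does this) and prints nothing otherwise.
locale_name_works() {
    err=$(LC_ALL="$1" locale 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Read names from stdin one line at a time without re-encoding them, so
# the bytes placed in LC_ALL are the bytes that were printed.
check_all_names() {
    while IFS= read -r name; do
        if locale_name_works "$name"; then
            printf 'ok   %s\n' "$name"
        else
            printf 'FAIL %s\n' "$name"
        fi
    done
}
```

Typical use would be `locale -a | check_all_names`. Passing the output
through iconv first would change the bytes of the non-ASCII names, and
under the assumptions above those names would be expected to fail.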

This seems like the same problem as with file names that
contain non-ASCII characters. The only robust solution is
to avoid such characters, both in the implementation and in
applications (e.g., in locale names created by users via the
localedef utility).
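
To see which installed names would be affected, something like the
following sketch could be used. It assumes GNU grep (for the -a option),
and the function name non_ascii_locale_names is hypothetical.

```shell
# Hypothetical helper: print the names on stdin that contain any byte
# outside printable ASCII (0x20 through 0x7E).
# LC_ALL=C keeps the bracket range byte-based; -a (a GNU grep option)
# stops grep from treating the input as a binary file; -v selects the
# lines that are NOT made up entirely of printable ASCII.
non_ascii_locale_names() {
    LC_ALL=C grep -av '^[ -~]*$'
}
```

Typical use would be `locale -a | non_ascii_locale_names`. An empty
result would mean every installed name is plain ASCII.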

Martin

