This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.

Re: locale encodings

From: "Carlos O'Donell" <carlos at redhat dot com>
To: Steven Abner <pheonix at zoomtown dot com>, libc-locales at sourceware dot org
Date: Mon, 11 Nov 2013 00:19:32 -0500
Subject: Re: locale encodings
Authentication-results: sourceware.org; auth=none
References: <31AACAB8-A716-47CC-B755-F33DD77BA51E at zoomtown dot com>

On 11/10/2013 07:03 PM, Steven Abner wrote:
> Hi, Can you tell me what file format "cs_CZ", "sk_SK", "sv_SE" and
> "wo_SN" are encoded in? I was going to try to fix it for my use, but
> can't open in a normal editor. I was doing a design test when these
> files tripped a non-POSIX portable character set code in my scanf()'s
> isspace(). I think they might be ISO8859-2 but not sure. Normal
> editor claims it can't be open in UTF-8. I'd rather not second guess
> someone else's work, if I can. If it is  ISO8859-2, I'll just
> decode/encode me a UTF file to examine. Two other files have UTF8
> encodings, which is no problem. Others do but weren't within scope
> of the trap (comment character to first word after). I am only trying
> to verify the file parser is picking up exact data, and hopefully not
> being corrupted by unusual codes, as some have been.

No idea, and I've been around the project for a long time.

Some of these files are quite historical, and we didn't have all
of the tools we do today. The goal today is that everything
should be UTF-8, but these files are not.

I tried chardet and it says MacCyrillic:
Python 2.7.5 (default, Oct  8 2013, 12:19:40) 
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> rawdata = open ("localedata/locales/cs_CZ", "r").read()
>>> result = chardet.detect(rawdata)
>>> charenc = result['encoding']
>>> print result
{'confidence': 0.7721607087786949, 'encoding': 'MacCyrillic'}
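
Independent of any detector's guess, it is possible to check directly
which lines of a file are not valid UTF-8. The following is only a
minimal sketch in Python 3; the function name find_non_utf8_lines is
made up for illustration and is not part of glibc or chardet.

```python
def find_non_utf8_lines(data):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8.

    `data` is the raw file content as bytes. Splitting on b"\\n" is safe
    because the byte 0x0A never occurs inside a UTF-8 multi-byte sequence.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Hypothetical usage against the file discussed in this thread:
#
#     with open("localedata/locales/cs_CZ", "rb") as handle:
#         for number, line in find_non_utf8_lines(handle.read()):
#             print(number, repr(line))
```

Printing the repr of each offending line shows the raw byte values,
which helps when comparing them against candidate code pages by hand.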

It would be great to have these properly encoded into UTF-8.

I would accept patches to do so unless someone says it *can't*
be encoded in UTF-8 (which I would find very odd).
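
As a rough sketch of what such a conversion might look like, the
Python 3 snippet below re-encodes raw bytes to UTF-8. It assumes the
source encoding is ISO-8859-2, which is only Steven's guess and is not
confirmed anywhere in this thread; the function name
transcode_to_utf8 is likewise hypothetical.

```python
def transcode_to_utf8(data, source_encoding="iso8859_2"):
    """Re-encode `data` (bytes) from `source_encoding` to UTF-8 bytes.

    The default source encoding is an ASSUMPTION, not a verified fact
    about any glibc locale file.
    """
    text = data.decode(source_encoding)
    # Sanity check: encoding the text back must reproduce the input bytes.
    if text.encode(source_encoding) != data:
        raise ValueError("round trip through %s changed the data" % source_encoding)
    return text.encode("utf-8")


# Hypothetical usage; writes to a new file and leaves the original alone:
#
#     with open("localedata/locales/cs_CZ", "rb") as src:
#         converted = transcode_to_utf8(src.read())
#     with open("cs_CZ.utf8", "wb") as dst:
#         dst.write(converted)
```

Note that Python's iso8859_2 codec maps every byte value to some
character, so a successful decode does not prove the guess is right.
The converted text still needs to be read by someone who knows the
language before it goes into a patch.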

Cheers,
Carlos.

References:
- locale encodings
  - From: Steven Abner
