This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
Re: locale encodings
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Steven Abner <pheonix at zoomtown dot com>, libc-locales at sourceware dot org
- Date: Mon, 11 Nov 2013 00:19:32 -0500
- Subject: Re: locale encodings
- Authentication-results: sourceware.org; auth=none
- References: <31AACAB8-A716-47CC-B755-F33DD77BA51E at zoomtown dot com>
On 11/10/2013 07:03 PM, Steven Abner wrote:
> Hi, Can you tell me what file format "cs_CZ", "sk_SK", "sv_SE" and
> "wo_SN" are encoded in? I was going to try to fix it for my use, but
> can't open in a normal editor. I was doing a design test when these
> files tripped a non-POSIX portable character set code in my scanf()'s
> isspace(). I think they might be ISO8859-2 but not sure. Normal
> editor claims it can't be open in UTF-8. I'd rather not second guess
> someone else's work, if I can. If it is ISO8859-2, I'll just
> decode/encode me a UTF file to examine. Two other files have UTF8
> encodings, which is no problem. Others do but weren't within scope
> of the trap (comment character to first word after). I am only trying
> to verify the file parser is picking up exact data, and hopefully not
> being corrupted by unusual codes, as some have been.
No idea, and I've been around the project for a long time.
Some of these files are quite historical, and we didn't have all
of the tools we do today. The goal today is that everything
should be UTF-8, but not all of the files are there yet.
I tried chardet and it says MacCyrillic:
Python 2.7.5 (default, Oct 8 2013, 12:19:40)
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> rawdata = open ("localedata/locales/cs_CZ", "r").read()
>>> result = chardet.detect(rawdata)
>>> charenc = result['encoding']
>>> print result
{'confidence': 0.7721607087786949, 'encoding': 'MacCyrillic'}
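A guess like MacCyrillic at 0.77 confidence is worth cross-checking by hand. The following is a minimal Python 3 sketch, standard library only, that checks whether a byte string is valid UTF-8 and shows which non-ASCII characters it would contain under a few candidate encodings. The helper names and the candidate list (ISO-8859-2 being Steven's guess, ISO-8859-1 added for comparison) are illustrative assumptions, not anything taken from the glibc tree:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_candidates(data: bytes, candidates=("iso8859-2", "iso8859-1")):
    """Map each candidate encoding that decodes without error to the
    sorted, de-duplicated non-ASCII characters it produces.

    The candidate list is an assumption; extend it as needed.
    """
    results = {}
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes at all
        results[enc] = "".join(sorted({ch for ch in text if ord(ch) > 127}))
    return results


if __name__ == "__main__":
    # Hypothetical sample standing in for a line from a locale file.
    sample = "Česká republika".encode("iso8859-2")
    print(is_valid_utf8(sample))
    for enc, chars in decode_candidates(sample).items():
        print(enc, chars)
```

For a real file, read it with open(path, "rb").read() and pass the bytes in. Single-byte encodings such as these accept almost any input, so a clean decode proves nothing by itself; the useful signal is whether the characters that come out are plausible for the language in question.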
It would be great to have these properly encoded into UTF-8.
I would accept patches to do so unless someone says it *can't*
be encoded in UTF-8 (which I would find very odd).
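As a rough sketch of what such a conversion could look like once the source encoding is known, here is a small Python 3 helper that transcodes bytes to UTF-8 and verifies the result round-trips back to the original bytes. The function name is hypothetical, and the encoding passed in is an assumption the caller has to supply; ISO-8859-2 in the example is only Steven's unconfirmed guess:

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode `data` from `source_encoding` to UTF-8.

    Raises UnicodeDecodeError if the bytes do not fit the assumed
    source encoding, and ValueError if the round trip is lossy.
    """
    text = data.decode(source_encoding)
    utf8 = text.encode("utf-8")
    # Safety check: converting back must reproduce the input exactly.
    if utf8.decode("utf-8").encode(source_encoding) != data:
        raise ValueError("round trip through %s is not lossless" % source_encoding)
    return utf8


if __name__ == "__main__":
    # Hypothetical sample; for a real file, read it in binary mode
    # with open(path, "rb") and write the result with open(path, "wb").
    legacy = "Česká republika".encode("iso8859-2")
    print(transcode_to_utf8(legacy, "iso8859-2").decode("utf-8"))
```

The round-trip check only guards against lossy conversion. It cannot tell whether the assumed encoding was the right one, so the converted text still needs to be read by someone who knows the language.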
Cheers,
Carlos.