This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Mike FABIAN <mfabian at redhat dot com>
- To: keld at keldix dot com
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Pravin Satpute <psatpute at redhat dot com>, Jens Petersen <petersen at redhat dot com>
- Date: Tue, 27 Oct 2015 13:22:34 +0100
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <20151021174936 dot GA26317 at vapier dot lan> <5627DAAE dot 8060703 at redhat dot com> <20151021205540 dot GA30739 at www5 dot open-std dot org>
keld@keldix.com wrote:
> On Wed, Oct 21, 2015 at 02:34:22PM -0400, Carlos O'Donell wrote:
>> On 10/21/2015 01:49 PM, Mike Frysinger wrote:
>>
>> Looks good to me.
>>
>> Fedora support is here:
>> http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?id=0457f649e3fe6299efe384da13dfc923bbe65707
>>
>> Patch for C.UTF-8:
>> https://lists.fedoraproject.org/pipermail/glibc/2015-September/000081.html
>>
>> The patch lists some of the differences between Debian and Fedora.
>>
>> We are testing C.UTF-8 in rawhide and I expect Mike FABIAN wants to
>> submit this upstream at some point to become the official C.UTF-8
>> locale, but we'd also like to harmonize with the distros if there
>> is anything we aren't doing right.
>
> What is the intended difference between this locale and the i18n locale
> of ISO TR 30112?
Do you mean the difference between C.UTF-8 and the "i18n.UTF-8" locale
defined in glibc/localedata/locales/i18n?
(i.e. an i18n.UTF-8 locale created with
localedef --no-archive -ci i18n -f UTF-8 /usr/lib/locale/i18n.utf8
)
"C.UTF-8" tries to be the same as C/POSIX wherever possible;
it only switches to UTF-8 encoding and extends the supported
character range to all of Unicode. "i18n.UTF-8" differs from
C/POSIX in more ways than that.
Differences in detail are:
LC_CTYPE
almost the same
- C.UTF-8 just copies the LC_CTYPE from "i18n" (which is kept
in sync with the latest Unicode release using some scripts) and
adds "translit_combining".
LC_COLLATE
- C.UTF-8 sorts via Unicode code point order.
For the ASCII range that is the same order as the C/POSIX locale
so this gives the traditional sorting for the ASCII range.
- i18n sorts according to ISO 14651, which is the default Unicode
collation order. That happens to be the same in the ASCII range.
Locales like en_GB and en_US which just copy the ISO/IEC 14651
template
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
sort lower case letters before upper case letters. But
i18n.UTF-8 does some extra stuff before copying the template
which fixes this. So i18n.UTF-8 sorts the same way in the
ASCII range.
Do we care how a C.UTF-8 locale sorts outside of the ASCII range?
If we do not care much, Unicode code point order is an easy
way to get a consistent order. On the other hand this order
is sometimes not really useful. Would it be better to sort
according to the default Unicode collation order for characters
outside of the ASCII range?
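To illustrate what "Unicode code point order" means in practice, here is a minimal Python sketch. It does not call into glibc at all; it only relies on the fact that comparing UTF-8 byte strings (which is what strcmp() does) gives the same order as comparing code points. The word list is made up for illustration.

```python
# Sketch: code point order is the same as UTF-8 byte order,
# so a C.UTF-8 style collation behaves like a plain strcmp().
# The sample words are hypothetical and chosen to mix ASCII and non-ASCII.
words = ["zebra", "Apple", "apple", "Zürich", "éclair", "日本", "Zebra"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Comparing the encoded bytes is what strcmp() would see.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
```

Within ASCII this puts upper case before lower case, as in C/POSIX. Outside ASCII the result is consistent but not linguistically meaningful (for example "éclair" lands after "zebra"), which is the trade-off raised in the question above.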
LC_MONETARY
- C.UTF-8 tries to agree with C/POSIX as much as possible
and thus uses "USD" for int_curr_symbol, "$" for currency_symbol,
and "." for mon_decimal_point.
- i18n.UTF-8 uses "XDR" for int_curr_symbol
(special drawing right issued by the International Monetary Fund,
see https://en.wikipedia.org/wiki/ISO_4217), "¤" for currency_symbol,
and "," for mon_decimal_point
LC_NUMERIC
- C.UTF-8 uses "." for decimal_point (like C/POSIX)
- i18n.UTF-8 uses "," for decimal_point
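If you want to check the decimal_point on your own system, something like the following Python sketch should do. It uses the standard locale module; the helper name decimal_point_for is mine, and whether "C.UTF-8" or "i18n.UTF-8" can be selected depends on which locales are installed, so the helper returns None when a locale is missing.

```python
import locale


def decimal_point_for(locale_name):
    """Return decimal_point for locale_name, or None if it is not installed.

    decimal_point_for is a hypothetical helper, not part of any library.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    try:
        return locale.localeconv()["decimal_point"]
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore


for name in ("C", "C.UTF-8", "i18n.UTF-8"):
    print(name, "->", decimal_point_for(name))
```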
LC_TIME
- C.UTF-8 uses the same as C/POSIX
- i18n.UTF-8 uses something "more international, less English";
for example, the month and day names are just numbers in
i18n.UTF-8:
$ LC_ALL=C.UTF-8 date
Tue Oct 27 10:28:44 CET 2015
$ LC_ALL=i18n.UTF-8 date
3 10 27 10:29:02 CET 2015
$
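The same comparison can be made from a program. This is a rough Python sketch using strftime() under different LC_TIME settings; format_date is a hypothetical helper, and the non-C locales only produce output if they are installed (otherwise the helper returns None).

```python
import datetime
import locale


def format_date(locale_name, when):
    """Format `when` under locale_name's LC_TIME; None if unavailable.

    format_date is a hypothetical helper written for this example.
    """
    saved = locale.setlocale(locale.LC_TIME)  # query current setting
    try:
        locale.setlocale(locale.LC_TIME, locale_name)
    except locale.Error:
        return None
    try:
        # %a and %b are the abbreviated day and month names.
        return when.strftime("%a %b %d %H:%M:%S %Y")
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore


when = datetime.datetime(2015, 10, 27, 10, 28, 44)
for name in ("C", "C.UTF-8", "i18n.UTF-8"):
    print(name, "->", format_date(name, when))
```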
LC_MESSAGES
- C.UTF-8 uses the same as C/POSIX
(for example yesexpr "^[yY]" and noexpr "^[nN]")
- i18n.UTF-8 apparently tries to avoid English
(for example yesexpr "^[+1]" and noexpr "^[-0]")
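To show what this means for a program asking a yes/no question, here is a small Python sketch. It just hard-codes the four expressions quoted above and matches them with Python's re module instead of asking glibc, so treat it as an approximation; the classify helper and the table name are made up for this example.

```python
import re

# The expressions quoted above, keyed by locale name.
EXPRS = {
    "C.UTF-8": {"yesexpr": r"^[yY]", "noexpr": r"^[nN]"},
    "i18n.UTF-8": {"yesexpr": r"^[+1]", "noexpr": r"^[-0]"},
}


def classify(answer, locale_name):
    """Return "yes", "no" or "unknown" for a user's answer (hypothetical helper)."""
    exprs = EXPRS[locale_name]
    if re.search(exprs["yesexpr"], answer):
        return "yes"
    if re.search(exprs["noexpr"], answer):
        return "no"
    return "unknown"


for answer in ("yes", "No", "1", "-"):
    print(answer, classify(answer, "C.UTF-8"), classify(answer, "i18n.UTF-8"))
```

So a user typing "yes" is understood under C.UTF-8 but not under i18n.UTF-8, and the reverse holds for "1".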
LC_PAPER
No difference between C.UTF-8 and i18n.UTF-8, both use
A4 paper, just like C/POSIX.
LC_NAME
No difference between C.UTF-8, i18n.UTF-8, and C/POSIX
LC_ADDRESS
No difference between C.UTF-8, i18n.UTF-8, and C/POSIX
LC_TELEPHONE
C.UTF-8 has tel_int_fmt "+%c %a %l" (same as C/POSIX)
i18n.UTF-8 has tel_int_fmt "+%c +a +l" <-- that looks like a bug, doesn't it?
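Why the second format looks like a bug becomes obvious when the escapes are expanded. The Python sketch below is my own toy expander, not glibc code; it assumes %c is the country code, %a the area code and %l the local number (as I understand locale(5)), and ignores the other escapes. The phone number is invented.

```python
def expand_tel_fmt(fmt, country, area, local):
    """Expand %c, %a and %l in a tel_int_fmt style string.

    Hypothetical helper: only these three escapes are handled,
    everything else is copied through unchanged.
    """
    fields = {"c": country, "a": area, "l": local}
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in fields:
            out.append(fields[fmt[i + 1]])
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)


print(expand_tel_fmt("+%c %a %l", "49", "89", "1234567"))
print(expand_tel_fmt("+%c +a +l", "49", "89", "1234567"))
```

With the i18n.UTF-8 string, "+a" and "+l" are not escapes at all, so they come out as literal text rather than as the area code and local number.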
LC_MEASUREMENT
No difference between C.UTF-8, i18n.UTF-8, and C/POSIX,
all use metric measurement.
--
Mike FABIAN <mfabian@redhat.com>