This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Should glibc provide a builtin C.UTF-8 locale?

From: Mike FABIAN <mfabian at redhat dot com>
To: keld at keldix dot com
Cc: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, Pravin Satpute <psatpute at redhat dot com>, Jens Petersen <petersen at redhat dot com>
Date: Tue, 27 Oct 2015 13:22:34 +0100
Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
Authentication-results: sourceware.org; auth=none
References: <54DB8243 dot 3050903 at redhat dot com> <20151021174936 dot GA26317 at vapier dot lan> <5627DAAE dot 8060703 at redhat dot com> <20151021205540 dot GA30739 at www5 dot open-std dot org>

keld@keldix.com wrote:

> On Wed, Oct 21, 2015 at 02:34:22PM -0400, Carlos O'Donell wrote:
>> On 10/21/2015 01:49 PM, Mike Frysinger wrote:
>>
>> Looks good to me.
>> 
>> Fedora support is here:
>> http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?id=0457f649e3fe6299efe384da13dfc923bbe65707
>> 
>> Patch for C.UTF-8:
>> https://lists.fedoraproject.org/pipermail/glibc/2015-September/000081.html
>> 
>> The patch lists some of the differences between Debian and Fedora.
>> 
>> We are testing C.UTF-8 in rawhide and I expect Mike FABIAN wants to
>> submit this upstream at some point to become the official C.UTF-8
>> locale, but we'd also like to harmonize with the distros if there
>> is anything we aren't doing right.
>
> What is the intended difference between this locale and the i18n locale
> of ISO TR 30112?

Do you mean the difference between C.UTF-8 and the "i18n.UTF-8" locale
defined in glibc/localedata/locales/i18n?

(i.e. an i18n.UTF-8 locale created with

    localedef --no-archive -ci i18n -f UTF-8 /usr/lib/locale/i18n.utf8
)

"C.UTF-8" tries to be the same as C/POSIX wherever possible;
it differs only in using the UTF-8 encoding and in extending the
supported character range to all of Unicode. "i18n.UTF-8" differs
from C/POSIX in more ways than that.

Differences in detail are:

LC_CTYPE
   almost the same
   - C.UTF-8 just copies the LC_CTYPE from "i18n" (which is kept
     in sync with the latest Unicode release using some scripts) and
     adds "translit_combining".

LC_COLLATE
   - C.UTF-8 sorts via Unicode code point order.
     For the ASCII range that is the same order as the C/POSIX locale
     so this gives the traditional sorting for the ASCII range.
   - i18n sorts according to ISO/IEC 14651, which is the default Unicode
     collation order. That happens to be the same in the ASCII range.
     Locales like en_GB and en_US, which just copy the ISO/IEC 14651
     template
     
        LC_COLLATE
        % Copy the template from ISO/IEC 14651
        copy "iso14651_t1"
        END LC_COLLATE

     sort lower case letters before upper case letters. But
     i18n.UTF-8 does some extra stuff before copying the template
     which fixes this. So i18n.UTF-8 sorts the same way in the
     ASCII range.
     
     Do we care how a C.UTF-8 locale sorts outside of the ASCII range?
     If we do not care much, Unicode code point order is an easy
     way to get a consistent order. On the other hand this order
     is sometimes not really useful. Would it be better to sort
     according to the default Unicode collation order for characters
     outside of the ASCII range?
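
     To make the trade-off concrete, here is a minimal Python sketch of
     code point ordering. It assumes that Python's built-in comparison of
     str values (which compares code point by code point) is a fair
     stand-in for what the proposed LC_COLLATE would do; it does not call
     into glibc at all, and the word list is made up for illustration.

```python
# Sketch only: Python compares str values code point by code point,
# which mimics (but does not invoke) the proposed C.UTF-8 collation.
words = ["zebra", "Apple", "apple", "Zebra", "\u00e9clair", "\u00c9clair"]

by_code_point = sorted(words)

for w in by_code_point:
    # Show the code point of the first character next to each word.
    # The !a conversion keeps the output ASCII-only on any terminal.
    print(f"U+{ord(w[0]):04X}  {w!a}")
```

     All upper case ASCII letters sort before all lower case ones, as in
     C/POSIX, and both accented words land after "zebra". The second
     effect is the "not really useful" part of pure code point order.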

LC_MONETARY
   - C.UTF-8 tries to agree with C/POSIX as much as possible
     and thus uses "USD" for int_curr_symbol, "$" for currency_symbol,
     and "." for mon_decimal_point.
   - i18n.UTF-8 uses "XDR" for int_curr_symbol
     (special drawing right issued by the International Monetary Fund,
     see https://en.wikipedia.org/wiki/ISO_4217), "¤" for currency_symbol,
     and "," for mon_decimal_point

LC_NUMERIC
   - C.UTF-8 uses "." for decimal_point (like C/POSIX)
   - i18n.UTF-8 uses "," for decimal_point
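
   A quick way to check the decimal point on a given system is to ask
   localeconv(). The following is a rough Python sketch, not part of any
   patch: the helper name decimal_point_of is made up, and whether
   "C.UTF-8" or "i18n.UTF-8" can be selected depends on what is installed,
   so the helper returns None for locales it cannot set.

```python
import locale


def decimal_point_of(name):
    """Return LC_NUMERIC's decimal_point for locale `name`, or None if unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return None  # locale is not installed on this system
    try:
        return locale.localeconv()["decimal_point"]
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


if __name__ == "__main__":
    for name in ("C", "C.UTF-8", "i18n.UTF-8"):
        print(name, repr(decimal_point_of(name)))
```

   The same approach works for the LC_MONETARY fields by reading other
   keys of the dictionary returned by localeconv().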

LC_TIME
   - C.UTF-8 uses the same as C/POSIX
   - i18n.UTF-8 uses something "more international, less English"
     for example the month and day names are just numbers in
     i18n.UTF-8:
        $ LC_ALL=C.UTF-8 date
        Tue Oct 27 10:28:44 CET 2015
        $ LC_ALL=i18n.UTF-8 date
        3 10 27 10:29:02 CET 2015
        $ 
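
   The same comparison can be scripted. This is a sketch under a few
   assumptions: render_date is a hypothetical helper, the format string
   only approximates the default output of date (it omits the time zone),
   and a locale that is not installed simply yields None.

```python
import locale
import time

# 2015-10-27 10:28:44, a Tuesday (tm_wday=1), day 300 of the year, no DST.
STAMP = (2015, 10, 27, 10, 28, 44, 1, 300, 0)


def render_date(name, fmt="%a %b %d %H:%M:%S %Y"):
    """Format STAMP using the LC_TIME data of locale `name`, or return None."""
    saved = locale.setlocale(locale.LC_TIME)  # query only
    try:
        locale.setlocale(locale.LC_TIME, name)
    except locale.Error:
        return None  # locale is not installed on this system
    try:
        return time.strftime(fmt, STAMP)
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the previous setting


if __name__ == "__main__":
    for name in ("C", "C.UTF-8", "i18n.UTF-8"):
        print(name, render_date(name))
```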

LC_MESSAGES
   - C.UTF-8 uses the same as C/POSIX
     (for example yesexpr "^[yY]" and noexpr "^[nN]")
   - i18n.UTF-8 apparently tries to avoid English
     (for example yesexpr  "^[+1]" and noexpr "^[-0]")
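
   To illustrate what that means for a program that asks a yes/no
   question, here is a small Python sketch. It hard-codes the four
   expressions quoted above instead of reading them from the locale, and
   it assumes Python's re module treats these simple bracket expressions
   the same way a POSIX extended regex would. The classify function is
   hypothetical.

```python
import re

# Expressions as quoted above, copied by hand (not read from the locale).
YESEXPR = {"C.UTF-8": r"^[yY]", "i18n.UTF-8": r"^[+1]"}
NOEXPR = {"C.UTF-8": r"^[nN]", "i18n.UTF-8": r"^[-0]"}


def classify(answer, locale_name):
    """Return 'yes', 'no', or 'unknown' for a user's answer."""
    if re.search(YESEXPR[locale_name], answer):
        return "yes"
    if re.search(NOEXPR[locale_name], answer):
        return "no"
    return "unknown"


if __name__ == "__main__":
    for name in ("C.UTF-8", "i18n.UTF-8"):
        for answer in ("yes", "No", "+", "0"):
            print(name, repr(answer), classify(answer, name))
```

   With the i18n expressions an answer of "yes" is not recognized at all,
   while "+" and "1" are.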

LC_PAPER
   No difference between C.UTF-8 and i18n.UTF-8, both use
   A4 paper, just like C/POSIX.

LC_NAME
   No difference between C.UTF-8, i18n.UTF-8, and C/POSIX

LC_ADDRESS
   No difference between C.UTF-8, i18n.UTF-8, and C/POSIX

LC_TELEPHONE
   C.UTF-8 has tel_int_fmt "+%c %a %l" (same as C/POSIX)
   i18n.UTF-8 has tel_int_fmt "+%c +a +l" <-- that looks like a bug, doesn't it?
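
   Why this looks like a bug is easy to see when the format is expanded.
   The Python sketch below is only an illustration: expand_tel_fmt is a
   made-up helper, the phone number is invented, and it handles only the
   three escapes that appear here (%c country code, %a area code, %l local
   number), not the full set the locale format defines.

```python
import re


def expand_tel_fmt(fmt, country, area, local):
    """Expand %c, %a and %l in a tel_int_fmt-style string; copy the rest literally."""
    fields = {"c": country, "a": area, "l": local}
    return re.sub(r"%([cal])", lambda m: fields[m.group(1)], fmt)


if __name__ == "__main__":
    # Invented example number: country 49, area 89, local 1234567.
    print(expand_tel_fmt("+%c %a %l", "49", "89", "1234567"))
    print(expand_tel_fmt("+%c +a +l", "49", "89", "1234567"))
```

   With the C.UTF-8 format all three fields are substituted. With the
   i18n format only %c is, and "+a +l" ends up in the output as literal
   text.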

LC_MEASUREMENT
   No difference between C.UTF-8, i18n.UTF-8, and C/POSIX,
   all use metric measurement.

-- 
Mike FABIAN <mfabian@redhat.com>

Follow-Ups:
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: keld
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Paul Eggert
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Florian Weimer
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Rich Felker

References:
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Mike Frysinger
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: keld
