This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: keld at keldix dot com
- Cc: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 21:39:23 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211235304 dot GA20330 at www5 dot open-std dot org>
On Thu, Feb 12, 2015 at 12:53:04AM +0100, keld@keldix.com wrote:
> On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> > Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> > This locale would have the same rules as the C locale when set for
> > LC_ALL.
> >
> > The locale would provide a sensible fallback for developers who need
> > UTF-8 but, until C.UTF-8 is provided, cannot rely on it.
> >
> > My best guess is that it will take ~1.5MB of data to include the
> > UTF-8 locale in the runtime. If you do it right this is shared
> > for all processes, and gives you, in this the 21st century, a fallback
> > that is sensible for all developers of all languages.
> >
> > We have had on-and-off requests for this for years as UTF-8 has become
> > the de facto standard.
> >
> > The most recent request is from the Python 3 folks who want to be able
> > to assume there is some kind of UTF-8 support in the system regardless
> > of the installed locales.
> >
> > Is this the right way forward? Or should we tell the distributions
> > that it is their responsibility to ship and always provide a C.UTF-8?
>
> I think it is a good way forward. It should probably be the "i18n" locale
> of ISO 30112 that is the base; the "i18n" locale is built directly on glibc data.
>
> A lot of optimisation could be done on the data with two-level or more tables,
> giving special data where the data is not well-formed for algorithmic
> handling, noting that some case mappings are not suited for algorithmic handling.
> Some property tables are well suited for combined bitmap handling and index handling.
> Collating tables could possibly also be optimized by multilevel tables.
>
> Also a pet idea of mine is to have compressed locales - that could significantly reduce
> the disk footprint of a more complete locale database. Also good for message catalogues.
This sounds like a bad tradeoff unless you can use the compressed data
efficiently in-place. Disk space is cheap; requiring a decompressed
copy in memory per-process rather than using a shared mapping is
expensive.
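
A minimal sketch of the shared-mapping side of that tradeoff, assuming a POSIX system. The helper name map_readonly is hypothetical and this is not glibc's locale loader; it only illustrates that a read-only file mapping is backed by the page cache, so every process mapping the same file shares the same physical pages. A loader that has to decompress first would instead fill private anonymous memory, and each process would pay for its own copy.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a data file read-only and return its address.
   The pages come from the kernel page cache, so they are shared by all
   processes that map the same file. Returns NULL on any failure. */
static const void *map_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

The data can then be used in place through the returned pointer and released with munmap() when no longer needed.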
Rich
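
For reference, the two-level table idea from the quoted message can be sketched roughly as follows. This is a toy under stated assumptions, not glibc's actual table format: the names stage1, stage2 and is_digit_cp are hypothetical, and the only property stored is "ASCII digit". The point is that every 256-code-point range with no flagged entry shares one all-zero block, so the table stays small and, being const, can live in read-only shared memory.

```c
#include <stdint.h>

enum { BLOCK_BITS = 8, BLOCK_SIZE = 1 << BLOCK_BITS, MAX_CP = 0x10FFFF };

/* Second stage: one 256-entry block per distinct pattern. Block 0 is all
   zero and is shared by every range that has no flagged code point.
   Block 1 flags the ASCII digits (toy data for illustration only). */
static const uint8_t stage2[2][BLOCK_SIZE] = {
    [1] = { ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
            ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1 },
};

/* First stage: which second-stage block each 256-code-point range uses.
   Unlisted ranges default to block 0. */
static const uint8_t stage1[(MAX_CP >> BLOCK_BITS) + 1] = {
    [0] = 1, /* U+0000..U+00FF uses block 1 */
};

/* Look up the property with two array indexings and no branches on data. */
static int is_digit_cp(uint32_t cp)
{
    if (cp > MAX_CP)
        return 0;
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

With this layout the whole table costs 4352 bytes for the first stage plus 512 bytes for the two blocks, instead of one byte per code point.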