This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Should glibc provide a builtin C.UTF-8 locale?

From: Rich Felker <dalias at libc dot org>
To: keld at keldix dot com
Cc: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Thu, 12 Feb 2015 10:15:09 -0500
Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
Authentication-results: sourceware.org; auth=none
References: <54DB8243 dot 3050903 at redhat dot com> <20150211235304 dot GA20330 at www5 dot open-std dot org> <20150212023923 dot GP23507 at brightrain dot aerifal dot cx> <20150212063839 dot GA10787 at www5 dot open-std dot org>

On Thu, Feb 12, 2015 at 07:38:40AM +0100, keld@keldix.com wrote:
> On Wed, Feb 11, 2015 at 09:39:23PM -0500, Rich Felker wrote:
> > On Thu, Feb 12, 2015 at 12:53:04AM +0100, keld@keldix.com wrote:
> > > On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> > > > Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> > > > This locale would have the same rules as the C locale when set for
> > > > LC_ALL.
> > > > 
> > > > The locale would provide a sensible fallback for developers who need
> > > > UTF-8 but, until C.UTF-8 is provided, cannot rely upon it.
> > > > 
> > > > My best guess is that it will take ~1.5MB of data to include the
> > > > UTF-8 locale in the runtime. If you do it right this is shared
> > > > for all processes, and gives you, in this the 21st century, a fallback
> > > > that is sensible for all developers of all languages.
> > > > 
> > > > We have had on-and-off requests for this for years as UTF-8 has become
> > > > the de facto standard.
> > > > 
> > > > The most recent request is from the Python 3 folks who want to be able
> > > > to assume there is some kind of UTF-8 support in the system regardless
> > > > of the installed locales.
> > > > 
> > > > Is this the right way forward? Or should we tell the distributions
> > > > that it is their responsibility to ship and always provide a C.UTF-8?
> > > 
> > > I think it is a good way forward. It should probably be the "i18n" locale
> > > of ISO 30112 that is the base, the "i18n" locale is built directly on glibc data.
> > > 
> > > A lot of optimisation could be done on the data with two-level or more tables,
> > > giving special data where the data is not well-formed for algorithmic
> > > handling, noting that some case mappings are not suited for algorithmic handling.
> > > Some properties tables are well suited for combined bitmap handling and index handling.
> > > Collating tables could possibly also be optimized by multilevel tables.
> > > 
> > > Also a pet idea of mine is to have compressed locales - that could significantly reduce
> > > the disk footprint of a more complete locale database. Also good for message catalogues.
> > 
> > This sounds like a bad tradeoff unless you can use the compressed data
> > efficiently in-place. Disk space is cheap; requiring a decompressed
> > copy in memory per-process rather than using a shared mapping is
> > expensive.
> 
> Hmm, are you referring to a statically linked version in glibc when you talk about
> a shared mapping? 
> 
> I do not see the big difference between loading an uncompressed locale and loading
> a compressed locale into memory, it may even be faster to read the compressed data 
> and uncompress it. Or what?
> 
> Message catalogues may be huge, especially if you want to carry them all.

The difference with the uncompressed locale archive is that it's NOT
loaded into memory; it's mmapped, just like executables and shared
libraries are. This means that only the used parts are ever resident
in memory at all. They're discardable (subject to reloading later on
the next access) just like anything else in the filesystem cache, and
they're shared by all processes using glibc.
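
To make the distinction concrete, here is a minimal sketch of mapping a
file read-only rather than reading (or decompressing) it into a private
buffer. It is not glibc's actual loader code; map_readonly is a
hypothetical helper name, and the file path is whatever the caller
passes (on many glibc systems the archive is
/usr/lib/locale/locale-archive, but treat that path as an assumption).

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not a glibc function: map the whole file at
 * `path` read-only. On success, returns the mapping and stores its
 * length in *len_out; on failure (including an empty file), returns
 * NULL. */
const void *map_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* No data is copied here. Pages are faulted in from the page
     * cache on first access, and because the mapping is never
     * written, the same physical pages back every process that maps
     * this file. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);

    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

A caller would use the returned pointer like any read-only array and
release it with munmap(ptr, len) when done. A compressed file cannot be
used this way unless the format allows lookups directly on the
compressed bytes; otherwise each process needs its own decompressed
copy in anonymous memory, which is the cost described above.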

Rich

Follow-Ups:
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: keld

References:
- Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: keld
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Rich Felker
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: keld
