This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Should glibc provide a builtin C.UTF-8 locale?

From: keld at keldix dot com
To: Rich Felker <dalias at libc dot org>
Cc: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Thu, 12 Feb 2015 20:25:44 +0100
Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
Authentication-results: sourceware.org; auth=none
References: <54DB8243 dot 3050903 at redhat dot com> <20150211235304 dot GA20330 at www5 dot open-std dot org> <20150212023923 dot GP23507 at brightrain dot aerifal dot cx> <20150212063839 dot GA10787 at www5 dot open-std dot org> <20150212151509 dot GQ23507 at brightrain dot aerifal dot cx>

On Thu, Feb 12, 2015 at 10:15:09AM -0500, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 07:38:40AM +0100, keld@keldix.com wrote:
> > On Wed, Feb 11, 2015 at 09:39:23PM -0500, Rich Felker wrote:
> > > On Thu, Feb 12, 2015 at 12:53:04AM +0100, keld@keldix.com wrote:
> > > > On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> > > > > Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> > > > > This locale would have the same rules as the C locale when set for
> > > > > LC_ALL.
> > > > > 
> > > > > The locale would provide sensible fallback for developers that need
> > > > > UTF-8 but until C.UTF-8 was provided, could not rely upon it.
> > > > > 
> > > > > My best guess is that it will take ~1.5MB of data to include the
> > > > > UTF-8 locale in the runtime. If you do it right this is shared
> > > > > for all processes, and gives you, in this the 21st century, a fallback
> > > > > that is sensible for all developers of all languages.
> > > > > 
> > > > > We have had on-and-off requests for this for years as UTF-8 has become
> > > > > the de facto standard.
> > > > > 
> > > > > The most recent request is from the Python 3 folks who want to be able
> > > > > to assume there is some kind of UTF-8 support in the system regardless
> > > > > of the installed locales.
> > > > > 
> > > > > Is this the right way forward? Or should we tell the distributions
> > > > > that it is their responsibility to ship and always provide a C.UTF-8?
> > > > 
> > > > I think it is a good way forward. It should probably be the "i18n" locale
> > > > of ISO 30112 that is the base; the "i18n" locale is built directly on glibc data.
> > > > 
> > > > A lot of optimisation could be done on the data with two-level or more tables,
> > > > giving special data where the data is not well-formed for algorithmic
> > > > handling, noting that some case mappings are not suited for algorithmic handling.
> > > > Some properties tables are well suited for combined bitmap handling and index handling.
> > > > Collating tables could possibly also be optimized by multilevel tables.
> > > > 
> > > > Also a pet idea of mine is to have compressed locales - that could significantly reduce
> > > > the disk footprint of a more complete locale database. Also good for message catalogues.
> > > 
> > > This sounds like a bad tradeoff unless you can use the compressed data
> > > efficiently in-place. Disk space is cheap; requiring a decompressed
> > > copy in memory per-process rather than using a shared mapping is
> > > expensive.
> > 
> > Hmm, are you referring to a statically linked version in glibc when you talk about
> > a shared mapping? 
> > 
> > I do not see the big difference between loading an uncompressed locale and loading
> > a compressed locale into memory, it may even be faster to read the compressed data 
> > and uncompress it. Or what?
> > 
> > Message catalogues may be huge, especially if you want to carry them all.
> 
> The difference with the uncompressed locale archive is that it's NOT
> loaded into memory, it's mmapped, just like executables and shared
> libraries are. This means that only the used parts are ever resident
> in memory at all, they're discardable (subject to reloading later on
> the next access) just like anything else in the filesystem cache, and
> shared by all processes using glibc.
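
For illustration, a minimal sketch in C of the kind of read-only file
mapping described above. This is not glibc's locale-archive loader; the
names ro_map, map_readonly and unmap_readonly are made up for the example,
and a POSIX system is assumed.

```c
/* Sketch only: map a file read-only so its pages are demand-loaded and
   backed by the filesystem cache. All names here are hypothetical. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct ro_map {
    const unsigned char *data;  /* NULL if the mapping failed */
    size_t len;                 /* length of the mapping in bytes */
};

/* Map `path` read-only. On any failure, returns data == NULL, len == 0. */
static struct ro_map map_readonly(const char *path)
{
    struct ro_map m = { NULL, 0 };
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return m;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        /* No data is copied here; pages are faulted in on first access. */
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            m.data = p;
            m.len = (size_t)st.st_size;
        }
    }
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return m;
}

/* Release a mapping obtained from map_readonly(). */
static void unmap_readonly(struct ro_map *m)
{
    if (m->data != NULL)
        munmap((void *)m->data, m->len);
    m->data = NULL;
    m->len = 0;
}
```

A caller would pass the path of some data file, read through m.data, and
call unmap_readonly() when done. Because the mapping is file-backed and
never written, the kernel can drop its pages under memory pressure and
reload them from the file later, and other processes mapping the same file
share the same cached pages. A decompress-on-load scheme would instead
need a private, anonymous buffer in each process.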

I see. Are message catalogues also mmapped?

Best regards
keld
