This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 15:43:10 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211173024 dot GZ23507 at brightrain dot aerifal dot cx> <54DB93E2 dot 8000106 at redhat dot com> <20150211194035 dot GH23507 at brightrain dot aerifal dot cx> <54DBBB1E dot 5090802 at redhat dot com>
On Wed, Feb 11, 2015 at 03:27:10PM -0500, Carlos O'Donell wrote:
> > If you do go with the "two projects" approach, perhaps you could aim
> > to have them both take place in the same release cycle, or else to
> > have "built-in C.UTF-8 locale" be an optional feature until it's
> > optimized in a subsequent release.
>
> No, you raise a very good point. I'll make sure they go forward as one
> project with an analysis phase that requires looking at how to reduce
> the table sizes.
Some preliminary remarks on this:
For character classes, basic two-level tables indexed by x>>8 and
x&255 work wonders; the second-level tables can be shared between many
blocks. But in order for this sharing to work best, you need separate
1-bit-per-char tables for each property rather than having all
properties together.
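
A minimal sketch of the two-level scheme described above, for a single property stored at 1 bit per character. Everything named here (struct bitprop, bitprop_build, bitprop_test, toy_alpha, the MAX_UNIQUE cap) is hypothetical and for illustration only; the predicate is a toy stand-in, not real Unicode data, and this is not glibc code.

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS     (0x110000 >> 8)  /* 4352 blocks of 256 code points */
#define BLOCK_BYTES (256 / 8)        /* 1 bit per character */
#define MAX_UNIQUE  256              /* assumed cap so first-level entries fit in a byte */

struct bitprop {
    uint8_t first[NBLOCKS];                  /* x>>8  -> index of a (shared) block */
    uint8_t second[MAX_UNIQUE][BLOCK_BYTES]; /* x&255 -> bit within that block */
    unsigned nunique;                        /* number of distinct blocks stored */
};

/* Toy predicate (an assumption, not real data): ASCII letters plus
   an arbitrary range U+0391..U+03C9. */
static int toy_alpha(uint32_t x)
{
    return (x | 32) - 'a' < 26 || (x >= 0x391 && x <= 0x3C9);
}

/* Build the table from a predicate, storing each distinct 256-bit
   block only once. Returns 0 on success, -1 if more than MAX_UNIQUE
   distinct blocks would be needed. */
static int bitprop_build(struct bitprop *t, int (*pred)(uint32_t))
{
    t->nunique = 0;
    for (uint32_t b = 0; b < NBLOCKS; b++) {
        uint8_t blk[BLOCK_BYTES] = {0};
        for (uint32_t i = 0; i < 256; i++)
            if (pred(b << 8 | i))
                blk[i >> 3] |= 1u << (i & 7);

        /* Look for an identical block that is already stored. */
        unsigned j;
        for (j = 0; j < t->nunique; j++)
            if (!memcmp(t->second[j], blk, BLOCK_BYTES))
                break;
        if (j == t->nunique) {
            if (j == MAX_UNIQUE)
                return -1;
            memcpy(t->second[j], blk, BLOCK_BYTES);
            t->nunique++;
        }
        t->first[b] = (uint8_t)j;
    }
    return 0;
}

/* Lookup: one first-level load, one second-level load, one bit test. */
static int bitprop_test(const struct bitprop *t, uint32_t x)
{
    if (x >= 0x110000)
        return 0;
    const uint8_t *blk = t->second[t->first[x >> 8]];
    return (blk[(x & 255) >> 3] >> (x & 7)) & 1;
}
```

In this layout the first level costs 4352 bytes per property and each distinct second-level block costs 32 bytes. Keeping properties in separate tables means a block is shared whenever that one property's 256 bits repeat, regardless of what other properties do in the same range.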
Some of the classes, like digit/xdigit, can be hard-coded as simple
range checks and need no tables at all; this should be a performance
optimization too.
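
As an illustration of the table-free case, here is a hedged sketch assuming the digit and xdigit classes contain only the ASCII characters 0-9, a-f and A-F. The function names sketch_isdigit and sketch_isxdigit are hypothetical, not an existing libc interface.

```c
#include <stdint.h>

/* Unsigned subtraction wraps around, so anything below '0' becomes a
   huge value and fails the single comparison. */
static int sketch_isdigit(uint32_t c)
{
    return c - '0' < 10;
}

/* OR-ing with 32 folds ASCII 'A'..'F' onto 'a'..'f'; code points
   outside ASCII never land in that range after the fold. */
static int sketch_isxdigit(uint32_t c)
{
    return c - '0' < 10 || (c | 32) - 'a' < 6;
}
```

Each test is one or two compare-and-branch operations with no memory access, which is where the expected speedup over a table lookup would come from.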
Case mapping could probably also be done with two-level tables. I
think it would be considerably larger than property bits, but probably
still under 64k for each direction.
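
One possible shape for a two-level case-mapping table, sketched under loud assumptions: the second level stores a signed offset per character, 32 bits wide for simplicity (a size-conscious design would pack this much tighter), and the toy data maps only ASCII a-z. The names struct casemap, casemap_init_toy_upper and casemap_apply are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define CM_NBLOCKS (0x110000 >> 8)  /* 4352 blocks of 256 code points */

struct casemap {
    uint8_t first[CM_NBLOCKS]; /* x>>8  -> delta block; 0 is the shared all-zero block */
    int32_t delta[2][256];     /* x&255 -> signed offset; toy table holds only 2 blocks */
};

/* Toy setup (assumption: only ASCII a-z has an uppercase mapping).
   Real data would be generated from the Unicode tables. */
static void casemap_init_toy_upper(struct casemap *m)
{
    memset(m, 0, sizeof *m);   /* every block starts out pointing at delta[0] */
    m->first[0] = 1;           /* block 0 (U+0000..U+00FF) gets its own deltas */
    for (int c = 'a'; c <= 'z'; c++)
        m->delta[1][c] = 'A' - 'a';
}

/* Lookup: add the stored offset; characters without a mapping have
   offset 0 and are returned unchanged. */
static uint32_t casemap_apply(const struct casemap *m, uint32_t x)
{
    if (x >= 0x110000)
        return x;
    return x + (uint32_t)m->delta[m->first[x >> 8]][x & 255];
}
```

Storing offsets rather than target code points lets every block with no mappings share one all-zero block, and blocks with the same pattern of offsets can be shared in the same way.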
My rough educated guess is that you could get the whole thing in ~160k
without compromising performance.
Also it's conceivable that locales that agree with C.UTF-8 on
character classification and case mappings could use the built-in data
rather than requiring data in the locale database. This could greatly
reduce the installed size of the latter and might improve performance.
Rich