This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 22 Oct 2015 12:25:47 -0400
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <20151021174936 dot GA26317 at vapier dot lan> <20151022151322 dot GW8645 at brightrain dot aerifal dot cx> <20151022160132 dot GM26317 at vapier dot lan>
On Thu, Oct 22, 2015 at 12:01:32PM -0400, Mike Frysinger wrote:
> On 22 Oct 2015 11:13, Rich Felker wrote:
> > On Wed, Oct 21, 2015 at 01:49:36PM -0400, Mike Frysinger wrote:
> > > i've created a C.UTF-8 page where i've tried to gather all the points
> > > people made in this thread:
> > > https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
> >
> > On the wiki I see, under differences from C:
> >
> > - LC_COLLATE: Sort using the Unicode codepoint
> >
> > But this does not seem to be a difference. Unicode codepoint order is
> > identical to UTF-8 code unit order as unsigned char, i.e. the same as
> > the C locale.
>
> i was thinking of overlong encodings, but i guess those are technically
> invalid according to the spec. i think it's still worth calling out in
> the doc, but we can include an aside that highlights things.
> -mike
Overlong encodings are just one case of non-UTF-8 sequences. I would
assume you'd still want them to sort as they would in the C locale, just
to preserve the total order. Having all illegal sequences compare equal
to each other, or having them compare equal to valid sequences, would
be problematic for users and more work to implement.
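
For illustration only, here is a small Python sketch of both points. It is not glibc code: the helper names and sample strings are made up for the example, and Python's bytes comparison stands in for an unsigned-char strcmp. It checks that sorting valid strings by code point matches sorting their UTF-8 encodings bytewise, and that plain byte comparison still gives every invalid sequence its own place in the order.

```python
# Sketch under stated assumptions: helper names and sample data are hypothetical.

def by_codepoint(strings):
    # Python compares str values code point by code point.
    return sorted(strings)

def by_utf8_bytes(strings):
    # bytes objects compare element-wise as unsigned values (0..255),
    # which models an unsigned-char byte comparison.
    return sorted(strings, key=lambda s: s.encode("utf-8"))

# Valid text, including characters outside the BMP.
samples = ["z", "\u00e9", "\u20ac", "\uffff", "\U00010000", "\U0001f600",
           "a\u00e9", "a"]
same_order = by_codepoint(samples) == by_utf8_bytes(samples)

# Raw byte strings, some of which are not valid UTF-8.
overlong_slash = b"\xc0\xaf"   # overlong encoding of "/" (invalid)
stray_cont = b"\x80"           # lone continuation byte (invalid)
raw = [overlong_slash, b"/", stray_cont, b"\xc3\xa9"]

# Byte comparison never treats two distinct inputs as equal, so the
# result is a strict total order even with invalid sequences present.
ordered = sorted(raw)
```

Here the overlong form sorts well after the real "/" rather than comparing equal to it, which is the behaviour described above.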
Rich