This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Use Unicode code points for country_isbn

From: Keld Simonsen <keld at keldix dot com>
To: Joseph Myers <joseph at codesourcery dot com>
Cc: Marko Myllynen <myllynen at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, libc-locales at sourceware dot org
Date: Fri, 24 Jul 2015 12:43:49 +0200
Subject: Re: [PATCH] Use Unicode code points for country_isbn
Authentication-results: sourceware.org; auth=none
References: <5571B8C2 dot 8000108 at redhat dot com> <20150609071130 dot GA26925 at domone> <5576BC13 dot 5020001 at redhat dot com> <20150721081840 dot GE12267 at vapier> <20150721084006 dot GB29742 at www5 dot open-std dot org> <20150721092217 dot GG12267 at vapier> <20150721115852 dot GA24115 at rap dot rap dot dk> <alpine dot DEB dot 2 dot 10 dot 1507221719420 dot 21570 at digraph dot polyomino dot org dot uk> <20150722190228 dot GA18489 at www5 dot open-std dot org> <alpine dot DEB dot 2 dot 10 dot 1507221951100 dot 19567 at digraph dot polyomino dot org dot uk>

On Wed, Jul 22, 2015 at 08:02:23PM +0000, Joseph Myers wrote:
> On Wed, 22 Jul 2015, Keld Simonsen wrote:
> 
> > > On the build system on which glibc is built, we can always assume that the 
> > > glibc sources are the exact sequences of octets provided by the glibc 
> > > project, not converted into another character set and without any 
> > > conversions of line endings.  Furthermore, on any system using glibc and 
> > > executing tools such as localedef with the installed locale source files, 
> > > it can be assumed that those source files are the files shipped with 
> > > glibc, not those files after conversion into another character set.  Use 
> > > of glibc source files after conversion into another character set is 
> > > outside the scope of the glibc project - glibc is not expected to build 
> > > with such converted source files.
> > 
> > Sounds strange. glibc is the library for the GNU C language. Standard 
> 
> No it's not.  It's the C library for the GNU system.  glibc has a range of 
> requirements, including ELF, TLS, an MMU, two's complement integers, 
> 32-bit int, 32-bit or 64-bit long, 32-bit UTF-32 wchar_t, IEEE binary32 
> float, IEEE binary64 double, various GNU tools present on the build system 
> as documented in install.texi, ....

Yes, understood, but I don't think any of these requirements influences the
locales part.

> > ISO C is coded character set independent, as is also POSIX. Why would 
> > the glibc project not follow ISO C and POSIX design goals? Why would 
> 
> Because glibc makes particular implementation choices in areas that are 
> implementation-defined.  It's an implementation, not a meta-implementation 
> that tries to cover the range of permitted implementation choices.  
> Meta-implementations (at least of the language part of ISO C) exist, but 
> they exist in the field of formal systems used to reason about C programs.

I am also active in C standardization. I think it is a good goal not to
deviate from the standards or restrict an implementation more than necessary,
and at least not to restrict it further than what is already implemented.
That would lead to a loss of functionality.


> > glibc exclude itself from Apple and Microsoft (utf16) and non-utf8 Linux 
> > and UNIX systems?
> 
> It's about 15-20 years since glibc was usable as a replacement C library 
> for systems with an existing native non-free C library.  Those systems are 
> not relevant to glibc nowadays (Apple and Microsoft systems fail the basic 
> requirement of using ELF, which is assumed all over glibc).  UTF-16 is 
> supported in iconv (only), just like EBCDIC.  Non-UTF-8 locales are 
> supported, but deprecated (new non-UTF-8 locales should not be added, and 
> any existing non-UTF-8 locales should have a UTF-8 counterpart), and to be 
> usable in a POSIX-compliant way must have a character set that includes 
> ASCII.

I thought Cygwin was a GNU implementation for Windows, and that it also
implemented glibc. I now understand that the Cygwin libc is different from
glibc. But how different? Does Cygwin use glibc locales, or could it?

I would like the glibc locales to be usable in other libc environments as well,
most of all because they are, IMHO, the most comprehensive set of locales
available. That would also benefit users outside glibc. Why not keep this in
mind for our project too?

> Given sufficiently many GNU tools built on a non-GNU build system, it 
> should be possible to cross-compile glibc there - but localedef itself is 
> only ever linked against glibc and run on a system using glibc (the 
> cross-localedef functionality checked in to glibc is limited to allowing 
> one glibc system to generate locales for another system with the same 
> glibc version but a different endianness).
> 
> > > Now, it's true that the installed localedef utility should be usable in 
> > > locale A to generate locale B, for any pair (A, B) of installed locales - 
> > > rather than only being able to generate locales as part of the glibc build 
> > > / install process.  If localedef interprets locale sources in the 
> > > character set of the locale in which it runs, that may mean the installed 
> > > locale sources do need to be in ASCII.  How does localedef determine the 
> > > character set in which to interpret the textual locale source files?
> > 
> > Yes, that is why we use UCS symbolic code points. I would then rather to be
> 
> "Yes" does not answer my question about how localedef determines the 
> character set of its input.

My understanding is that the charset of the source is the charset of the locale
of the environment that localedef is running in. If the locale is then ASCII-only,
there is no need to convert it, except for conversion into UTF-16. Restricting
the source further to invariant ASCII also makes the source portable to EBCDIC
systems. Unicode restricts its sources to ASCII, possibly also for this reason.
Unicode does not publish its data in Unicode.
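
To illustrate the ASCII-only symbolic notation discussed here, the following is
a minimal sketch in Python, not part of glibc or localedef. The helper names
to_ucs_symbolic and from_ucs_symbolic are hypothetical, and the sketch assumes
the <Uxxxx> spelling uses four uppercase hex digits inside the BMP and eight
beyond it.

```python
import re


def to_ucs_symbolic(text: str) -> str:
    """Spell every character as an ASCII-only <Uxxxx> symbolic name."""
    parts = []
    for ch in text:
        cp = ord(ch)
        # Assumption: 4 hex digits inside the BMP, 8 hex digits beyond it.
        width = 4 if cp <= 0xFFFF else 8
        parts.append(f"<U{cp:0{width}X}>")
    return "".join(parts)


def from_ucs_symbolic(symbolic: str) -> str:
    """Turn <Uxxxx> symbolic names back into the characters they denote."""
    return re.sub(
        r"<U([0-9A-Fa-f]{4,8})>",
        lambda m: chr(int(m.group(1), 16)),
        symbolic,
    )


if __name__ == "__main__":
    encoded = to_ucs_symbolic("Kø")
    print(encoded)                     # <U004B><U00F8>
    print(from_ucs_symbolic(encoded))  # Kø
```

The encoded form contains only ASCII characters, so it reads the same in any
ASCII-compatible charset, at the cost of being harder for a human to read.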

> > fully consistent use UCS symbolic code points all the way thru a locale 
> > source, it is a bit more cumbersome, but I would rather be consistent. 
> 
> I'd rather have some extension to allow a locale source file to declare 
> that it is in UTF-8, and then use UTF-8 throughout except for control 
> characters or combining characters used in isolation.

That would make it difficult to maintain in environments that are not using UTF-8.
Using ASCII only would make the locales maintainable on all systems.

Best regards
Keld

