This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Re: gcc ignores locale (no UTF-8 source code supported)

To: Markus dot Kuhn at cl dot cam dot ac dot uk
Subject: Re: gcc ignores locale (no UTF-8 source code supported)
From: "Martin v. Loewis" <martin at loewis dot home dot cs dot tu-berlin dot de>
Date: Sat, 23 Sep 2000 21:17:35 +0200
CC: libc-alpha at sources dot redhat dot com, gcc at gcc dot gnu dot org
References: <E13csGm-0004jc-00@wisbech.cl.cam.ac.uk>

> POSIX specifies already what the "implementation-defined way of
> determining the source character set" is that the C standard refers
> to.

Can you please point to the exact chapter and verse of Posix that
specifies that the C compiler must consider environment variables when
reading source code?

> First of all: Encodings are trivial to convert into each other (simply
> use iconv, recode, etc.). Users on POSIX systems have to make an effort
> to keep all their files in the same encoding, namely the encoding
> specified in their locale. 

Users may not have the administrative permissions to do so: Most users
can not modify the files in /usr/include, for example.

> The fathers of Unix have already decided back in 1992 (Plan9) that
> this is the only real way to go and I hope the GNU/Linux world will
> follow soon.

I can easily imagine that gcc supports a -futf-8 option some day (or
-fencoding=utf8). I hope it will never consider LANG when reading
source code, though. That is evil.

> The authors of the C standard made it very clear that they want to
> support the ISO 10646 repertoire in source code, and I hope that this
> will soon become common practice.

The authors also made it pretty clear that any mechanism except for
universal-character-names will be implementation-defined, and cannot
hope to be portable across implementations. Therefore, authors of
portable software should not make use of such a mechanism.

> > wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");
> 
> Please not!!! If I run on a beautiful modern system with full UTF-8
> support, then I definitely want to make full use of this encoding in my
> development environment. 

I guess you don't type UTF-8 byte-for-byte into your files;
instead, your editor produces the bytes on a keystroke. Just tell
your beautiful modern system to produce universal-character-names
when you type the keys. The line above would then *display* with
umlauts, even though the file uses an MBCS encoding (namely, \u
escapes).
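
To illustrate that the \u spelling carries the same characters the
compiler would see from any other encoding, here is a minimal sketch.
It assumes a C99 compiler and an implementation where wchar_t holds
ISO 10646 code points (as glibc does); the helper names
greeting_length and greeting_char are made up for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* The literal from the thread, spelled with universal-character-names.
   The source file itself contains only ASCII bytes. */
static const wchar_t greeting[] = L"Sch\u00f6ne Gr\u00FC\u00DFe!\n";

/* Number of wide characters in the greeting, not counting the
   terminating null. Each \u escape contributes exactly one. */
static size_t greeting_length(void)
{
    return wcslen(greeting);
}

/* Value of the i-th wide character, widened for easy comparison.
   Assumption: wchar_t values are ISO 10646 code points. */
static unsigned long greeting_char(size_t i)
{
    return (unsigned long)greeting[i];
}
```

The compiled string has one wide character per escape, so the program
cannot tell how the source file spelled the umlauts.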

An advanced editor (such as Emacs) is capable of dealing with multiple
encodings; it certainly could associate C files (and C++ and Java and
Tcl) with an encoding such as unicode-escape. Maybe it is time to
further improve your system.

> You must not confuse the emergency hack (hex fallbacks) with the
> daily usage on modern systems (UTF-8).

Why is one multibyte encoding capable of expressing full Unicode
(UTF-8) more modern than another one (universal character names)?

> Gettext() makes only sense if support of multi-lingual messages is a
> requirement. If I am a Thai student writing UTF-8 C source code for a
> Thai programming class, then I want to use the Thai alphabet in
> variables, comments, and wide-string literals just like you use ASCII.

Sure. Just use the right text editor - not one that produces UTF-8,
but one that produces universal character names. That way, you can
have all the features you want, *and* your code will compile even if
you take it with you when hired by a German company.

>   a) people will use lots of non-ASCII text in C source code (even
>      English-speaking people will find en/em-dashes, curly quotation marks
>      and mathematical symbols a highly desirable extension beyond ASCII)

Certainly, although the barrier is high even if the technical problem
were solved: Keywords in English don't mix well with non-English
identifiers, and corporate style may require all technical
documentation (including comments) to be in English.

>   b) people will prefer to have these characters UTF-8 encoded in their
>      development environment such that they see in the text editor the
>      actual characters and not the hex fallback

People won't care about encodings as long as it works.

>   c) people will find it trivial to use a 5-line Perl script to
>      convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
>      in case they encounter a (hopefully soon very rare) environment
>      that can't handle ISO 10646 characters. 

The environment not supporting ISO 10646 characters won't support the
universal character names, either. They are just two encodings of ISO
10646 - and one of them happens to be mandated by the language
standards (ISO 9899 and ISO 14882), while the other is not.

It will be trivial to convert the source, yes - but it would be even
better if editors supported universal character names in the first
place.
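
As a rough idea of what such a conversion involves, here is a minimal
sketch in C rather than Perl. The function names decode_utf8 and
utf8_to_ucn are hypothetical. The sketch assumes the input is a
NUL-terminated UTF-8 string, converts every non-ASCII character
regardless of whether it sits in a literal, a comment or an
identifier, and does not reject overlong forms or surrogates. Note
also that C99 does not allow universal character names below 00A0
(other than $, @ and `), which this sketch does not check.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s into *cp. Returns the number of
   bytes consumed, or 0 if the sequence is malformed or truncated. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* A continuation byte must look like 10xxxxxx; this test also
           stops at the terminating NUL of a truncated sequence. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    return len;
}

/* Copy in to out, replacing each non-ASCII character with \uXXXX
   (or \UXXXXXXXX beyond the BMP). Returns 0 on success, -1 on
   malformed input or if out is too small. */
static int utf8_to_ucn(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    while (*s) {
        unsigned long cp;
        size_t n = decode_utf8(s, &cp);
        int w;

        if (n == 0)
            return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsize - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(out + used, outsize - used, "\\u%04lX", cp);
        else
            w = snprintf(out + used, outsize - used, "\\U%08lX", cp);

        if (w < 0 || (size_t)w >= outsize - used)
            return -1;   /* output would have been truncated */
        used += (size_t)w;
        s += n;
    }
    return 0;
}
```

The reverse direction (showing \u escapes as real characters) is the
part an editor would have to do to make such files pleasant to work
with.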

> Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
> ugly and hopefully unnecessary as trigraphs, not as common or even
> recommendable practice! 

Well, I want editors to support that. Until I give up on that, I'll
continue to recommend that - especially as more and more languages
support that as a means of putting Unicode into source code.

> Otherwise you will just reveal yourself as an ASCII chauvinist

Well, I guess I'm an ASCII chauvinist then...

> P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

You write

> However, most maintainers of existing applications chose instead to
> do only soft conversion and do not use the libc wide character
> functions, either because they are not yet that widely implemented
> or because this would require too many changes in their software.

For gcc, the issue is both one of portability and performance: the
wide character routines are not available on all supported hosts, and the
performance hit of calling mb* routines would be unacceptable. Only
recently, the preprocessor has been improved to go over each input
character only once, and the compilers will soon use tokenization as
produced by the preprocessor (instead of tokenizing themselves all
over). All these improvements would likely be taken back if we had to
call the C library every time.
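
For readers unfamiliar with the mb* interface, the following is a
minimal sketch of the pattern in question: one mbrtowc() call per
input character, in whatever encoding the current locale selects. It
is not taken from gcc; the function name count_chars_mb is
hypothetical, and the sketch only counts characters rather than
tokenizing anything.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in s[0..len) by decoding them one at a time
   with mbrtowc(), using the encoding of the current locale.
   Returns (size_t)-1 on an invalid or incomplete sequence. */
static size_t count_chars_mb(const char *s, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);   /* start in the initial shift state */
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, len, &st);  /* one library call per character */

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            n = 1;               /* an embedded NUL occupies one byte */
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

A byte-oriented lexer, by contrast, can inspect each input byte
directly without any function call, which is the property the
paragraph above wants to preserve.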

Regards,
Martin

Follow-Ups:
- Re: gcc ignores locale (no UTF-8 source code supported)
  - From: Joseph S. Myers
- Re: gcc ignores locale (no UTF-8 source code supported)
  - From: Joern Rennecke

References:
- Re: gcc ignores locale (no UTF-8 source code supported)
  - From: Markus Kuhn
