This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.

Re: Character classifications and language-dependence

From: Keld Jørn Simonsen <keld at dkuug dot dk>
To: Ludovic Courtès <ludovic dot courtes at laas dot fr>
Cc: Keld Jørn Simonsen <keld at dkuug dot dk>,libc-locales at sources dot redhat dot com
Date: Fri, 15 Sep 2006 19:26:38 +0200
Subject: Re: Character classifications and language-dependence
References: <87psdy5rsg.fsf@laas.fr> <20060914204631.GA10468@rap.rap.dk> <87y7sl1s9z.fsf@laas.fr>

On Fri, Sep 15, 2006 at 09:51:52AM +0200, Ludovic Courtès wrote:
> Hi,
> 
> Keld Jørn Simonsen <keld@dkuug.dk> writes:
> 
> > The reasoning behind considering a-circumflex and the like a letter,
> > also in languages not normally using it, is that in general readers will
> > recognize it as a letter, and somewhat know how to pronounce it etc.
> > Thus in Denmark â is used for example in names of French wines, like
> > "Château de Bonfils" and this may occur regularly, e.g. in newspaper
> > advertisements, or on menus in restaurants. It is thus good to know
> > that â can be part of a word, and thus it should be in class alpha of
> > this locale. The same would be valid for possibly all other locales
> > of the world.
> 
> This is a good point.  More generally, readers of variants of the Latin
> alphabet will recognize accented Latin letters as letters.
> 
> OTOH, "i18n" also includes letters from other alphabets, like Greek and
> Cyrillic, and it is unclear whether all those alphabets (and variants
> thereof) can be considered "mutually recognizable" by their readers.
> 
> "Recognizability" of a letter is probably very subjective.  For
> instance, accented letters found in Castellano, Italian, and French,
> certainly look familiar to each other.  However, accented Latin letters
> found in Central and Eastern European languages (e.g., `e' with cedilla,
> as in Polish -- more generally, Latin letters not part of Latin-1)
> certainly look very "unusual" to readers of French, Castellano, Italian,
> etc...

My first observation is that when these strange characters occur, it is
for a reason. There is an intended audience that will understand what is
written, and since those readers know how to read it, the text should
follow the rules for the characters and scripts in question.

My other observation is that in the EU, where both you and I live, all
citizens are required by law to be treated equally, in every member
state of the EU. That IMHO includes that every citizen has a right to
have his or her name spelled correctly. Now the EU includes countries
like Poland (with weird characters) and Denmark (weird characters like æøå) and
Greece (with a lot of weird characters) and soon-to-be member Bulgaria,
which uses the Cyrillic script. So for all public institutions there is
a requirement emerging to be able to handle all these letters in all
these scripts. Making locales that are valid only for the public sector,
and then other locales for the private sector and such, does not seem a
good way forward.

> > I don't know if there is any work on some locales to change this, 
> > but I would recommend against it. However, one could think of creating
> > new classes for specific purposes. What would your use be?
> 
> Actually, I don't have any specific use case in mind.  Since the UCD
> already allows the construction of a list of "all existing letters",
> regardless of the language or script they "belong" to, my feeling was
> that, conversely, locales could provide more language-specific
> knowledge.
> 
> Initially, I was just wondering whether this broad and (to some extent)
> language-independent character classification is glibc-specific, or
> whether it is following some standard or recommendation.

AFAIK glibc follows the ISO 14652 recommendations, which are essentially
the same as what Unicode advocates: that all the letters of the different
scripts, and also the ideographs, are considered to belong to class
alpha.
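
To illustrate what such a language-independent classification looks like,
here is a minimal sketch in Python. It uses the general category from
Python's unicodedata module as a stand-in for class alpha. This is an
assumption made for illustration only: glibc builds its own tables, and
they need not match "general category L" exactly. The helper name
is_letter is hypothetical.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Return True if the UCD general category of ch is a letter (Lu, Ll, Lt, Lm, Lo).

    Approximation of class alpha for illustration; not glibc's actual table.
    """
    return unicodedata.category(ch).startswith("L")


# Latin with diacritics, Greek, Cyrillic, an ideograph, then a digit and a hyphen.
samples = ["a", "\u00e2", "\u0119", "\u00e6", "\u03bb", "\u0436", "\u4e2d", "1", "-"]

for ch in samples:
    # Print the character, its general category, and the letter verdict.
    print(repr(ch), unicodedata.category(ch), is_letter(ch))
```

Every letter in the list is classified as a letter regardless of script,
while the digit and the hyphen are not.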

I think changing this would change current behaviour, in many cases
unexpectedly. That is why I would rather have a new class for this,
with some explicit field of usage, so that programmers using this class
would know what to expect, worldwide.
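
As a rough sketch of how a separate class could sit next to the broad one,
the Python fragment below keeps the language-independent test untouched and
adds a narrower, language-specific set. Everything here is hypothetical:
the names DANISH_LETTERS, is_alpha and is_danish_letter are made up for
illustration, and no such class exists in glibc or ISO 14652.

```python
import unicodedata

# Hypothetical language-specific class: the 29 letters of the Danish alphabet.
DANISH_LETTERS = frozenset("abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5")


def is_alpha(ch: str) -> bool:
    """Broad, language-independent test: any Unicode letter (approximation)."""
    return unicodedata.category(ch).startswith("L")


def is_danish_letter(ch: str) -> bool:
    """Narrow test against the hypothetical language-specific class."""
    return ch.lower() in DANISH_LETTERS


word = "Ch\u00e2teau"  # "Château", with a precomposed a-circumflex

# The broad class still treats the whole word as letters.
whole_word_is_alpha = all(is_alpha(c) for c in word)

# The narrow class singles out the letters foreign to Danish.
foreign = [c for c in word if not is_danish_letter(c)]

print(whole_word_is_alpha, foreign)
```

Existing users of class alpha see no change in behaviour, and only code
that explicitly asks for the narrower class gets the language-specific
answer.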

best regards
keld

Follow-Ups:
- Re: Character classifications and language-dependence
  - From: Ludovic Courtès

References:
- Character classifications and language-dependence
  - From: Ludovic Courtès
- Re: Character classifications and language-dependence
  - From: Keld Jørn Simonsen
- Re: Character classifications and language-dependence
  - From: Ludovic Courtès
