This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.

Re: Character classifications and language-dependence

From: Keld Jørn Simonsen <keld at dkuug dot dk>
To: Ludovic Courtès <ludovic dot courtes at laas dot fr>
Cc: libc-locales at sources dot redhat dot com
Date: Thu, 14 Sep 2006 22:46:31 +0200
Subject: Re: Character classifications and language-dependence
References: <87psdy5rsg.fsf@laas.fr>

On Thu, Sep 14, 2006 at 06:36:31PM +0200, Ludovic Courtès wrote:
> Hi,
> 
> Currently, many locale definition files that come with glibc (actually
> mostly those of western languages) include the "i18n" FDCC-set under
> their `LC_CTYPE' category.
> 
> However, the "i18n" FDCC-set contains a very broad character
> classification: it considers at least all Latin, Greek and Cyrillic
> letters as part of the `alpha' character class (as seen in Section 4.3.2
> of ISO 14652 [0] and glibc's version).  Thus, all the languages whose
> locale includes "i18n" end up having a lot of letters in their `alpha'
> character class, more than actually exist in the language.
> 
> For instance, while `ê' (`e' circumflex) is a letter in French, it is
> not a letter in Castellano; likewise, `ñ' is a letter in Castellano, but
> not in French.  But since glibc's locale definitions for `fr_FR' and
> `es_ES' both include "i18n", `isalpha(3)' returns true for both locales.
> 
> Section 4 of ISO 14652 reads:
> 
>   This Technical Report also defines an FDCC-set named "i18n" with
>   values for some of the above categories in order to simplify FDCC-set
>   descriptions for a number of cultures.  The contents of "i18n"
>   categories should not necessarily be considered as the most commonly
>   accepted values, while in many cases it could be the recommended
>   values.
> 
> Thus, my understanding is that glibc's heavy use of "i18n" for character
> classifications is acceptable, though not representative of "the most
> commonly accepted values".  Therefore, one could for instance refine the
> `fr_FR' character classification so that only French letters (e.g., not
> `ñ') are found under its `alpha' class.
> 
> Is this correct?  If so, are there plans to actually refine (some of)
> these character classifications?

The reasoning behind considering a-circumflex and the like a letter,
even in languages that do not normally use it, is that readers will in
general recognize it as a letter and have some idea how to pronounce it.
Thus in Denmark â is used, for example, in names of French wines like
"Château de Bonfils", and this may occur regularly, e.g. in newspaper
advertisements or on menus in restaurants. It is thus good to know
that â can be part of a word, and so it should be in class alpha of
this locale. The same would probably be valid for all other locales
of the world.
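
To see what a given locale's `alpha' class actually contains, one could
count the characters that iswalpha(3) accepts under that locale. The
following is only a minimal sketch: count_alpha is a hypothetical helper,
not a glibc function, and it assumes the named locale is installed on the
system (it returns -1 when setlocale(3) rejects the name).

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters of `text` that are in class
   `alpha` under the LC_CTYPE category of `locale_name`.  Returns -1 if
   the locale cannot be selected or memory runs out.  The caller's
   LC_CTYPE setting is restored before returning. */
static long count_alpha(const char *locale_name, const wchar_t *text)
{
    assert(locale_name != NULL && text != NULL);

    /* setlocale() may reuse its return buffer, so keep a private copy. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    size_t len = strlen(current) + 1;
    char *saved = malloc(len);
    if (saved == NULL)
        return -1;
    memcpy(saved, current, len);

    long count = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        count = 0;
        for (const wchar_t *p = text; *p != L'\0'; ++p)
            if (iswalpha((wint_t)*p))
                ++count;
    }

    /* Put the previous locale back and release the copy. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return count;
}
```

For example, count_alpha("da_DK.UTF-8", L"Ch\u00e2teau") would be expected
to return 7 if â is in class alpha of that locale, as argued above; the
locale name is an assumption about what is installed.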

I don't know if there is any work on some locales to change this, 
but I would recommend against it. However, one could think of creating
new classes for specific purposes. What would your use be?
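
As a sketch of how such a purpose-specific class might be used from the
application side: the standard wctype(3)/iswctype(3) interface looks up a
class by name in the current LC_CTYPE locale. The helper in_class is
illustrative only, and the class name "french_letter" mentioned below is
purely hypothetical; no such class is assumed to exist in any shipped
locale.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return 1 if `c` is in class `class_name` of the
   current LC_CTYPE locale, 0 if it is not, and -1 if the locale does
   not define a class with that name. */
static int in_class(const char *class_name, wchar_t c)
{
    assert(class_name != NULL);

    /* wctype() returns zero when the name is not a valid class. */
    wctype_t cls = wctype(class_name);
    if (cls == (wctype_t)0)
        return -1;
    return iswctype((wint_t)c, cls) != 0;
}
```

With this, in_class("alpha", c) behaves like iswalpha(c), while
in_class("french_letter", c) returns -1 unless the active locale has been
extended with a class of that (made-up) name.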

best regards
keld

Follow-Ups:
- Re: Character classifications and language-dependence
  - From: Ludovic Courtès

References:
- Character classifications and language-dependence
  - From: Ludovic Courtès
