This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Is it OK to write ASCII strings directly into locale source files?


On 07/24/2017 09:09 AM, Mike FABIAN wrote:
> 
> Currently the locale source files use a lot of code points even for
> strings which are pure ASCII. For example localedata/locales/de_DE
> contains:
> 
> %	"%a %d %b %Y %T %Z"
> d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
> 
> Would it be OK to write this as
> 
> d_t_fmt "%a %d %b %Y %T %Z"
> 
> ??
> 
> This would make the files much more readable.
> 
> Stuff that is mostly ASCII can probably be written like this:
> 
> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
> country_name "Fran<U00E7>a"
> 
> which is already more readable than writing it all in <U00??> code points.
> 
> It would be even nicer to write it completely in UTF-8, i.e.:
> 
> country_name "França"
> 
> but I am not sure whether this is allowed in the locale source files.
> 
> But at least for everything which is ASCII, it might be OK already to
> write the characters directly.
> 
> Is writing ASCII there allowed or not??
 
It's not ASCII though, is it? '<' and '>' have to be reserved
to support parsing of <Uxxxx> Unicode code points, so it's "almost ASCII."

I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
instead of the verbose code-points, but we need to document exactly
which characters are allowed. I believe the answer is everything
except '<>'.

I'm not entirely ready to allow all UTF-8, since that descends into
the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
which form should be used. Then there are discussions around uniqueness
of decomposition and exactly what did the source author want.
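
To illustrate the normalization concern, here is a minimal Python sketch
using the standard unicodedata module. The helper to_code_points is
hypothetical and exists only for this example; the 8-digit form for code
points above U+FFFF is an assumption about the notation. It shows that the
same visible word from the quoted message maps to different code point
sequences depending on the normalization form:

```python
import unicodedata


def to_code_points(text):
    """Render a string in <UXXXX> notation (hypothetical helper).

    Assumption: code points above U+FFFF use an 8-digit form.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        width = 4 if cp <= 0xFFFF else 8
        parts.append(f"<U{cp:0{width}X}>")
    return "".join(parts)


word = "Fran\u00e7a"  # written with the precomposed c-cedilla, U+00E7

nfc = unicodedata.normalize("NFC", word)  # composed form
nfd = unicodedata.normalize("NFD", word)  # decomposed: c + combining cedilla

print(to_code_points(nfc))
print(to_code_points(nfd))
```

Both strings render identically in an editor, which is why raw UTF-8 in
the source would need an agreed normalization form, while explicit code
points leave no ambiguity about what the author wrote.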

So let us start slowly and agree on 'ASCII - [<>]', where < denotes
the start of a code point and > the end of the code point.
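
As a rough sketch of what that rule could look like in a conversion
script (Python, not an existing glibc tool): the function name simplify
and the regular expression are hypothetical. Two further choices are
assumptions of this sketch rather than part of the proposal: only
printable ASCII (U+0020 to U+007E) is written directly, and '"' stays in
code point form because it delimits the string.

```python
import re

# Matches <UXXXX> tokens; accepting 4 to 8 hex digits is an assumption.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")

# '<' and '>' are reserved by the proposal. Keeping '"' encoded is an
# extra assumption of this sketch, since it delimits the string.
KEEP_ENCODED = {"<", ">", '"'}


def simplify(value):
    """Replace <UXXXX> tokens with the literal character when it is
    printable ASCII and not reserved; leave every other token untouched."""

    def replace(match):
        cp = int(match.group(1), 16)
        if 0x20 <= cp <= 0x7E and chr(cp) not in KEEP_ENCODED:
            return chr(cp)
        return match.group(0)

    return CODE_POINT.sub(replace, value)


d_t_fmt = (
    "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020>"
    "<U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
)
print(simplify(d_t_fmt))         # the de_DE d_t_fmt value from above
print(simplify("Fran<U00E7>a"))  # non-ASCII stays as a code point
print(simplify("<U003C>"))       # reserved '<' stays as a code point
```

A real conversion would also have to consider the escape_char and
comment_char declared in each locale file, which this sketch ignores.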

-- 
Cheers,
Carlos.
