This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



iswxxxxx/towxxxer and Unicode



Hi,

For fixing bug report libc/1251, I've prepared a patch which changes
the behaviour of the iswalpha etc. and towlower etc. functions. I
created an FDCC-set called "unicode" (automatically generated from
UnicodeData.txt) containing only the LC_CTYPE and LC_IDENTIFICATION
categories. In all locales I changed

LC_CTYPE
copy "i18n"
END LC_CTYPE

to

LC_CTYPE
copy "unicode"
END LC_CTYPE

It works well, but there are some issues:

1) iswcntrl(0x0000) now returns 1. Why did you change iswcntrl(0x0000)
to return 0 a few weeks ago? The UnicodeData.txt file classifies
0x0000 as a "<control>" character. The only characters which are
neither control nor printable (i.e. no attributes at all) are those
which have not been assigned by Unicode.

2) iswspace(0x00A0) now returns 0. This is mandated by SUSv2
(http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html), which
says:
   - print: Define characters to be classified as printable
     characters, including the space character.
   - graph: Define characters to be classified as printable
     characters, not including the space character.
From this I infer that the difference between print and graph is only
the space character (0x0020). Thus iswgraph(0x00A0) = 1. Furthermore
it says:
   - space: no character specified for the keywords upper, lower,
     alpha, digit, graph or xdigit can be specified.
This forces iswspace(0x00A0) to be 0, which is not a bad thing:
for line-breaking and parsing purposes, U+00A0 must be treated
differently from U+0020.
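
The two constraints quoted above can be checked mechanically against
whatever locale is active. The following is only a sketch: the helper
names count_space_and_graph and count_print_not_graph are made up for
this illustration and are not part of glibc. Under the reading given
above, the first count should be 0 over any range, and the second
should count only U+0020.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count code points in [lo, hi] that the current
   locale classifies as both "space" and "graph".  The SUSv2 wording
   quoted above implies this should be 0.  hi must be below WINT_MAX,
   otherwise the loop counter would wrap around. */
static unsigned count_space_and_graph(wint_t lo, wint_t hi)
{
    unsigned n = 0;
    assert(lo <= hi);
    for (wint_t c = lo; c <= hi; c++)
        if (iswspace(c) && iswgraph(c))
            n++;
    return n;
}

/* Hypothetical helper: count code points in [lo, hi] that are "print"
   but not "graph".  If print and graph differ only by the space
   character, this is 1 for any range containing U+0020. */
static unsigned count_print_not_graph(wint_t lo, wint_t hi)
{
    unsigned n = 0;
    assert(lo <= hi);
    for (wint_t c = lo; c <= hi; c++)
        if (iswprint(c) && !iswgraph(c))
            n++;
    return n;
}
```

To look at U+00A0 specifically, one would call setlocale(LC_ALL, "")
first and scan a range that includes 0x00A0; the result then depends on
the installed locale data.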

3) The compiled LC_CTYPE locale is now 1.2 MB; before it was
around 130 KB. With more than 60 supported locales, the
/usr/lib/locales/ directory will grow to 70 MB. (And I haven't
added the wcwidth information yet!)

4) localedef takes 4 minutes to create such a large LC_CTYPE
file, on a fast machine. Thus "make check" takes 20 minutes, and
"make localedata/install-locales" takes several hours.

I think 3) and 4) are unacceptable. I propose to change the format of
the tables used for these properties to 2-stage tables. This way, you
can get away with 11 KB for each of the tolower/toupper tables and
probably around 2 KB on average for each of the attribute tables. Do
you want me to work on this?
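
To illustrate the idea, here is a minimal sketch of a 2-stage table for
one boolean attribute. It is not the format proposed for glibc: the
block size of 256 code points, the struct layout, all function names,
and the toy predicate (standing in for data generated from
UnicodeData.txt) are assumptions made for this example. Stage 1 maps
the high bits of a code point to a block number; stage 2 stores each
distinct 256-bit block only once, so large uniform ranges cost almost
nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SHIFT 8
#define BLOCK_SIZE  (1u << BLOCK_SHIFT)            /* 256 code points per block */
#define BLOCK_BYTES (BLOCK_SIZE / 8)               /* 32 bytes of bits per block */
#define MAX_CP      0x10FFFFu
#define NUM_BLOCKS  ((MAX_CP >> BLOCK_SHIFT) + 1)  /* 4352 stage-1 entries */

struct two_stage {
    uint16_t stage1[NUM_BLOCKS];  /* block number -> index of a unique block */
    uint8_t *stage2;              /* unique blocks, BLOCK_BYTES each */
    size_t   n_unique;            /* number of unique blocks stored */
};

/* Toy predicate (an assumption for this sketch): ASCII letters plus
   the range U+4E00..U+9FFF. */
static int toy_is_alpha(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
        || (cp >= 0x4E00 && cp <= 0x9FFF);
}

/* Build the table from a predicate, storing identical blocks once.
   Returns 0 on success, -1 if memory allocation fails. */
static int two_stage_build(struct two_stage *t, int (*pred)(uint32_t))
{
    /* Worst case: every block is distinct. */
    t->stage2 = malloc((size_t)NUM_BLOCKS * BLOCK_BYTES);
    if (t->stage2 == NULL)
        return -1;
    t->n_unique = 0;
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        uint8_t bits[BLOCK_BYTES];
        size_t j;
        memset(bits, 0, sizeof bits);
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            if (pred((b << BLOCK_SHIFT) | i))
                bits[i >> 3] |= (uint8_t)(1u << (i & 7));
        /* Linear search for an identical block that is already stored. */
        for (j = 0; j < t->n_unique; j++)
            if (memcmp(t->stage2 + j * BLOCK_BYTES, bits, BLOCK_BYTES) == 0)
                break;
        if (j == t->n_unique) {
            memcpy(t->stage2 + j * BLOCK_BYTES, bits, BLOCK_BYTES);
            t->n_unique++;
        }
        t->stage1[b] = (uint16_t)j;
    }
    return 0;
}

/* Two array accesses per lookup: stage 1 selects the block, stage 2
   holds the bit.  Returns 0 or 1. */
static int two_stage_lookup(const struct two_stage *t, uint32_t cp)
{
    const uint8_t *block;
    uint32_t i;
    assert(t->stage2 != NULL);
    if (cp > MAX_CP)
        return 0;
    block = t->stage2 + (size_t)t->stage1[cp >> BLOCK_SHIFT] * BLOCK_BYTES;
    i = cp & (BLOCK_SIZE - 1);
    return (block[i >> 3] >> (i & 7)) & 1;
}

/* Bytes a serialized table would need: stage 1 plus the unique blocks. */
static size_t two_stage_bytes(const struct two_stage *t)
{
    return sizeof t->stage1 + t->n_unique * BLOCK_BYTES;
}
```

With the toy predicate only three distinct blocks exist (the ASCII
block, an all-zero block and an all-one block). Real Unicode data has
more distinct blocks, so the sizes estimated above are not reproduced
by this sketch. A caller would run two_stage_build(&t, toy_is_alpha)
once, query with two_stage_lookup(&t, cp), and free(t.stage2) at the
end.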

Bruno
