This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] wcscoll/strcoll needs to check if the char is out ofthe locale collation


Thanks for your reply.

Comments in lines.

On (12/02/2009 06:00 AM), Carlos O'Donell-san wrote:
2009/11/30 Takao Fujiwara<tfujiwar@redhat.com>:
Hi,

I'm attaching the two kind of patches and need your ideas:
glibc-xx-errno-strcoll.diff
glibc-xx-set-undefined.diff

Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM.
GDM uses g_utf8_collate() to sort the UTF-8 language names.

The following link is the source code of g_utf8_collate():
http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c

g_utf8_collate() uses wcscoll() internally.

----------------------------
  70:gint
  71: g_utf8_collate (const gchar *str1,
  72:  const gchar *str2)
  73:{
...
109:  result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
...
154:}

However if the chars are not defined in the locale collation, the
returned value is not correct.

You need to define correctness in terms of a standard or existing practice in some other C library. What do C libraries on other operating systems do?

I'm not sure if I understand the question.
As I quoted in this original mail, the manpage explains the behavior. And the manpage is also available in http://opengroup.org.
Unfortunately I could not find the specification between language chars but I have tested some chars on UNIX and Windows and the collation result is good for me except for GNU libc.


In UNIX, some of non-UTF-8 collation is not clear for me but UTF-8 collation is good in the attached test case at least. So errno might be a good solution for non-UTF-8 since glibc/localedata/locales/ja_JP is not used for ja_JP.UTF-8 only but also for ja_JP.eucJP, ja_JP.SJIS.


I'm attaching the test program (a.c) in this mail to explain the problem.
It compares the Korean chars and ASCII chars.
If you run the test program on ja_JP.UTF-8, wcscoll() returns "Korean
chars<  ASCII chars".
But I would expect "Korean chars>  ASCII chars" on ja_JP.UTF-8.

You need to explain why you expect this, is it done this way on another system?

The main reason of mine is, if the expectation is not the result, the sorted strings looks ugly on GUI(e.g. GDM language name list dialog). The current behavior is, some of non-latin chars are collated in front of latin chars and all Japanese chars are collated after latin chars and other chars are collated after all Japanese chars.
I think the sort might need the common collation and locale specific collation likes UCA, ICU.


My test case works with this way on UNIX and MS-Windows.


Then I would think the returned value is not defined in ja_JP collation
and I thought setting errno would be good if the char is not defined in
the collation.
E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C so
I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 chars.

I don't understand this paragraph, perhaps you could expand the explanation and take the reader through the logic?

If you have a look at glibc/localedata/locales/ja_JP, Japanese LC_COLLATE is defined in the lines between order_start forward and order_end. almost Japanese char collations are defined in the ja_JP file but any Korean char collations are not found in ja_JP file, e.g. U+D55C .
wcscoll() uses __collidx_table_lookup() for the collation data and __collidx_table_lookup() returns '0' in U+D55C and I think the reason is that the code point is not defined in the ja_JP file.
It mean I think the current ja_JP collation doesn't include all UTF-8 chars on ja_JP.UTF-8.



Regarding to WCSCOLL(3P):

----------------------------
RETURN VALUE
On  error,  wcscoll()  shall  set
       errno, but no return value is reserved to indicate an error.

ERRORS
       The wcscoll() function may fail if:

       EINVAL The  ws1  or  ws2 arguments contain wide-character codes
outside
              the domain of the collating sequence.
----------------------------

You don't need to quote the manpage, simply describe the expected return value as defined in the appropriate standard.

However this might be the only definition in POSIX when I searched. I tested the behavior to assign the actual code points in wcscoll().


The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
out of the table.
My understanding is, __collidx_table_lookup() checks in libc.so if the
char is defined in the collation table so my suggestion is to set errno
if the char is not defined in the table.
If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance
g_utf8_collate(_key) later.
E.g. if wcscoll() returns undefined value with errno, wcscmp() could be
called later.

Have you tested this patch by running the glibc testsuite?

I'm not familiar with the glibc testsuite... so I just tested some code points with wcscoll().
If any links are available, I'd try it.



However somebody might say ja_JP collation table should have all UTF-8
chars but actually ja_JP file is not so.

Who might say this and why?

Actually I'm not sure if the actual people might say this. However the first patch sets errno if __collidx_table_lookup() returns '0' and my understanding is, the return value of __collidx_table_lookup() depends on the glib/localedata/locales/ja_JP file on ja_JP.UTF-8 locale.



if a char is not defined in glib/localedata/locales/ja_JP,
__collidx_table_lookup() returns 0 in libc.so.
If we could use __collseq_table_lookup() instead, it would return the
max value for the undefined char and I could resolve this problem.
But I think we need to use __collidx_table_lookup() for wcscoll() since
the size of locale collation is unclear.

But the problem is when we receive 0, U+0 is actually defined in
glib/localedata/locales/ja_JP LC_COLLATION and the result is, the
undefined chars are always collated in front of the defined chars in
wcscoll().

E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean
char, the collation would be c<  a<  b on ja_JP.UTF-8 since U+0 is
defined in ja_JP file.

But if you look at ja_JP file, the file also defines "UNDEFINED" in
LC_COLLATE.
UNDEFINED char should be collated at last.
But the word "UNDEFINED" seems to be used in localedef program only.
If we run wcscoll(), we don't know which index of weight[] is the
UNDEFINED value.

This is not a coherent description of the solution, internal details are not important right now, what is important is explaining clearly the two solutions.

OK, if the first patch glibc-xx-errno-strcoll.diff is acceptable, I could implement the following scenario:


if wcscoll() returns a value with no errno on ja_JP.* locale, I can expect the value is defined in glib/localedata/locales/ja_JP and the returned value can be used for software applications with the locale collation.
If wcscoll() returns a value with errno on ja_JP.*, I can expect the value is not defined in the ja_JP file and the collation is not available with the locale collation and then another function can be called, e.g. wcscmp(), without the locale collation.
We need to call wcscoll(). E.g. if Composed 'a' + '`' (U+00E0) and ASCII 'b' is compared, we'd like to get 'a' + '`' < 'b' on some locales.


Probably I think the errno solution is also good for non-UTF-8 whose char range is smaller than UTF-8. E.g. ISO8859-1 is very small code range but wcscoll() takes wchar_t* as the argument. WCHAR_MAX value is higher than the end of ISO8859-1 char so many wchar_t values are not used in ISO8859-1.


Then I'm attaching another solution (glibc-xx-set-undefined.diff).

So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use
USTRING_MAX instead of weight[].

If I see zh_CN file, U+0 is not defined. The undefined chars are always
collated in front of the defined chars in wcscoll() because the
following line effects the result in wcscoll():

result = seq1len == 0 ? -1 : 1;

seq1len is 0 but the string is not shorter than the other in this case.
The string is not defined in the locale collation in this case actually.

I'd modified this part in glibc-xx-set-undefined.diff.

Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in
the locale collation file and I think 'UNDEFINED' should be put in the
last of the LC_COLLATE.

You need to expand on why you think this is needed.

If the second patch glibc-xx-set-undefined.diff is acceptable, I can apply the returned value of wcscoll() to the software applications simply and the GUI appearance looks better the current behavior because the patched libc can show ASCII < Japanese < other languages chars. I think it's easy for the GUI users to find one string in the sorted strings.


Thanks,
fujiwara


Cheers, Carlos.



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]