This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Re: [PATCH] wcscoll/strcoll needs to check if the char is out of the locale collation


2009/11/30 Takao Fujiwara <tfujiwar@redhat.com>:
> Hi,
>
> I'm attaching two kinds of patches and would like your ideas:
> glibc-xx-errno-strcoll.diff
> glibc-xx-set-undefined.diff
>
> I'm currently working out how to sort UTF-8 strings in GNOME/GDM.
> GDM uses g_utf8_collate() to sort the UTF-8 language names.
>
> The following link is the source code of g_utf8_collate():
> http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c
>
> g_utf8_collate() uses wcscoll() internally.
>
> ----------------------------
> gint
> g_utf8_collate (const gchar *str1,
>                 const gchar *str2)
> {
> ...
>   result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
> ...
> }
>
> However, if the chars are not defined in the locale's collation, the
> returned value is not correct.

You need to define correctness in terms of a standard or existing
practice in some other C library. What do C libraries on other
operating systems do?

> I'm attaching the test program (a.c) in this mail to explain the problem.
> It compares the Korean chars and ASCII chars.
> If you run the test program on ja_JP.UTF-8, wcscoll() returns "Korean
> chars < ASCII chars".
> But I would expect "Korean chars > ASCII chars" on ja_JP.UTF-8.

You need to explain why you expect this. Is it done this way on another system?
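
The attachment a.c is not reproduced in this archive. As a rough
stand-in, the comparison described above could be sketched as follows.
The helper name collate_in_locale() is hypothetical, and the sketch
assumes the ja_JP.UTF-8 locale is installed on the test machine:

```c
#include <locale.h>
#include <wchar.h>

/* Compare two wide strings under the LC_COLLATE rules of the named
   locale.  Returns 0 and stores the wcscoll() result in *result on
   success, or -1 if the locale cannot be selected (for example
   because it is not installed).  Note that this changes the
   process-wide LC_COLLATE setting.  */
int
collate_in_locale (const char *locale_name, const wchar_t *a,
                   const wchar_t *b, int *result)
{
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1;

  *result = wcscoll (a, b);
  return 0;
}
```

Calling collate_in_locale ("ja_JP.UTF-8", L"\ud55c", L"a", &r) compares
the Korean character U+D55C with an ASCII letter. The sign of r is the
value under discussion in this thread.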

> So I think the returned value is not defined by the ja_JP collation,
> and that setting errno would be appropriate when a char is not defined
> in the collation.
> E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C, so
> I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 chars.

I don't understand this paragraph. Perhaps you could expand the
explanation and take the reader through the logic?

> Regarding WCSCOLL(3P):
>
> ----------------------------
> RETURN VALUE
>        On error, wcscoll() shall set errno, but no return value is
>        reserved to indicate an error.
>
> ERRORS
>        The wcscoll() function may fail if:
>
>        EINVAL The ws1 or ws2 arguments contain wide-character codes
>               outside the domain of the collating sequence.
> ----------------------------

You don't need to quote the manpage; simply describe the expected
return value as defined in the appropriate standard.

> The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
> out of the table.
> My understanding is that __collidx_table_lookup() in libc.so checks
> whether the char is defined in the collation table, so my suggestion is
> to set errno if the char is not defined in the table.
> If wcscoll/strcoll/wcsxfrm/strxfrm set errno, I could enhance
> g_utf8_collate(_key) later.
> E.g. if wcscoll() returns an undefined value and sets errno, wcscmp()
> could then be called instead.

Have you tested this patch by running the glibc testsuite?
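
For reference, the caller-side fallback described in the quoted
paragraph could look roughly like the sketch below. It assumes an
implementation that reports EINVAL for characters outside the
collating sequence, which POSIX permits but does not require. The
function name collate_with_fallback() is hypothetical:

```c
#include <errno.h>
#include <wchar.h>

/* Collate with wcscoll(); if the implementation reports EINVAL,
   fall back to a plain code point comparison with wcscmp().
   The caller's errno value is preserved.  */
int
collate_with_fallback (const wchar_t *a, const wchar_t *b)
{
  int saved_errno = errno;
  int result;

  /* No return value is reserved for errors, so errno has to be
     cleared before the call and inspected afterwards.  */
  errno = 0;
  result = wcscoll (a, b);
  if (errno == EINVAL)
    result = wcscmp (a, b);

  errno = saved_errno;
  return result;
}
```

On an implementation that never sets errno in wcscoll(), the fallback
branch is simply never taken and the function behaves like wcscoll().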

> However, somebody might say the ja_JP collation table should contain
> all UTF-8 chars, but the ja_JP file actually does not.

Who might say this and why?

> If a char is not defined in glibc/localedata/locales/ja_JP,
> __collidx_table_lookup() returns 0 in libc.so.
> If we could use __collseq_table_lookup() instead, it would return the
> max value for the undefined char and I could resolve this problem.
> But I think we need to use __collidx_table_lookup() for wcscoll() since
> the size of the locale collation is unclear.
>
> But the problem is that when we receive 0, U+0 is actually defined in
> glibc/localedata/locales/ja_JP LC_COLLATE, and as a result the
> undefined chars are always collated in front of the defined chars in
> wcscoll().
>
> E.g. if a is an ASCII char, b is a Japanese char, and c is a Korean
> char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is
> defined in the ja_JP file.
>
> But if you look at the ja_JP file, it also defines "UNDEFINED" in
> LC_COLLATE.
> UNDEFINED chars should be collated last.
> But the keyword "UNDEFINED" seems to be used by the localedef program
> only.
> When we run wcscoll(), we don't know which index of weight[] is the
> UNDEFINED value.

This is not a coherent description of the solution. Internal details
are not important right now; what is important is explaining the two
solutions clearly.

> Then I'm attaching another solution (glibc-xx-set-undefined.diff).
>
> So my solution is: if wcscoll() receives 0 from findidx(), wcscoll()
> uses USTRING_MAX instead of weight[].
>
> If I look at the zh_CN file, U+0 is not defined. The undefined chars
> are always collated in front of the defined chars in wcscoll() because
> the following line affects the result in wcscoll():
>
>   result = seq1len == 0 ? -1 : 1;
>
> seq1len is 0, but in this case the string is not shorter than the
> other one. It is actually not defined in the locale collation.
>
> I've modified this part in glibc-xx-set-undefined.diff.
>
> It would probably be good for wcscoll() to follow the 'UNDEFINED'
> keyword in the locale collation file, and I think 'UNDEFINED' should be
> placed at the end of LC_COLLATE.

You need to expand on why you think this is needed.
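
To make the second proposal easier to discuss, here is a toy model of
the idea, not the glibc implementation. The table, the toy_* names and
the single-level weights are all invented for illustration. A lookup
result of 0 stands for "not defined in this collation", and such
characters are given the maximum weight so that they sort after every
defined character:

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical collation table that defines only three characters.  */
struct toy_entry { wchar_t ch; unsigned int weight; };

static const struct toy_entry toy_table[] = {
  { L'a', 1 }, { L'b', 2 }, { L'c', 3 },
};

/* Return the weight of CH, or 0 if CH is not in the table.  */
static unsigned int
toy_lookup (wchar_t ch)
{
  for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].ch == ch)
      return toy_table[i].weight;
  return 0;
}

/* Map "undefined" (0) to the maximum weight so it collates last.  */
static unsigned int
toy_weight (wchar_t ch)
{
  unsigned int w = toy_lookup (ch);
  return w == 0 ? UINT_MAX : w;
}

int
toy_collate (const wchar_t *s1, const wchar_t *s2)
{
  for (; *s1 != L'\0' && *s2 != L'\0'; s1++, s2++)
    {
      unsigned int w1 = toy_weight (*s1);
      unsigned int w2 = toy_weight (*s2);

      if (w1 != w2)
        return w1 < w2 ? -1 : 1;
      /* Both undefined: break the tie by code point.  */
      if (w1 == UINT_MAX && *s1 != *s2)
        return *s1 < *s2 ? -1 : 1;
    }

  if (*s1 == L'\0' && *s2 == L'\0')
    return 0;
  /* The string that ended first is the shorter one and sorts first.  */
  return *s1 == L'\0' ? -1 : 1;
}
```

With this model, toy_collate (L"a", L"\ud55c") is negative: the
undefined Korean character sorts after the defined ASCII letter, which
is the ordering the original poster expects.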

Cheers,
Carlos.

