This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Bug in collation functions?


On 10/29/2015 11:45 AM, Ken Brown wrote:
On 10/29/2015 11:35 AM, Corinna Vinschen wrote:
On Oct 29 08:59, Ken Brown wrote:
On 10/29/2015 4:30 AM, Corinna Vinschen wrote:
On Oct 29 08:50, Corinna Vinschen wrote:
On Oct 28 21:58, Eric Blake wrote:
On 10/28/2015 04:14 PM, Ken Brown wrote:
It's my understanding that collation is supposed to take whitespace and
punctuation into account in the POSIX locale but not in other locales.

Not quite right. It is up to the locale definition whether whitespace
affects collation.  But you are correct that in the POSIX locale,
whitespace must not be ignored in collation.

This doesn't seem to be the case on Cygwin.  Here's a test case using
wcscoll, but the same problem occurs with strcoll.

That's because the locale definitions are different in Cygwin than they
are in glibc.  But it is not a bug in Cygwin; POSIX allows for different
systems to have different locale definitions while still using the same
locale name like en_US.UTF-8.

Btw, strcoll and wcscoll in Cygwin are implemented using the Windows
function CompareStringW with the LCID set to the locale matching the
POSIX locale setting.  I'm rather glad I didn't have to implement this
by myself... :}

OTOH, CompareString has a couple of flags to control its behaviour, see
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx


Right now Cygwin calls CompareStringW with dwCmpFlags set to 0, but there
are flags like NORM_IGNORENONSPACE and NORM_IGNORESYMBOLS.  I'm open to a
discussion on how to change the settings to more closely resemble the
rules on Linux.

E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
C/POSIX locale anyway.  So, would it make sense to set the flags to
NORM_IGNORESYMBOLS in other locales?

I think so.  That's what the native Windows build of emacs does in this
situation.

Is that all it's doing?  I'm asking because using NORM_IGNORESYMBOLS
does not exactly resemble the behaviour on Linux on my W10 box:

     "11" > "1.1" in POSIX locale
!!! "11" > "1.1" in en_US.UTF-8 locale
     "11" > "1 2" in POSIX locale
     "11" < "1 2" in en_US.UTF-8 locale

I just noticed that myself and was going to ask about that difference. I
don't see anything else that emacs is doing on native Windows.  But in
the test I referred to above, the locale is set to "enu_USA" in the
native Windows build.  Does that explain the discrepancy?  If not, I can
ask on the emacs-devel list whether the test passes on Windows.

Never mind. My test case was flawed, because it didn't check for the possibility that wcscoll might return 0. Here's a revised definition of the "compare" function:

void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
  setlocale (LC_COLLATE, loc);
  int res = wcscoll (a, b);
  char c = res < 0 ? '<' : res > 0 ? '>' : '=';
  printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
}
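
For anyone who wants to reproduce this, here is a minimal sketch of a self-contained version of the test. The helper names collation_sign and run_comparisons are hypothetical and not part of the original test case, and the check for a failed setlocale call is an addition. The four string pairs are taken from the output shown in this thread; the results for en_US.UTF-8 depend on the platform's locale definitions.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Return '<', '>' or '=' according to how wcscoll orders a and b in
   locale loc.  Returns '?' if the locale cannot be selected (this
   check is an addition; the original function does not have it).  */
char
collation_sign (const wchar_t *a, const wchar_t *b, const char *loc)
{
  if (setlocale (LC_COLLATE, loc) == NULL)
    return '?';
  int res = wcscoll (a, b);
  return res < 0 ? '<' : res > 0 ? '>' : '=';
}

/* Print the four comparisons discussed in the thread.  */
void
run_comparisons (void)
{
  static const char *const locales[] = { "POSIX", "en_US.UTF-8" };
  static const wchar_t *const rhs[] = { L"1.1", L"1 2" };

  for (size_t i = 0; i < 2; i++)
    for (size_t j = 0; j < 2; j++)
      printf ("\"%ls\" %c \"%ls\" in %s locale\n", L"11",
              collation_sign (L"11", rhs[i], locales[j]),
              rhs[i], locales[j]);
}
```

Calling run_comparisons () from main prints one line per comparison in the same format as the output in this message.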

With this change (and the use of NORM_IGNORESYMBOLS) the test returns the following on Cygwin:

$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale

It still differs from Linux, but it's good enough to make the emacs test pass. Moreover, this behavior actually seems more reasonable to me than the Linux behavior. After all, if you're ignoring punctuation, how can you decide which of "11" or "1.1" comes first?
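
To illustrate that last point, here is a rough model of a comparison that ignores symbols. It is only a sketch: it is not how CompareStringW implements NORM_IGNORESYMBOLS, the function names are hypothetical, and treating whitespace as ignorable is an assumption that happens to match the output above. Once every character that is not a letter or digit is skipped, "11" and "1.1" are the same string, so the only consistent answer is "equal".

```c
#include <wchar.h>
#include <wctype.h>

/* Advance s past any characters that are neither letters nor digits.  */
static const wchar_t *
skip_symbols (const wchar_t *s)
{
  while (*s != L'\0' && !iswalnum ((wint_t) *s))
    s++;
  return s;
}

/* Compare a and b code point by code point, ignoring punctuation and
   whitespace.  Returns a negative, zero or positive value, like wcscmp.
   This is a simplified model, not the Windows algorithm.  */
int
compare_ignoring_symbols (const wchar_t *a, const wchar_t *b)
{
  for (;;)
    {
      a = skip_symbols (a);
      b = skip_symbols (b);
      if (*a != *b || *a == L'\0')
        return (*a > *b) - (*a < *b);
      a++;
      b++;
    }
}
```

With this model, compare_ignoring_symbols (L"11", L"1.1") returns 0 and compare_ignoring_symbols (L"11", L"1 2") returns a negative value, which agrees with the two en_US.UTF-8 lines above.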

Ken

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

