This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: strxfrm output stability


On 09/09/2015 12:45 PM, Zack Weinberg wrote:
On 09/09/2015 03:05 PM, Paul Eggert wrote:
When I tried a decade ago to use strxfrm in GNU sort to speed it up, I
ran into two problems.  First, glibc strxfrm generates lengthy output,
and since I wanted only the first 8 bytes or so I got relatively little
information out of the prefix and overall it was a performance loss on
my benchmarks.
That's interesting, because as *I* understand it, that is exactly what
PostgreSQL is doing

Again, I looked at the 9.4.4 code, but it is doing something fancier and I imagine 9.5 will too. First, it's computing strxfrm for a range of keys, and if the strxfrm transformations of the minimum and maximum keys agree in the first B bytes, it discards the length-B prefix of all the strxfrm transformations. Second, the code looks at the individual bytes of the strxfrm output, and if they're all (say) in the range 'A' through 'Z' it compresses each of them to ~4.7bits (log base 2 of 26). PostgreSQL can do fun stuff like that because it uses floating-point comparators.

they apparently got burnt a time or two by some C
libraries reporting strings as strcoll-identical when they aren't
byte-identical).

As I understand it POSIX allows this behavior, i.e., strcoll is allowed to return 0 even when its arguments are not byte-for-byte identical. I observed this behavior in glibc with this program:

  #include <locale.h>
  #include <string.h>
  #include <stdio.h>

  int
  main (void)
  {
    if (! setlocale (LC_ALL, "en_US.UTF-8"))
      perror ("setlocale");
    return strcoll("\xc9\xa2", "\xc9\xac") == 0;
  }

On my Fedora host this silently exits with status 1. Although this is arguably a bug, I don't see where it violates POSIX.

It seems like there are two distinct use cases here: 1) grab a short, fixed-length tag (effectively a collation-preserving hash); collisions are OK, caller promises to fall back to strcoll() (PostgreSQL, sort(1). 2) smuggle a locale-aware comparison into an API that only allows you to provide a key, not a comparator (Python 3). Maybe it's time to invent a new API? zw

More generally for case (1) one may also want to discard a leading prefix as described above.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]