This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: strxfrm output stability
- From: Paul Eggert <eggert at cs dot ucla dot edu>
- To: libc-alpha at sourceware dot org
- Date: Wed, 9 Sep 2015 12:05:12 -0700
- Subject: Re: strxfrm output stability
- Authentication-results: sourceware.org; auth=none
- References: <55EF4F95 dot 4020703 at redhat dot com> <20150908211805 dot 36E5E2C3A73 at topped-with-meat dot com> <55EF529E dot 7070108 at redhat dot com> <55EF5494 dot 8030506 at cs dot ucla dot edu> <55F01BDC dot 70908 at redhat dot com> <55F06230 dot 8080003 at cs dot ucla dot edu> <55F0644C dot 8050703 at panix dot com>
On 09/09/2015 09:54 AM, Zack Weinberg wrote:
https://lwn.net/Articles/653411/ indicates that PostgreSQL's use of
strxfrm is new in 9.5
Ah, I was looking at the latest stable PostgreSQL release (9.4.4), so it
sounds like they've cleaned up the strxfrm-using code. Yet another
reason for me to rethink my position that strxfrm is silly.
When I tried a decade ago to use strxfrm in GNU sort to speed it up, I
ran into two problems. First, glibc strxfrm generates lengthy output,
and since I wanted only the first 8 bytes or so I got relatively little
information out of the prefix and overall it was a performance loss on
my benchmarks. As I understand it the new PostgreSQL code works around
this problem by doing a collation-preserving compression of the strxfrm
output. It would be better, I expect, if GNU libc strxfrm did that
compression itself rather than require every caller who cares about
efficiency to do it.
The second problem was more of an irritation: I wanted the first (say) 8
bytes of strxfrm output, but there's no standard way of getting them
without generating all the strxfrm output bytes and then discarding the
unwanted trailing bytes, which is slow. (I hacked around this problem
by asking for the first 8+DELTA bytes, where DELTA is adjusted so that
it works in practice, but this is clearly an unportable-in-theory
hack.) It would be nicer if glibc strxfrm would guarantee to fill up
its output buffer even when truncated.
Perhaps if I stold the idea of collation-preserving compression from
PostgreSQL and used that idea in GNU sort, there would be enough
performance gain to make strxfrm worthwhile there. Or perhaps we could
fix glibc strxfrm to perform better as described above -- though this
would mean strxfrm's output wouldn't be stable from the current glibc
release to the next one.