This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Florian Weimer <fweimer at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Cc: Ondřej Bílka <neleai at seznam dot cz>
- Date: Mon, 03 Feb 2014 11:25:34 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <52EBC480 dot 4000509 at redhat dot com>
On 01/31/2014 04:42 PM, Carlos O'Donell wrote:
> * More comments in the assembly explaining the structure of the
>   new code and how it operates.
Okay, the comment-worthy aspects might be:
- Some meta-comment explaining the byte numbering which illustrates the
  register contents in the corner cases.
- The rationale for the big-endian conversion (to match address ordering),
  the separate 32-bit and 64-bit cases (64-bit BSWAP is much slower), and
  perhaps the conversion from flags to the -1/1 result using SBB/OR.  (All
  this applies to diffin4bytes/diffin8bytes.)
- Unaligned loads overlapping with data previously compared as equal.
  This data cannot affect the outcome of the final comparison.  (This
  trick was already used before, but it's not entirely obvious.)
- The SETNE/CMOVAE sequence for the branchless sign function.
- Cases where the sign function isn't needed because the result of the
  sign subtraction fits into the 32-bit int result.
Anything else?
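For readers not fluent in the assembly, a C sketch may help motivate the
comments.  This is purely illustrative (the patch itself is SSE 4.1
assembly, and `diff_in_8_bytes`/`memcmp_sketch` are made-up names): it
shows the big-endian conversion that makes one integer comparison agree
with memcmp's address ordering, a branchless sign computation analogous
to the SBB/OR and SETNE/CMOVAE sequences, and the overlapping tail load,
which here assumes n >= 8 (real implementations handle shorter lengths
separately, e.g. via the 32-bit path).

```c
#include <stdint.h>
#include <string.h>

static int
diff_in_8_bytes (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  /* Byte-swap so that the byte at the lowest address becomes the most
     significant byte of the integer; an unsigned comparison then agrees
     with memcmp's byte-by-byte ordering.  */
  x = __builtin_bswap64 (x);
  y = __builtin_bswap64 (y);
  /* Branchless sign: -1, 0 or 1 without a conditional jump.  */
  return (x > y) - (x < y);
}

static int
memcmp_sketch (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1;
  const unsigned char *b = s2;
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    {
      int d = diff_in_8_bytes (a + i, b + i);
      if (d != 0)
        return d;
    }
  if (i < n)
    /* Overlapping tail load (assumes n >= 8): up to 7 of these bytes
       were already compared equal above, and since they are identical
       in both inputs they cannot change the result.  */
    return diff_in_8_bytes (a + n - 8, b + n - 8);
  return 0;
}
```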
> * Verification via the microbenchmarks that it actually makes
>   things faster.  Additionally we'd like you to add to the microbenchmarks
>   if you find we need another test to measure your changes.
>   See benchtests/README.  We want to be as objective as possible here
>   and allow others to run the same or similar tests to compare your
>   patch on their hardware.
There already is bench-memcmp. I analyzed its output (after bumping the
inner loop count to a reasonable value) and it does not show that the
new implementation is slower across the board. I also verified this
using R, with a paired t-test: the new implementation appears to be
slightly faster, but it's not statistically significant.
In addition, I performed specialized comparisons for varying lengths and
shared prefixes, and benchmarked qsort on random data and
/usr/share/dict/words. My test loop uses CPUID, yet it turns out that
getting good measurements of individual memcmp calls is really quite
tricky. My benchmark code is available here:
<https://github.com/fweimer/memcmp-timing>
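For reference, the serialization idiom alluded to above (CPUID as a
barrier before RDTSC) typically looks like the following on x86-64.
This is a generic sketch, not code taken from the repository linked
above, and `serialized_rdtsc` is a name I made up for illustration.

```c
#include <stdint.h>

/* CPUID is a serializing instruction: it forces earlier instructions to
   retire before RDTSC samples the time-stamp counter, so the reading is
   not distorted by out-of-order execution.  */
static inline uint64_t
serialized_rdtsc (void)
{
  uint32_t lo, hi;
  __asm__ volatile ("cpuid\n\t"
                    "rdtsc"
                    : "=a" (lo), "=d" (hi)
                    : "a" (0)
                    : "ebx", "ecx", "memory");
  return ((uint64_t) hi << 32) | lo;
}
```

Even with this, the overhead of the CPUID/RDTSC pair itself has to be
measured and subtracted, which is part of why timing individual memcmp
calls is so tricky.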
What would be a reasonable benchmark? We could read the manual sources,
permute their lines randomly, sort them using qsort, and measure the
total time used for sorting. Or we could generate random input data
based on some distribution and measure sorting time for that. The
challenge is to get reasonable input data without bundling large files.
If we reuse something that is part of the source code distribution
anyway, comparisons between different releases are not valid, but then
neither is anything that uses qsort.  I would still recommend the
sorting approach because it seems realistic to me, and qsort does not
add all that much overhead over the actual comparison.
One thing I would like to get consensus on eventually is whether future
implementations of memcmp must not behave as a timing oracle. (There
might be architectures where achieving this has a run-time cost.) If we
are unwilling to make this promise, the oracle-free version would have
to be conditional on _FORTIFY_SOURCE.
--
Florian Weimer / Red Hat Product Security Team