
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp


On 01/31/2014 04:42 PM, Carlos O'Donell wrote:

> * More comments in the assembly explaining the structure of the
>   new code and how it operates.

Okay, the comment-worthy aspects might be:

* A meta-comment explaining the byte numbering used to illustrate the register contents in the corner cases.

* The rationale for the big-endian conversion (to match address ordering), for the separate 32-bit and 64-bit cases (64-bit BSWAP is much slower), and perhaps for the conversion from the flags to the -1/1 result using SBB/OR. (All this applies to diffin4bytes/diffin8bytes; see the C sketch below.)

* Unaligned loads overlapping data previously compared as equal. Such data cannot affect the outcome of the final comparison. (This trick was already used before, but it's not entirely obvious.)

* The SETNE/CMOVAE sequence for the branchless sign function.

* Cases where the sign function isn't needed because the result of the subtraction already fits into the 32-bit int result.

Anything else?
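
To make the BSWAP, overlap, and sign-function points concrete, here is a rough scalar C rendering of the idea. This is my illustration, not the actual SSE4.1 assembly: memcmp_sketch and load64 are made-up names, and this diffin8bytes is only a C stand-in for the label of the same name in the patch.

#include <stdint.h>
#include <string.h>

/* Load 8 bytes from a possibly unaligned address; the memcpy compiles
   to a single unaligned load on x86_64.  */
static inline uint64_t
load64 (const unsigned char *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof v);
  return v;
}

/* Given two words loaded from the same offset of both inputs, return
   a negative or positive int.  Converting both words to big-endian
   makes their numeric order match the address order of the bytes, so
   one unsigned comparison replaces a search for the first differing
   byte.  */
static int
diffin8bytes (uint64_t w1, uint64_t w2)
{
  w1 = __builtin_bswap64 (w1);
  w2 = __builtin_bswap64 (w2);
  /* Branchless sign: compilers lower this to a flags-based sequence
     in the spirit of SBB/OR or SETcc/CMOVcc.  */
  return (w1 > w2) - (w1 < w2);
}

/* Word-wise comparison, for n >= 8 only.  */
int
memcmp_sketch (const void *a, const void *b, size_t n)
{
  const unsigned char *s1 = a;
  const unsigned char *s2 = b;
  size_t i;

  for (i = 0; i + 8 <= n; i += 8)
    {
      uint64_t w1 = load64 (s1 + i);
      uint64_t w2 = load64 (s2 + i);
      if (w1 != w2)
        return diffin8bytes (w1, w2);
    }
  if (i == n)
    return 0;

  /* Tail of 1 to 7 bytes: reload the final 8 bytes.  The load
     overlaps bytes the loop already proved equal; equal bytes remain
     equal after BSWAP, so they cannot decide which word compares
     higher.  */
  uint64_t w1 = load64 (s1 + n - 8);
  uint64_t w2 = load64 (s2 + n - 8);
  return w1 == w2 ? 0 : diffin8bytes (w1, w2);
}

The 32-bit case (diffin4bytes) would look the same with __builtin_bswap32, which is where avoiding the slower 64-bit BSWAP comes in.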

> * Verification via the microbenchmarks that it actually makes
>   things faster. Additionally, we'd like you to add to the microbenchmarks
>   if you find we need another test to measure your changes.
>   See benchtests/README. We want to be as objective as possible here
>   and allow others to run the same or similar tests to compare your
>   patch on their hardware.

There already is bench-memcmp. I analyzed its output (after bumping the inner loop count to a reasonable value), and it does not show the new implementation being slower across the board. I also verified this in R with a paired t-test: the new implementation appears to be slightly faster, but the difference is not statistically significant.

In addition, I performed specialized comparisons for varying lengths and shared prefixes, and benchmarked qsort on random data and /usr/share/dict/words. My test loop uses CPUID to serialize the timed region, but it turns out that getting good measurements of individual memcmp calls is quite tricky. My benchmark code is available here: <https://github.com/fweimer/memcmp-timing>
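
For reference, the serialized timing pattern looks roughly like this. It is a sketch, not the harness from the repository above, and the buffer size and iteration count are arbitrary:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CPUID is a serializing instruction: executing it before RDTSC
   keeps the CPU from reordering the timed code across the timestamp
   reads.  */
static inline uint64_t
rdtsc_serialized (void)
{
  uint32_t lo, hi, eax = 0;
  __asm__ volatile ("cpuid" : "+a" (eax) : : "rbx", "rcx", "rdx");
  __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}

static volatile int sink;

int
main (int argc, char **argv)
{
  static unsigned char a[4096], b[4096];
  uint64_t best = UINT64_MAX;

  (void) argv;
  /* Fill both buffers with the same runtime-dependent value: equal
     buffers force a full-length scan, and the runtime dependency
     keeps the compiler from folding the memcmp away.  */
  memset (a, argc & 0xff, sizeof a);
  memset (b, argc & 0xff, sizeof b);

  for (int i = 0; i < 100000; i++)
    {
      uint64_t start = rdtsc_serialized ();
      sink = memcmp (a, b, sizeof a);
      uint64_t stop = rdtsc_serialized ();
      if (stop - start < best)
        best = stop - start;
    }
  printf ("min cycles for memcmp(4096): %llu\n",
          (unsigned long long) best);
  return 0;
}

Even with the serialization, one-shot numbers bounce around (interrupts, frequency scaling, cache and predictor state), which is why the minimum over many runs is reported, and why per-call measurements remain tricky.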

What would be a reasonable benchmark? We could take the manual sources, permute their lines randomly, sort them using qsort, and measure the total time used for sorting. Or we could generate random input data based on some distribution and measure the sorting time for that. The challenge is to get reasonable input data without bundling large files. If we reuse something that is part of the source distribution anyway, comparisons between different releases are not valid; but then, neither is anything that uses qsort, since qsort itself may change between releases. Still, I would recommend the sorting approach because it seems realistic to me, and qsort does not add all that much overhead over the actual comparison.
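
A minimal sketch of the random-distribution variant, with a small alphabet so the comparisons run into the shared prefixes that matter for memcmp. The record length, record count, seed, and alphabet are arbitrary knobs for illustration, not a concrete proposal:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define RECLEN 32               /* fixed-width records, compared bytewise */

static int
cmp_record (const void *a, const void *b)
{
  return memcmp (a, b, RECLEN);
}

int
main (void)
{
  size_t nrec = 1 << 20;
  char *recs = malloc (nrec * RECLEN);
  if (recs == NULL)
    return 1;

  /* Deterministic seed: runs are repeatable without shipping a large
     input file.  The small alphabet produces long shared prefixes.  */
  srand (42);
  for (size_t i = 0; i < nrec * RECLEN; i++)
    recs[i] = "abcdefgh"[rand () % 8];

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  qsort (recs, nrec, RECLEN, cmp_record);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  printf ("qsort over memcmp: %.3f s\n",
          (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
  free (recs);
  return 0;
}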

One thing I would like to get consensus on eventually is whether future implementations of memcmp must not behave as a timing oracle, i.e., whether their run time may reveal the position of the first difference. (There might be architectures where achieving this has a run-time cost.) If we are unwilling to make this promise, the oracle-free version would have to be conditional on _FORTIFY_SOURCE.
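
To be concrete about the promise: oracle-free means something in the spirit of OpenBSD's timingsafe_memcmp, where the run time depends only on the length. A rough sketch (it assumes arithmetic right shift of negative ints, which GCC provides):

#include <limits.h>
#include <stddef.h>

int
timingsafe_memcmp_sketch (const void *b1, const void *b2, size_t len)
{
  const unsigned char *p1 = b1;
  const unsigned char *p2 = b2;
  int res = 0, done = 0;

  for (size_t i = 0; i < len; i++)
    {
      /* lt/gt are -1 or 0; the shift smears the sign bit.  */
      int lt = (p1[i] - p2[i]) >> CHAR_BIT;
      int gt = (p2[i] - p1[i]) >> CHAR_BIT;

      /* Latch the first nonzero result.  Every byte is still visited,
         and there are no data-dependent branches.  */
      res |= (lt - gt) & ~done;
      done |= lt | gt;
    }
  return res;
}

The cost is visible right in the shape of the loop: it cannot take the early exit that makes an ordinary memcmp fast, which is presumably where the run-time cost on some architectures would come from.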

--
Florian Weimer / Red Hat Product Security Team

