This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Florian Weimer <fweimer at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Cc: Ondřej Bílka <neleai at seznam dot cz>
- Date: Mon, 03 Feb 2014 11:25:34 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <52EBC480 dot 4000509 at redhat dot com>
On 01/31/2014 04:42 PM, Carlos O'Donell wrote:
> * More comments in the assembly explaining the structure of the
>   new code and how it operates.
Okay, the comment-worthy aspects might be:
- Some meta-comment explaining the byte numbering which illustrates the
  register contents in the corner cases.
- The rationale for the big-endian conversion (to match address ordering),
  the separate 32-bit and 64-bit cases (64-bit BSWAP is much slower), and
  perhaps the conversion from flags to the -1/1 result using SBB/OR.  (All
  this applies to diffin4bytes/diffin8bytes.)
- Unaligned loads overlapping with data previously compared as equal.
  This data cannot affect the outcome of the final comparison.  (This
  trick was already used before, but it's not entirely obvious.)
- The SETNE/CMOVAE sequence for the branchless sign function.
- Cases where the sign function isn't needed because the result of the
  sign subtraction fits into the 32-bit int result.
Anything else?
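For readers not fluent in the assembly, a C sketch may help motivate the
comments.  This is purely illustrative (the patch itself is SSE 4.1
assembly, and `diff_in_8_bytes`/`memcmp_sketch` are made-up names): it
shows the big-endian conversion that makes one integer comparison agree
with memcmp's address ordering, a branchless sign computation analogous
to the SBB/OR and SETNE/CMOVAE sequences, and the overlapping tail load,
which here assumes n >= 8 (real implementations handle shorter lengths
separately, e.g. via the 32-bit path).

```c
#include <stdint.h>
#include <string.h>

static int
diff_in_8_bytes (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  /* Byte-swap so that the byte at the lowest address becomes the most
     significant byte of the integer; an unsigned comparison then agrees
     with memcmp's byte-by-byte ordering.  */
  x = __builtin_bswap64 (x);
  y = __builtin_bswap64 (y);
  /* Branchless sign: -1, 0 or 1 without a conditional jump.  */
  return (x > y) - (x < y);
}

static int
memcmp_sketch (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1;
  const unsigned char *b = s2;
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    {
      int d = diff_in_8_bytes (a + i, b + i);
      if (d != 0)
        return d;
    }
  if (i < n)
    /* Overlapping tail load (assumes n >= 8): up to 7 of these bytes
       were already compared equal above, and since they are identical
       in both inputs they cannot change the result.  */
    return diff_in_8_bytes (a + n - 8, b + n - 8);
  return 0;
}
```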
> * Verification via the microbenchmarks that it actually makes
>   things faster.  Additionally we'd like you to add to the microbenchmarks
>   if you find we need another test to measure your changes.
>   See benchtests/README.  We want to be as objective as possible here
>   and allow others to run the same or similar tests to compare your
>   patch on their hardware.
There already is bench-memcmp. I analyzed its output (after bumping the
inner loop count to a reasonable value) and it does not show that the
new implementation is slower across the board. I also verified this
using R, with a paired t-test: the new implementation appears to be
slightly faster, but it's not statistically significant.
In addition, I performed specialized comparisons for varying lengths and
shared prefixes, and benchmarked qsort on random data and
/usr/share/dict/words. My test loop uses CPUID, yet it turns out that
getting good measurements of individual memcmp calls is really quite
tricky. My benchmark code is available here:
<https://github.com/fweimer/memcmp-timing>
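For reference, the serialization idiom alluded to above (CPUID as a
barrier before RDTSC) typically looks like the following on x86-64.
This is a generic sketch, not code taken from the repository linked
above, and `serialized_rdtsc` is a name I made up for illustration.

```c
#include <stdint.h>

/* CPUID is a serializing instruction: it forces earlier instructions to
   retire before RDTSC samples the time-stamp counter, so the reading is
   not distorted by out-of-order execution.  */
static inline uint64_t
serialized_rdtsc (void)
{
  uint32_t lo, hi;
  __asm__ volatile ("cpuid\n\t"
                    "rdtsc"
                    : "=a" (lo), "=d" (hi)
                    : "a" (0)
                    : "ebx", "ecx", "memory");
  return ((uint64_t) hi << 32) | lo;
}
```

Even with this, the overhead of the CPUID/RDTSC pair itself has to be
measured and subtracted, which is part of why timing individual memcmp
calls is so tricky.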
What would be a reasonable benchmark? We could read the manual sources,
permute their lines randomly, sort them using qsort, and measure the
total time used for sorting. Or we could generate random input data
based on some distribution and measure sorting time for that. The
challenge is to get reasonable input data without bundling large files.
If we reuse something that is part of the source code distribution
anyway, comparisons between different releases are not valid, but then
neither is anything that uses qsort.  I would still recommend the
sorting approach because it seems realistic to me, and qsort does not
add all that much overhead over the actual comparison.
One thing I would like to get consensus on eventually is whether future
implementations of memcmp must not behave as a timing oracle. (There
might be architectures where achieving this has a run-time cost.) If we
are unwilling to make this promise, the oracle-free version would have
to be conditional on _FORTIFY_SOURCE.
--
Florian Weimer / Red Hat Product Security Team