This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp


On 01/31/2014 10:09 AM, Florian Weimer wrote:
> This patch vectorizes the difference extraction in memcmp.  It does
> so by loading all relevant input bytes, converting them to big-endian
> using BSWAP, and comparing the result.
> 
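A minimal C sketch of that idea (untested; the helper name and layout here are mine, not the patch's, and the real code is x86_64 assembly) for one 8-byte chunk:

#include <stdint.h>
#include <string.h>

static int
memcmp_chunk8 (const unsigned char *a, const unsigned char *b)
{
  uint64_t va, vb;
  memcpy (&va, a, 8);           /* unaligned little-endian loads */
  memcpy (&vb, b, 8);
  va = __builtin_bswap64 (va);  /* BSWAP: memory order becomes integer order */
  vb = __builtin_bswap64 (vb);
  if (va == vb)
    return 0;
  return va > vb ? 1 : -1;      /* sign decided by the first differing byte */
}

After the byte swap, a plain unsigned comparison orders the two words by their first differing byte, which is exactly memcmp's lexicographic contract.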
> This changes the return value of memcmp, but stays within the
> interface contract.  Some corner cases in the old implementation do
> not return actual byte differences, either, so I hope this will not
> break applications.
> 
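To make the contract point concrete: ISO C specifies only the sign of memcmp's result, never its magnitude. For the single bytes 0x10 and 0x30, a byte-subtraction implementation returns -32 while the vectorized one may return -1; both are conforming, and portable callers must test only the sign:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char a[1] = { 0x10 };
  unsigned char b[1] = { 0x30 };
  /* Any negative value is a conforming result here; test the sign only.  */
  printf ("%d\n", memcmp (a, b, 1) < 0);   /* prints 1 */
  return 0;
}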
> On my i7-2620M, some lengths (mostly 5, 6, 7) are about one cycle
> slower than the old implementation, but only if all branches in the
> old implementation are correctly predicted.  If there is
> misprediction (as there is bound to be when sorting strings), the
> vectorized version is consistently faster.  The glibc benchtests
> favor the old implementation[*], but the speed difference between
> tests slightly favors the new implementation (but the difference is
> not clearly significant).  I could probably game that by using a
> data-dependent branch for those lengths, but I think the fully
> vectorized version is better for real-world input data.
> 
> In case BSWAP turns out to be too slow for some CPUs, it could be
> replaced with BSF and a shift (both constant time in current
> processors), like we already do in some of the existing memcmp
> implementations.
> 
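A rough C equivalent of that BSF-and-shift variant (again just a sketch under my assumptions: little-endian loads, with __builtin_ctzll standing in for BSF):

#include <stdint.h>
#include <string.h>

static int
memcmp_chunk8_bsf (const unsigned char *a, const unsigned char *b)
{
  uint64_t va, vb, diff;
  memcpy (&va, a, 8);
  memcpy (&vb, b, 8);
  diff = va ^ vb;
  if (diff == 0)
    return 0;
  /* BSF locates the lowest differing bit; round down to its byte.  */
  unsigned int shift = __builtin_ctzll (diff) & ~7u;
  unsigned int byte_a = (va >> shift) & 0xff;
  unsigned int byte_b = (vb >> shift) & 0xff;
  return byte_a > byte_b ? 1 : -1;
}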
> This work was prompted by a desire to make the time taken by memcmp
> independent of the length of the shared prefix of the arguments as
> far as possible.  The current implementation can still be used to
> confirm a correct guess of a 4-byte or 8-byte prefix, but this is
> extremely unlikely and not suitable for use in a timing oracle (at
> least the 8-byte guess).  The new implementation should not act as a
> cache oracle either because it does not perform data-dependent
> single-byte loads.
> 
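For contrast, the property being designed away is the early exit of a classic byte-wise loop, whose running time grows with the length of the shared prefix (illustrative code, not any of the glibc implementations):

#include <stddef.h>

static int
memcmp_bytewise (const unsigned char *a, const unsigned char *b, size_t n)
{
  for (size_t i = 0; i < n; i++)   /* iteration count leaks the prefix length */
    if (a[i] != b[i])
      return a[i] > b[i] ? 1 : -1;
  return 0;
}

An attacker who can time such a loop can confirm prefix guesses byte by byte; comparing whole byte-swapped chunks limits the signal to the coarse 4-byte or 8-byte granularity described above.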
> If such a change is acceptable in principle, I will work on
> similar adjustments to the other x86 implementations.

I think any such changes are acceptable as long as we have:

* A detailed patch description such as this one. Thanks for that!

* More comments in the assembly explaining the structure of the
  new code and how it operates.

* Verification via the microbenchmarks that it actually makes
  things faster. Additionally, we'd like you to add to the microbenchmarks
  if you find we need another test to measure your changes.
  See benchtests/README. We want to be as objective as possible here
  and allow others to run the same or similar tests to compare your
  patch on their hardware.

> [*] Tight loop over identical inputs, after bumping the inner loop
> count from 64 to 4096.  If I don't do this, the tests are very
> fragile, but the new implementation seems to be roughly twice as fast
> across the board.  Using qsort to sort random strings of fixed length
> is about 10% faster overall with the new implementation.  I still
> need to try to sort a random permutation of /usr/dict/word and
> compare the results.
> 

Cheers,
Carlos.

