This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: memcmp-sse4.S EqualHappy bug

From: Andrea Arcangeli <aarcange at redhat dot com>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: Szabolcs Nagy <nsz at port70 dot net>, libc-alpha at sourceware dot org, "H.J. Lu" <hongjiu dot lu at intel dot com>, "Dr. David Alan Gilbert" <dgilbert at redhat dot com>, Simo Sorce <ssorce at redhat dot com>
Date: Fri, 19 Jun 2015 15:29:22 +0200
Subject: Re: memcmp-sse4.S EqualHappy bug
Authentication-results: sourceware.org; auth=none
References: <20150617172903 dot GC4317 at redhat dot com> <20150617185952 dot GE22285 at port70 dot net> <20150617201958 dot GA12298 at domone> <20150618155442 dot GI14955 at redhat dot com> <20150618180517 dot GA25190 at domone>

On Thu, Jun 18, 2015 at 08:05:17PM +0200, Ondřej Bílka wrote:
> I see now. As I am writing a new memcmp, I don't see that as likely,
> as it adds extra overhead that's hard to justify.
> 
> Rereading is needed for good performance: a loop checks 64 bytes at
> once and sse2 uses destructive operations, so the original data won't
> be there.
> 
> The best workaround would be to add, after the final subtraction, a
> check whether it's zero, and if so call
> memcmp(x+found+1, y+found+1, remaining)
> 
> That could be almost free, as you just need to add a je branch after
> the subtraction.

Yes, it's free for the unrolled loop. Only the breakout of the
unrolled loop needs to adjust rdx in addition to rsi/rdi, so that it
can check whether it's at the end before returning zero.
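
To make the control flow concrete, here is a rough C sketch of that
workaround (not the glibc code). The names memcmp_recheck and
first_diff are hypothetical, and first_diff is only a byte-wise
stand-in for the vectorised 64-byte check. The volatile qualifiers are
there just to model memory that may change between the check and the
re-read; a real data race is still undefined behaviour in C.

```c
#include <stddef.h>

/* Hypothetical stand-in for the vectorised check: index of the first
   differing byte, or len if the two ranges compare equal. */
static size_t first_diff(const volatile unsigned char *x,
                         const volatile unsigned char *y, size_t len)
{
    size_t i = 0;
    while (i < len && x[i] == y[i])
        i++;
    return i;
}

/* memcmp-like comparison that never returns 0 just because the
   re-read of a differing byte happened to come out equal. */
int memcmp_recheck(const void *a, const void *b, size_t n)
{
    const volatile unsigned char *x = a;
    const volatile unsigned char *y = b;

    while (n > 0) {
        size_t found = first_diff(x, y, n);
        if (found == n)
            return 0;               /* reached the end: really equal */

        /* Re-read the byte that the check flagged as different. */
        int diff = (int)x[found] - (int)y[found];
        if (diff != 0)
            return diff;

        /* The re-read no longer differs: skip that byte and keep
           comparing the remainder instead of returning 0. */
        x += found + 1;
        y += found + 1;
        n -= found + 1;
    }
    return 0;
}
```

The remaining length is tracked alongside the two pointers, which is
the C counterpart of adjusting rdx together with rsi/rdi.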

> However, now I need to test a new way to check the first 16 bytes that
> would avoid rereading. The problem is that it would scale worse when
> you need to combine results into two 64-byte masks instead of one.
> 
> mov %rdx, %rcx
> neg %rcx
> movdqu (%rsi), %xmm0
> movdqu (%rdi), %xmm1
> movdqu %xmm0, %xmm2
> pcmpgtb %xmm1, %xmm0
> pcmpgtb %xmm2, %xmm1
> pmovmskb %xmm0, %eax
> pmovmskb %xmm1, %edx
> bswap %eax
> bswap %edx

I think the unrolled loop is faster if it does ptest only; those are
many more instructions than the current code executes in its sse4
part. I guess it's likely measurably slower, but that's just a guess
and I haven't benchmarked.
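
As a portable illustration of what an equality-only loop looks like
(a sketch, not the sse4 code, with no claim about its speed): the loop
below only asks whether a 16-byte block differs at all and leaves
finding the position and sign to the breakout path. The names
block16_equal and equal_prefix_blocks are hypothetical, and the
XOR/OR test on two 64-bit words merely stands in for testing an xmm
register for all-zero with ptest.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the 16 bytes at x and y are identical, 0 otherwise.
   XOR is zero exactly where the inputs agree, so OR-ing both halves
   and testing for zero answers "any difference?" with one test. */
static int block16_equal(const unsigned char *x, const unsigned char *y)
{
    uint64_t x0, x1, y0, y1;
    memcpy(&x0, x, 8);
    memcpy(&x1, x + 8, 8);
    memcpy(&y0, y, 8);
    memcpy(&y1, y + 8, 8);
    return ((x0 ^ y0) | (x1 ^ y1)) == 0;
}

/* Number of leading bytes covered by fully equal 16-byte blocks.
   The caller handles the first unequal block and any tail shorter
   than 16 bytes separately. */
size_t equal_prefix_blocks(const void *a, const void *b, size_t n)
{
    const unsigned char *x = a;
    const unsigned char *y = b;
    size_t off = 0;

    while (n - off >= 16 && block16_equal(x + off, y + off))
        off += 16;
    return off;
}
```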

My point is that the problem is not re-reading; re-reading is fine.
Whether the result is above or below zero would be undefined anyway,
whether we re-read or not. The only defined thing is that the function
cannot return 0, and it currently does.

> shr %cl, %eax
> shr %cl, %edx
> cmp $-16, %rcx
> jae L(next_48_bytes)
> sub %edx, %eax
> L(ret):
> ret
> L(next_48_bytes):
> sub %edx, %eax
> jne L(ret)
>
