This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: memcmp-sse4.S EqualHappy bug
- From: Andrea Arcangeli <aarcange at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Szabolcs Nagy <nsz at port70 dot net>, libc-alpha at sourceware dot org, "H.J. Lu" <hongjiu dot lu at intel dot com>, "Dr. David Alan Gilbert" <dgilbert at redhat dot com>, Simo Sorce <ssorce at redhat dot com>
- Date: Fri, 19 Jun 2015 15:29:22 +0200
- Subject: Re: memcmp-sse4.S EqualHappy bug
- References: <20150617172903 dot GC4317 at redhat dot com> <20150617185952 dot GE22285 at port70 dot net> <20150617201958 dot GA12298 at domone> <20150618155442 dot GI14955 at redhat dot com> <20150618180517 dot GA25190 at domone>
On Thu, Jun 18, 2015 at 08:05:17PM +0200, Ondřej Bílka wrote:
> I see now. As I am writing a new memcmp, I don't see that as likely, as it
> adds extra overhead that's hard to justify.
>
> Rereading is needed for good performance: the loop checks 64 bytes at
> once, and sse2 uses destructive operations, so the original data won't be there.
>
> The best workaround would be to add a check after the final subtraction: if
> it's zero, then call
> memcmp(x+found+1, y+found+1, remaining)
>
> That could be almost free, as you just need to add a je branch after the
> subtraction.
Yes, it's free for the unrolled loop; only the breakout path of the
unrolled loop needs to adjust rdx in addition to rsi/rdi, so that it can
check whether it's at the end before returning zero.
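To make the workaround concrete, here is a minimal scalar sketch in C of the "check for zero after the final subtraction, then continue" idea. It is not the glibc implementation: the function name finish_compare and its parameters are hypothetical, and it assumes an earlier read has already flagged offset found as differing. The remaining length plays the role that rdx plays in the breakout path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not glibc code: models the tail of a memcmp that
   has already decided, from an earlier read, that x[found] != y[found]. */
static int finish_compare(const unsigned char *x, const unsigned char *y,
                          size_t found, size_t n)
{
    assert(found < n);

    /* Second read of the same bytes; a concurrent writer may have
       changed them since the first read. */
    int diff = (int)x[found] - (int)y[found];
    if (diff != 0)
        return diff;

    /* The re-read bytes are equal now.  Instead of returning 0 here,
       keep comparing whatever is left (the "je" branch in the quote). */
    size_t remaining = n - found - 1;
    return memcmp(x + found + 1, y + found + 1, remaining);
}
```

With this shape, zero is returned only after the remaining bytes have been examined up to the end of the buffers, which is why the remaining length has to be kept up to date.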
> However, now I need to test a new way to check the first 16 bytes that would
> avoid rereading. The problem is that it would scale worse when you need to
> combine results into two 64-byte masks instead of one.
>
> mov %rdx, %rcx
> neg %rcx
> movdqu (%rsi), %xmm0
> movdqu (%rdi), %xmm1
> movdqu %xmm0, %xmm2
> pcmpgtb %xmm1, %xmm0
> pcmpgtb %xmm2, %xmm1
> pmovmskb %xmm0, %eax
> pmovmskb %xmm1, %edx
> bswap %eax
> bswap %edx
I think the unrolled loop is faster if it does ptest only; those are
plenty more sse4 instructions than the current code uses in the sse4
part. I guess it's likely measurably slower, but that is just a
guess and I haven't benchmarked it.
Just saying the problem is not re-reading; re-reading is fine. Whether the
result is above or below zero would be undefined anyway, whether or not we
re-read. The only defined thing is that the function cannot return 0,
and it currently does.
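A deterministic C model of that failure mode, for illustration only. It assumes the false zero comes from a detect-then-re-read sequence as discussed in this thread; it is not the memcmp-sse4.S code. The names two_read_compare, writer_fn and flip are hypothetical, and the callback stands in for a concurrent writer running between the two reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for a concurrent writer. */
typedef void (*writer_fn)(unsigned char *x, unsigned char *y);

/* Model of a detect-then-re-read compare.  NOT glibc's code. */
static int two_read_compare(unsigned char *x, unsigned char *y, size_t n,
                            writer_fn between_reads)
{
    assert(x != NULL && y != NULL);

    for (size_t i = 0; i < n; i++) {
        if (x[i] == y[i])            /* first read: look for a difference */
            continue;
        if (between_reads)
            between_reads(x, y);     /* the "other thread" runs here */
        /* Second read: the result is computed from fresh data, so it
           can be 0 even though a difference was just detected. */
        return (int)x[i] - (int)y[i];
    }
    return 0;
}

/* Writer that first makes byte 1 differ and only then makes byte 0
   equal, so the two buffers are never equal at any point in time. */
static void flip(unsigned char *x, unsigned char *y)
{
    y[1] = (unsigned char)(x[1] + 1);
    y[0] = x[0];
}
```

Called on x = {1, 2} and y = {9, 2} with flip as the writer, the model returns 0 although the buffers differ both before and after the call. Without the writer, the sign of the result is meaningful; with it, only the zero is wrong.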
> shr %cl, %eax
> shr %cl, %edx
> cmp $-16, %rcx
> jae L(next_48_bytes)
> sub %edx, %eax
> L(ret):
> ret
> L(next_48_bytes):
> sub %edx, %eax
> jne L(ret)
>