This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: memcmp-sse4.S EqualHappy bug


On Thu, Jun 18, 2015 at 05:54:42PM +0200, Andrea Arcangeli wrote:
> On Wed, Jun 17, 2015 at 10:19:58PM +0200, Ondřej Bílka wrote:
> I fully understand your arguments about the standard and I expected
> this behavior was permitted.
> 
> I'm also pointing out that with READ_ONCE/WRITE_ONCE/volatile/asm("memory")
> we already go a bit beyond what we can strictly expect from C, in order to
> provide RCU (and to implement spinlocks/mutexes). I just wanted to express
> my views on the practical aspects and on how we could guarantee that, if a
> part of the compared memory never changes and never matches (a part at
> least as large as the atomic access granularity of the arch, a size you
> need to know and which isn't 1 byte minimum on alpha, for example), then
> memcmp is well defined in that it can't return 0. That is, if it returns 0
> it actually read all "length" bytes, and at some point in time each byte
> individually was equal; in our case the last part of the page is never
> changed and never equal, so 0 would be impossible.
> 
> I'm fine if no change is done, and it'd be great if at least the man page
> of memcmp/bcmp were updated. If it were up to me, though, I'd prefer to
> fix this case so that 0 isn't happily returned too soon and unexpectedly,
> especially since the unrolled-loop fast path wouldn't require any change.
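
A concrete restatement of that scenario, as I understand it, is roughly the
following deliberately racy sketch (buffer names and sizes are made up here,
not taken from your report): one thread keeps flipping an early byte while
the last byte of the compared range never matches, and the question is
whether memcmp can still return 0.

#include <pthread.h>
#include <string.h>

static char a[4096], b[4096];

static void *
writer (void *arg)
{
  for (;;)
    a[0] ^= 1;                /* unsynchronized writes on purpose */
  return arg;
}

int
main (void)
{
  pthread_t t;
  b[4095] = 1;                /* the last byte is never equal */
  pthread_create (&t, NULL, writer, NULL);
  for (int i = 0; i < 1000000; i++)
    if (memcmp (a, b, sizeof a) == 0)
      return 1;               /* the surprising "equal" result */
  return 0;
}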

I see now. As I am writing a new memcmp I don't consider that likely to
happen, as it adds extra overhead that's hard to justify.

Rereading is needed for good performance: the loop checks 64 bytes at once,
and SSE2 uses destructive operations, so the original data is no longer in
registers when we need to compute the return value.
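
Roughly, and simplified from 64 bytes down to a single 16-byte vector (this
uses intrinsics instead of the real assembly, so the names are made up),
the fast path does something like the sketch below, which is why the
differing byte has to be read from memory a second time:

#include <emmintrin.h>

static int
cmp16_sketch (const unsigned char *s1, const unsigned char *s2)
{
  __m128i v1 = _mm_loadu_si128 ((const __m128i *) s1);
  __m128i v2 = _mm_loadu_si128 ((const __m128i *) s2);
  /* The compare overwrites the loaded data, so the original bytes are
     gone from the registers once a difference is detected.  */
  v1 = _mm_cmpeq_epi8 (v1, v2);
  int mask = _mm_movemask_epi8 (v1);
  if (mask == 0xffff)
    return 0;                    /* all 16 bytes were equal when loaded */
  int i = __builtin_ctz (~mask); /* index of the first differing byte */
  /* Second read: a concurrent writer may have changed these bytes
     since the vector loads above.  */
  return s1[i] - s2[i];
}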

The best workaround would be to add, after the final subtraction, a check
whether the result is zero, and if so call

memcmp(x+found+1, y+found+1, remaining)

That could be almost free, as you would only need to add a je branch after
the subtraction.
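
In C-like terms the idea is roughly the following; "found" and "remaining"
are placeholders for values the final code would already have at hand, not
names from the actual implementation:

#include <string.h>

/* Sketch only: if the optimized pass concluded "equal", verify once more
   starting just past the last byte it is known to have checked, so that a
   concurrently changing byte cannot turn a real difference into a
   spurious 0.  */
static int
recheck_if_zero (const char *x, const char *y, size_t found,
                 size_t remaining, int result)
{
  if (result == 0 && remaining > 0)
    return memcmp (x + found + 1, y + found + 1, remaining);
  return result;
}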

However, I now need to test a new way of checking the first 16 bytes that
would avoid rereading. The problem is that it would scale worse when you
need to combine the results into two masks per 64-byte block instead of
one.

mov %rdx, %rcx          # rcx = length
neg %rcx                # rcx = -length, used as shift count below
movdqu (%rsi), %xmm0    # first 16 bytes of the second buffer
movdqu (%rdi), %xmm1    # first 16 bytes of the first buffer
movdqu %xmm0, %xmm2     # keep a copy before the destructive compares
pcmpgtb %xmm1, %xmm0    # xmm0: bytes where (%rsi) > (%rdi) (signed)
pcmpgtb %xmm2, %xmm1    # xmm1: bytes where (%rdi) > (%rsi) (signed)
pmovmskb %xmm0, %eax    # turn both byte masks into integer bitmasks
pmovmskb %xmm1, %edx
bswap %eax              # byte-swap the 32-bit registers holding the masks
bswap %edx
shr %cl, %eax           # shift by (-length mod 32), intended to drop
shr %cl, %edx           # mask bits for bytes past the requested length
cmp $-16, %rcx
jae L(next_48_bytes)    # taken when -length >= -16 unsigned, i.e. length <= 16
sub %edx, %eax          # difference of the two masks is the return value
L(ret):
ret
L(next_48_bytes):
sub %edx, %eax
jne L(ret)              # first 16 bytes already differ, return that result

