This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
On Thu, Jun 1, 2017 at 11:39 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/01/2017 07:19 PM, H.J. Lu wrote:
>> On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>>>> +L(between_4_7):
>>>> + vmovd (%rdi), %xmm1
>>>> + vmovd (%rsi), %xmm2
>>>> + VPCMPEQ %xmm1, %xmm2, %xmm2
>>>> + vpmovmskb %xmm2, %eax
>>>> + subl $0xffff, %eax
>>>> + jnz L(first_vec)
>>>
>>> Is this really faster than two 32-bit bswaps followed by a sub?
>>
>> Can you elaborate how to use bswap here?
>
> Something like this:
>
> /* Load 4 to 7 bytes into an 8-byte word.
> ABCDEFG turns into GFEDDCBA.
> ABCDEF turns into FEDCDCBA.
> ABCDE turns into EDCBDCBA.
> ABCD turns into DCBADCBA.
> bswapq below reverses the order of bytes.
> The duplicated bytes do not affect the comparison result. */
> movl -4(%rdi, %rdx), R1
> shlq $32, R1
> movl -4(%rsi, %rdx), R2
> shlq $32, R2
> movl (%rdi), R3
> orq R3, R1
> movl (%rsi), R3
> orq R3, R2
> /* Variant below starts after this point. */
> cmpq R1, R2
> jne L(diffin8bytes)
> xor %eax, %eax
> ret
>
> L(diffin8bytes):
> bswapq R1
> bswapq R2
> cmpq R1, R2
> sbbl %eax, %eax /* Set to -1 if R1 < R2, otherwise 0. */
> orl $1, %eax /* Turn 0 into 1, but preserve -1. */
> ret
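A little-endian C model of the sequence above may make the byte duplication easier to follow (the helper names merge47 and cmp47 are illustrative only, not part of the patch):

```c
#include <stdint.h>
#include <string.h>

/* Merge the first and last 4 bytes of an n-byte buffer (4 <= n <= 7)
   into one 64-bit word.  The middle bytes land in both halves, but
   since they are duplicated identically for both inputs they cannot
   change the comparison result.  */
static uint64_t
merge47 (const unsigned char *p, size_t n)
{
  uint32_t lo, hi;
  memcpy (&lo, p, 4);           /* movl (%rdi), R3          */
  memcpy (&hi, p + n - 4, 4);   /* movl -4(%rdi, %rdx), R1  */
  return ((uint64_t) hi << 32) | lo;
}

/* memcmp-style result for 4 to 7 bytes: bswap makes the first byte
   in memory the most significant, so an unsigned word compare
   matches a lexicographic byte compare.  */
static int
cmp47 (const unsigned char *a, const unsigned char *b, size_t n)
{
  uint64_t r1 = merge47 (a, n);
  uint64_t r2 = merge47 (b, n);
  if (r1 == r2)
    return 0;
  r1 = __builtin_bswap64 (r1);  /* bswapq R1 */
  r2 = __builtin_bswap64 (r2);  /* bswapq R2 */
  return r1 < r2 ? -1 : 1;
}
```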
I don't think it works with memcmp since the return value depends on
the first byte that differs. Say
ABCDE turns into EDCBDCBA.
If all bytes differ, we should compare only A, not all of EDCBDCBA.
> (Not sure about the right ordering for R1 and R2 here.)
>
> There's a way to avoid the conditional jump completely, but whether
> that's worthwhile depends on the cost of the bswapq and the cmove:
>
> bswapq R1
> bswapq R2
> xorl R3, R3
> cmpq R1, R2
> sbbl %eax, %eax
> orl $1, %eax
> cmpq R1, R2
> cmove R3, %eax
> ret
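In C terms, the branchless tail computes roughly the following (a sketch only; the operand order follows the intent stated in the comments, subject to the R1/R2 ordering caveat above):

```c
#include <stdint.h>

/* Branchless -1/0/+1 from two byte-swapped words: sbbl materializes
   the borrow as -1 or 0, orl $1 maps 0 to +1 while preserving -1,
   and the cmove overrides the result with 0 when the words are
   equal.  (The second cmpq in the assembly is needed because orl
   clobbered the flags.)  */
static int
branchless_sign (uint64_t r1, uint64_t r2)
{
  int eax = r1 < r2 ? -1 : 0;   /* cmpq; sbbl %eax, %eax */
  eax |= 1;                     /* orl $1, %eax          */
  if (r1 == r2)                 /* cmpq; cmove R3, %eax  */
    eax = 0;
  return eax;
}
```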
>
> See this patch and the related discussion:
>
> <https://sourceware.org/ml/libc-alpha/2014-02/msg00139.html>
>
>>> What is ensuring alignment, so that the vmovd instructions cannot fault?
>>
>> What do you mean? This sequence compares the last 4 bytes with
>> vmovd, which loads 4 bytes and zeroes out the high 12 bytes, and
>> VPCMPEQ. If they aren't the same, go to L(first_vec).
>
> Ah, I see now. The loads overlap. Maybe add a comment to that effect?
I will add
/* Use overlapping loads to avoid branches. */
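For the record, the zeroing behavior of vmovd is what makes the mask test work: the upper 12 bytes of both XMM registers are zero, so those positions always compare equal and contribute set bits to the vpmovmskb result, and `subl $0xffff` is then zero exactly when the 4 loaded bytes match. A portable C model of that (hypothetical helper, for illustration only):

```c
/* Model of the vmovd/VPCMPEQ/vpmovmskb sequence in plain C.  Bits
   4..15 of the mask are always set because vmovd zeroed those byte
   positions in both registers; the low 4 bits reflect the actual
   byte compares, so the mask equals 0xffff iff the 4 bytes match.  */
static unsigned
vmovd_cmp_mask (const unsigned char *a, const unsigned char *b)
{
  unsigned mask = 0xfff0;       /* zeroed upper bytes compare equal */
  int i;
  for (i = 0; i < 4; i++)
    if (a[i] == b[i])
      mask |= 1u << i;          /* vpmovmskb: one bit per byte */
  return mask;
}
```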
--
H.J.