This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On Thu, Jun 1, 2017 at 11:39 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/01/2017 07:19 PM, H.J. Lu wrote:
>> On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>>>> +L(between_4_7):
>>>> +     vmovd   (%rdi), %xmm1
>>>> +     vmovd   (%rsi), %xmm2
>>>> +     VPCMPEQ %xmm1, %xmm2, %xmm2
>>>> +     vpmovmskb %xmm2, %eax
>>>> +     subl    $0xffff, %eax
>>>> +     jnz     L(first_vec)
>>>
>>> Is this really faster than two 32-bit bswaps followed by a sub?
>>
>> Can you elaborate on how to use bswap here?
>
> Something like this:
>
>   /* Load 4 to 7 bytes into an 8-byte word.
>      ABCDEFG turns into GFEDDCBA.
>      ABCDEF  turns into FEDCDCBA.
>      ABCDE   turns into EDCBDCBA.
>      ABCD    turns into DCBADCBA.
>      bswapq below reverses the order of bytes.
>      The duplicated bytes do not affect the comparison result.  */
>   movl -4(%rdi, %rdx), R1
>   shlq $32, R1
>   movl -4(%rsi, %rdx), R2
>   shlq $32, R2
>   movl (%rdi), R3
>   orq R3, R1
>   movl (%rsi), R3
>   orq R3, R2
>   /* Variant below starts after this point. */
>   cmpq R1, R2
>   jne L(diffin8bytes)
>   xor %eax, %eax
>   ret
>
> L(diffin8bytes):
>   bswapq R1
>   bswapq R2
>   cmpq R1, R2
>   sbbl %eax, %eax       /* Set to -1 if R1 < R2, otherwise 0.  */
>   orl $1, %eax          /* Turn 0 into 1, but preserve -1.  */
>   ret

I don't think it works with memcmp, since the return value depends on
the first byte that differs.  Say

ABCDE   turns into EDCBDCBA

If all the bytes differ, we should only compare A, not EDCBDCBA.

> (Not sure about the right ordering for R1 and R2 here.)
>
> There's a way to avoid the conditional jump completely, but whether
> that's worthwhile depends on the cost of the bswapq and the cmove:
>
>   bswapq R1
>   bswapq R2
>   xorl R3, R3
>   cmpq R1, R2
>   sbbl %eax, %eax
>   orl $1, %eax
>   cmpq R1, R2
>   cmove R3, %eax
>   ret
>
> See this patch and the related discussion:
>
>   <https://sourceware.org/ml/libc-alpha/2014-02/msg00139.html>
>
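
For reference, here is a minimal C sketch of the overlapping-load plus
bswap idea for a 4-to-7-byte tail, assuming a little-endian x86-64
target and the GCC/Clang __builtin_bswap64 builtin; the helper name is
only illustrative and is not part of the proposed patch.  The final
sign computation corresponds to the sbb/or (or cmove) sequences quoted
above.

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical helper: compare 4 to 7 bytes, memcmp-style result.  */
  static int
  memcmp_4_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint32_t lo1, hi1, lo2, hi2;

    /* Overlapping loads: the first 4 and the last 4 bytes cover the
       whole 4..7-byte range; the duplicated middle bytes do not change
       the result.  */
    memcpy (&lo1, s1, 4);
    memcpy (&hi1, s1 + n - 4, 4);
    memcpy (&lo2, s2, 4);
    memcpy (&hi2, s2 + n - 4, 4);

    uint64_t w1 = ((uint64_t) hi1 << 32) | lo1;
    uint64_t w2 = ((uint64_t) hi2 << 32) | lo2;

    /* bswap moves the byte at the lowest address into the most
       significant position, so an unsigned compare of the swapped
       words orders the two buffers by the first differing byte.  */
    w1 = __builtin_bswap64 (w1);
    w2 = __builtin_bswap64 (w2);
    return (w1 > w2) - (w1 < w2);
  }
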
>>> What is ensuring alignment, so that the vmovd instructions cannot fault?
>>
>> What do you mean?  This sequence compares the last 4 bytes using
>> vmovd, which loads 4 bytes and zeroes out the high 12 bytes, followed
>> by VPCMPEQ.  If they aren't the same, it jumps to L(first_vec).
>
> Ah, I see now.  The loads overlap.  Maybe add a comment to that effect?

I will add

/* Use overlapping loads to avoid branches.  */
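
For reference, here is a rough C sketch, using the SSE2 counterparts of
vmovd/VPCMPEQ/vpmovmskb, of how two overlapping 4-byte loads can cover
a 4-to-7-byte range without branching on the exact length; the helper
name is hypothetical and this only illustrates the idea, not the actual
patch (the quoted snippet instead jumps to L(first_vec) to compute the
byte-wise return value).

  #include <stdint.h>
  #include <string.h>
  #include <emmintrin.h>

  /* Return nonzero if the first n (4 <= n <= 7) bytes differ.  */
  static int
  between_4_7_differ (const unsigned char *s1, const unsigned char *s2,
                      size_t n)
  {
    uint32_t a1, b1, a2, b2;

    /* Overlapping loads: the first 4 and last 4 bytes cover every n in
       4..7, and both loads stay inside the buffers.  */
    memcpy (&a1, s1, 4);
    memcpy (&b1, s2, 4);
    memcpy (&a2, s1 + n - 4, 4);
    memcpy (&b2, s2 + n - 4, 4);

    /* vmovd puts 4 bytes in the low lanes and zeroes the upper 12 bytes,
       so a 0xffff byte mask from the compare means the dwords are equal
       (the zeroed upper lanes always compare equal).  */
    __m128i eq1 = _mm_cmpeq_epi8 (_mm_cvtsi32_si128 ((int) a1),
                                  _mm_cvtsi32_si128 ((int) b1));
    __m128i eq2 = _mm_cmpeq_epi8 (_mm_cvtsi32_si128 ((int) a2),
                                  _mm_cvtsi32_si128 ((int) b2));

    return (_mm_movemask_epi8 (eq1) != 0xffff)
           || (_mm_movemask_epi8 (eq2) != 0xffff);
  }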

-- 
H.J.

