This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>> +L(between_4_7):
>> +     vmovd   (%rdi), %xmm1
>> +     vmovd   (%rsi), %xmm2
>> +     VPCMPEQ %xmm1, %xmm2, %xmm2
>> +     vpmovmskb %xmm2, %eax
>> +     subl    $0xffff, %eax
>> +     jnz     L(first_vec)
>
> Is this really faster than two 32-bit bswaps followed by a sub?

Can you elaborate on how to use bswap here?
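
For illustration, one possible reading of "two 32-bit bswaps followed by a sub" is sketched below in C. This is an assumption about what was meant, not something confirmed in this thread. The helper names load_bswap32 and cmp4_bswap are hypothetical, the code assumes a little-endian host such as x86-64, and __builtin_bswap32 is a GCC/Clang builtin. After the swap, the first byte in memory is the most significant byte of the integer, so unsigned integer order matches memcmp order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load 4 bytes (any alignment) and byte-swap them so that the first
   byte in memory becomes the most significant byte of the integer.
   Assumes a little-endian host; on x86-64 the builtin typically
   compiles to a single bswap instruction.  */
static uint32_t
load_bswap32 (const void *p)
{
  uint32_t v;
  memcpy (&v, p, sizeof v);
  return __builtin_bswap32 (v);
}

/* Hypothetical helper: memcmp-style ordering of exactly 4 bytes.  */
static int
cmp4_bswap (const void *s1, const void *s2)
{
  assert (s1 != NULL && s2 != NULL);
  uint32_t a = load_bswap32 (s1);
  uint32_t b = load_bswap32 (s2);
  /* A 32-bit a - b can wrap around, so the sign is derived from two
     comparisons here instead of a plain subtraction.  */
  return (a > b) - (a < b);
}
```

The sketch does not use a bare 32-bit subtraction for the result because the difference of two full 32-bit values does not fit in a signed int; an assembly version would have to handle that case as well.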

>> +     leaq    -4(%rdi, %rdx), %rdi
>> +     leaq    -4(%rsi, %rdx), %rsi
>> +     vmovd   (%rdi), %xmm1
>> +     vmovd   (%rsi), %xmm2
>> +     VPCMPEQ %xmm1, %xmm2, %xmm2
>> +     vpmovmskb %xmm2, %eax
>> +     subl    $0xffff, %eax
>> +     jnz     L(first_vec)
>> +     ret
>
> What is ensuring alignment, so that the vmovd instructions cannot fault?

What do you mean?  This sequence compares the last 4 bytes using
vmovd, which loads 4 bytes and zeroes the upper 12 bytes of the
register, followed by VPCMPEQ.  If they differ, it goes to L(first_vec).
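
The head-and-tail pattern in the quoted sequence can be sketched in C as follows. This is only an illustration under stated assumptions: the names load_be32 and memcmp_4_to_7 are hypothetical, the host is assumed to be little-endian, and the sketch uses a scalar byte-swapped compare in place of the VPCMPEQ/vpmovmskb pair from the patch. For 4 <= n <= 7 the two 4-byte loads cover offsets 0..3 and n-4..n-1, so they overlap and neither reads outside the n bytes being compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 4 bytes at any alignment as a big-endian value.  Assumes a
   little-endian host; __builtin_bswap32 is a GCC/Clang builtin.  */
static uint32_t
load_be32 (const unsigned char *p)
{
  uint32_t v;
  memcpy (&v, p, sizeof v);
  return __builtin_bswap32 (v);
}

/* Hypothetical helper: memcmp-style result for 4 <= n <= 7.  */
static int
memcmp_4_to_7 (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1;
  const unsigned char *p2 = s2;
  assert (n >= 4 && n <= 7);

  /* Head: offsets 0..3.  */
  uint32_t a = load_be32 (p1);
  uint32_t b = load_be32 (p2);
  if (a == b)
    {
      /* Tail: offsets n-4..n-1.  This overlaps the head, but every
         byte it re-reads is already known to be equal.  */
      a = load_be32 (p1 + n - 4);
      b = load_be32 (p2 + n - 4);
    }
  return (a > b) - (a < b);
}
```

Because the tail load starts at n-4 and not at 4, the same two loads serve every length in the range without a further length check.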

>> +     .p2align 4
>> +L(between_2_3):
>> +     /* Load 2 bytes into registers.  */
>> +     movzwl  (%rdi), %eax
>> +     movzwl  (%rsi), %ecx
>> +     /* Compare the lowest byte.  */
>> +     cmpb    %cl, %al
>> +     jne     L(1byte_reg)
>> +     /* Load the difference of 2 bytes into EAX.  */
>> +     subl    %ecx, %eax
>> +     /* Return if 2 bytes differ.  */
>> +     jnz     L(exit)
>> +     cmpb    $2, %dl
>> +     /* Return if these are the last 2 bytes.  */
>> +     je      L(exit)
>> +     movzbl  2(%rdi), %eax
>> +     movzbl  2(%rsi), %ecx
>> +     subl    %ecx, %eax
>> +     ret
>
> Again, bswap should be faster, and if we assume that the ordering of the
> inputs is more difficult to predict than the length, it would be better
> to construct the full 24-bit value before comparing it.
>

Can you elaborate on that here?
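
As a rough sketch of what "construct the full 24-bit value before comparing it" might look like, assuming that is the intended meaning: pack the bytes of each input into one integer with the first byte most significant, then subtract once. The names pack_2_to_3 and memcmp_2_to_3 are hypothetical. The third slot always takes the last byte, p[n - 1]; for n == 2 that repeats the second byte, which cannot change the ordering, so the data-dependent branches on the byte values disappear.

```c
#include <assert.h>
#include <stddef.h>

/* Pack 2 or 3 bytes into a 24-bit value, first byte most
   significant.  The low 8 bits hold the last byte of the buffer.  */
static int
pack_2_to_3 (const unsigned char *p, size_t n)
{
  return (p[0] << 16) | (p[1] << 8) | p[n - 1];
}

/* Hypothetical helper: memcmp-style result for n == 2 or n == 3.  */
static int
memcmp_2_to_3 (const void *s1, const void *s2, size_t n)
{
  assert (n == 2 || n == 3);
  /* Both values are below 2^24, so the subtraction cannot overflow
     and its sign gives the ordering directly.  */
  return pack_2_to_3 (s1, n) - pack_2_to_3 (s2, n);
}
```

In assembly the packing could be done with a 16-bit load, a shift and bswap, and a byte load of the last element, though the exact instruction choice is left open here.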

Thanks.


-- 
H.J.

