This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 1 Jun 2017 10:19:09 -0700
- Subject: Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- Authentication-results: sourceware.org; auth=none
- References: <20170601154519.GB14526@lucon.org> <33f989bd-5357-086a-27a7-7437718f5ac3@redhat.com>
On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>> +L(between_4_7):
>> + vmovd (%rdi), %xmm1
>> + vmovd (%rsi), %xmm2
>> + VPCMPEQ %xmm1, %xmm2, %xmm2
>> + vpmovmskb %xmm2, %eax
>> + subl $0xffff, %eax
>> + jnz L(first_vec)
>
> Is this really faster than two 32-bit bswaps followed by a sub?
Can you elaborate on how to use bswap here?
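
For readers following the thread, one possible reading of the bswap suggestion for the 4-to-7-byte case can be sketched in C. This is an interpretation, not code from the patch or from the reviewer: load the first and last 4 bytes of each buffer, convert them to big-endian order (the job bswap does on x86-64), and combine them into one 64-bit value so that a single unsigned comparison yields the memcmp ordering. The names load_be32 and memcmp_4_7_bswap are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 4 bytes at p (any alignment) as a big-endian 32-bit value.
   On little-endian x86-64 this corresponds to a 32-bit load followed
   by bswap; the shifts are just the portable spelling.  */
static uint32_t
load_be32 (const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Hypothetical helper, not taken from the patch: memcmp for
   4 <= n <= 7 using two overlapping big-endian loads per buffer.  */
static int
memcmp_4_7_bswap (const void *s1, const void *s2, size_t n)
{
  assert (n >= 4 && n <= 7);
  const unsigned char *a = s1, *b = s2;
  /* High half: bytes [0, 4).  Low half: bytes [n - 4, n).  */
  uint64_t x = ((uint64_t) load_be32 (a) << 32) | load_be32 (a + n - 4);
  uint64_t y = ((uint64_t) load_be32 (b) << 32) | load_be32 (b + n - 4);
  /* Unsigned big-endian order matches byte-wise lexicographic order.  */
  return (x > y) - (x < y);
}
```

The two loads overlap when n < 8, which is harmless: if the first 4 bytes are equal, the overlapping bytes are equal too, so the low half decides the result.
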
>> + leaq -4(%rdi, %rdx), %rdi
>> + leaq -4(%rsi, %rdx), %rsi
>> + vmovd (%rdi), %xmm1
>> + vmovd (%rsi), %xmm2
>> + VPCMPEQ %xmm1, %xmm2, %xmm2
>> + vpmovmskb %xmm2, %eax
>> + subl $0xffff, %eax
>> + jnz L(first_vec)
>> + ret
>
> What is ensuring alignment, so that the vmovd instructions cannot fault?
What do you mean? This sequence compares the last 4 bytes using
vmovd, which loads 4 bytes and zeroes out the high 12 bytes of the
register, followed by VPCMPEQ. If they aren't the same, it jumps to
L(first_vec).
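
A minimal C sketch of the sequence described above, covering only the equality test and not the difference computation done at L(first_vec). The helper name equal_4_7 is hypothetical, and memcpy stands in for the 4-byte vmovd loads, which (to the best of my knowledge) carry no alignment requirement. The point it illustrates is that for 4 <= n <= 7 the load at offset 0 and the load at offset n - 4 together cover every byte and never read outside the n-byte range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return 1 if the first n bytes of s1 and s2
   are equal, for 4 <= n <= 7, using two overlapping 4-byte loads.  */
static int
equal_4_7 (const void *s1, const void *s2, size_t n)
{
  assert (n >= 4 && n <= 7);
  const unsigned char *a = s1, *b = s2;
  uint32_t a_head, b_head, a_tail, b_tail;

  /* First load: bytes [0, 4).  */
  memcpy (&a_head, a, 4);
  memcpy (&b_head, b, 4);
  /* Second load: bytes [n - 4, n), overlapping the first when n < 8.  */
  memcpy (&a_tail, a + n - 4, 4);
  memcpy (&b_tail, b + n - 4, 4);

  return a_head == b_head && a_tail == b_tail;
}
```
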
>> + .p2align 4
>> +L(between_2_3):
>> + /* Load 2 bytes into registers. */
>> + movzwl (%rdi), %eax
>> + movzwl (%rsi), %ecx
>> + /* Compare the lowest byte. */
>> + cmpb %cl, %al
>> + jne L(1byte_reg)
>> + /* Load the difference of 2 bytes into EAX. */
>> + subl %ecx, %eax
>> + /* Return if 2 bytes differ. */
>> + jnz L(exit)
>> + cmpb $2, %dl
>> + /* Return if these are the last 2 bytes. */
>> + je L(exit)
>> + movzbl 2(%rdi), %eax
>> + movzbl 2(%rsi), %ecx
>> + subl %ecx, %eax
>> + ret
>
> Again, bswap should be faster, and if we assume that the ordering of the
> inputs is more difficult to predict than the length, it would be better
> to construct the full 24-bit value before comparing it.
>
Can you elaborate on that here?
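
One way to read the "full 24-bit value" suggestion, sketched in C under the assumption that it means building a big-endian integer from the first two bytes and the last byte, then subtracting once with no data-dependent branch. The names load_be24_overlap and memcmp_2_3 are hypothetical and do not appear in the patch; in assembly the first two bytes would presumably come from movzwl plus bswap and a shift, and the last byte from movzbl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a 24-bit big-endian value from bytes 0, 1 and n - 1.
   For n == 2 the last byte repeats byte 1, which does not change
   the ordering because it repeats in both operands.  */
static uint32_t
load_be24_overlap (const unsigned char *p, size_t n)
{
  return ((uint32_t) p[0] << 16) | ((uint32_t) p[1] << 8) | p[n - 1];
}

/* Hypothetical helper: memcmp for n == 2 or n == 3.  */
static int
memcmp_2_3 (const void *s1, const void *s2, size_t n)
{
  assert (n == 2 || n == 3);
  uint32_t x = load_be24_overlap (s1, n);
  uint32_t y = load_be24_overlap (s2, n);
  /* Both values fit in 24 bits, so the signed difference cannot
     overflow and its sign is the memcmp result.  */
  return (int) x - (int) y;
}
```
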
Thanks.
--
H.J.