This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- From: Florian Weimer <fweimer at redhat dot com>
- To: "H.J. Lu" <hongjiu dot lu at intel dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Thu, 1 Jun 2017 18:41:48 +0200
- Subject: Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- References: <20170601154519.GB14526@lucon.org>
On 06/01/2017 05:45 PM, H.J. Lu wrote:
> +L(between_4_7):
> + vmovd (%rdi), %xmm1
> + vmovd (%rsi), %xmm2
> + VPCMPEQ %xmm1, %xmm2, %xmm2
> + vpmovmskb %xmm2, %eax
> + subl $0xffff, %eax
> + jnz L(first_vec)

Is this really faster than two 32-bit bswaps followed by a sub?

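A rough C sketch of the bswap-and-subtract alternative, for comparison with the quoted vector sequence. It assumes a little-endian host (x86-64) and the GCC/Clang builtin __builtin_bswap32; the helper names are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: load 4 bytes and byte-swap
   them so that unsigned integer order matches memcmp (byte) order.
   Assumes a little-endian host such as x86-64. */
static uint32_t load_be32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);       /* load with no alignment requirement */
    return __builtin_bswap32(v);   /* GCC/Clang builtin */
}

/* Compare exactly 4 bytes; returns <0, 0 or >0 like memcmp. */
static int cmp4_bswap(const void *s1, const void *s2)
{
    uint32_t a = load_be32(s1);
    uint32_t b = load_be32(s2);
    /* A bare 32-bit subtraction can wrap into the sign bit, so derive
       the result from unsigned comparisons (in assembly, the carry
       flag of the sub carries the same information). */
    return (a > b) - (a < b);
}

/* Lengths 4..7: compare the first 4 bytes, then the last 4 bytes.
   The two windows overlap, but the overlap is already known equal
   when the second comparison runs. */
static int cmp_between_4_7(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1;
    const unsigned char *p2 = s2;
    int r = cmp4_bswap(p1, p2);
    if (r != 0)
        return r;
    return cmp4_bswap(p1 + n - 4, p2 + n - 4);
}
```

Whether this beats the vmovd/VPCMPEQ/vpmovmskb sequence would need to be measured; the sketch only shows that the scalar variant needs no vector registers.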
> + leaq -4(%rdi, %rdx), %rdi
> + leaq -4(%rsi, %rdx), %rsi
> + vmovd (%rdi), %xmm1
> + vmovd (%rsi), %xmm2
> + VPCMPEQ %xmm1, %xmm2, %xmm2
> + vpmovmskb %xmm2, %eax
> + subl $0xffff, %eax
> + jnz L(first_vec)
> + ret

What is ensuring alignment, so that the vmovd instructions cannot fault?

> + .p2align 4
> +L(between_2_3):
> + /* Load 2 bytes into registers. */
> + movzwl (%rdi), %eax
> + movzwl (%rsi), %ecx
> + /* Compare the lowest byte. */
> + cmpb %cl, %al
> + jne L(1byte_reg)
> + /* Load the difference of 2 bytes into EAX. */
> + subl %ecx, %eax
> + /* Return if 2 bytes differ. */
> + jnz L(exit)
> + cmpb $2, %dl
> + /* Return if these are the last 2 bytes. */
> + je L(exit)
> + movzbl 2(%rdi), %eax
> + movzbl 2(%rsi), %ecx
> + subl %ecx, %eax
> + ret

Again, bswap should be faster, and if we assume that the ordering of the
inputs is more difficult to predict than the length, it would be better
to construct the full 24-bit value before comparing it.
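
One possible reading of "construct the full 24-bit value", sketched in C. The function names are hypothetical and this is not the patch's code. For a length of 2 the last byte is simply read again as the third component, so the packing has no branch on either the length or the byte ordering.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: pack n bytes (n is 2 or 3)
   into a big-endian 24-bit value.  p[n - 1] is the last valid byte;
   for n == 2 it repeats p[1], which cannot change the comparison
   result because p[1] has already been accounted for. */
static uint32_t pack_2_3(const unsigned char *p, size_t n)
{
    return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | p[n - 1];
}

/* Compare 2 or 3 bytes; returns <0, 0 or >0 like memcmp. */
static int cmp_between_2_3(const void *s1, const void *s2, size_t n)
{
    uint32_t a = pack_2_3(s1, n);
    uint32_t b = pack_2_3(s2, n);
    /* Both values fit in 24 bits, so a plain subtraction cannot
       overflow and its sign is the memcmp result. */
    return (int)a - (int)b;
}
```

Only the final subtraction depends on the data, so there is no conditional jump whose direction depends on which input is larger.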

Thanks,
Florian