Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On 06/01/2017 05:45 PM, H.J. Lu wrote:
> +L(between_4_7):
> +	vmovd	(%rdi), %xmm1
> +	vmovd	(%rsi), %xmm2
> +	VPCMPEQ %xmm1, %xmm2, %xmm2
> +	vpmovmskb %xmm2, %eax
> +	subl    $0xffff, %eax
> +	jnz	L(first_vec)

Is this really faster than two 32-bit bswaps followed by a sub?
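
Something along these lines, written as a C sketch only (the helper name
and the use of __builtin_bswap32 are just for illustration; the actual
code would of course stay in assembly):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int
memcmp_between_4_7 (const unsigned char *s1, const unsigned char *s2,
                    size_t n)
{
  uint32_t a, b;
  int64_t diff;

  /* First four bytes, byte-swapped so that a plain integer
     comparison gives the lexicographic (memcmp) order.  */
  memcpy (&a, s1, 4);
  memcpy (&b, s2, 4);
  diff = (int64_t) __builtin_bswap32 (a) - (int64_t) __builtin_bswap32 (b);
  if (diff != 0)
    return diff < 0 ? -1 : 1;

  /* Overlapping load of the last four bytes, as in the patch;
     in bounds because n >= 4.  */
  memcpy (&a, s1 + n - 4, 4);
  memcpy (&b, s2 + n - 4, 4);
  diff = (int64_t) __builtin_bswap32 (a) - (int64_t) __builtin_bswap32 (b);
  return diff < 0 ? -1 : diff > 0 ? 1 : 0;
}

The overlap between the two loads does not matter: if the second
comparison is reached, the overlapping bytes are already known to be
equal, so they cannot affect the sign.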

> +	leaq	-4(%rdi, %rdx), %rdi
> +	leaq	-4(%rsi, %rdx), %rsi
> +	vmovd	(%rdi), %xmm1
> +	vmovd	(%rsi), %xmm2
> +	VPCMPEQ %xmm1, %xmm2, %xmm2
> +	vpmovmskb %xmm2, %eax
> +	subl    $0xffff, %eax
> +	jnz	L(first_vec)
> +	ret

What ensures that these vmovd loads cannot fault?  They are unaligned and
read four bytes each.

> +	.p2align 4
> +L(between_2_3):
> +	/* Load 2 bytes into registers.  */
> +	movzwl	(%rdi), %eax
> +	movzwl	(%rsi), %ecx
> +	/* Compare the lowest byte.  */
> +	cmpb	%cl, %al
> +	jne	L(1byte_reg)
> +	/* Load the difference of 2 bytes into EAX.  */
> +	subl	%ecx, %eax
> +	/* Return if 2 bytes differ.  */
> +	jnz	L(exit)
> +	cmpb	$2, %dl
> +	/* Return if these are the last 2 bytes.  */
> +	je	L(exit)
> +	movzbl	2(%rdi), %eax
> +	movzbl	2(%rsi), %ecx
> +	subl	%ecx, %eax
> +	ret

Again, bswap should be faster here.  And if we assume that the ordering of
the inputs is more difficult to predict than the length, it would be better
to construct the full 24-bit value first and compare it in one step, instead
of branching on the individual bytes.
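
In C terms, a sketch of what I mean (names illustrative only, reusing the
includes from the sketch above):

static int
memcmp_between_2_3 (const unsigned char *s1, const unsigned char *s2,
                    size_t n)
{
  /* Build big-endian values so that the first byte is the most
     significant one; include the third byte only when n == 3.  */
  uint32_t a = ((uint32_t) s1[0] << 16) | ((uint32_t) s1[1] << 8);
  uint32_t b = ((uint32_t) s2[0] << 16) | ((uint32_t) s2[1] << 8);
  if (n == 3)
    {
      a |= s1[2];
      b |= s2[2];
    }
  /* Both values are below 1 << 24, so the difference fits in int and
     its sign is the memcmp result.  */
  return (int) a - (int) b;
}

This way the only branch is on the length, which under the assumption
above is the more predictable of the two.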

Thanks,
Florian

