This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
> Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
> much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
> and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
> AVX unaligned load is fast.
> 
> Key features:
> 
> 1. Use overlapping compare to avoid branch.
> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>    bytes for wmemcmp.
> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
> 
> Any comments?
>
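
For reference, the 4 * VEC_SIZE main loop described in point 4 corresponds roughly to this C model (a sketch assuming AVX2; the names are illustrative and not taken from the patch):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: compare one VEC_SIZE (32-byte) block and return
   its vpcmpeqb result.  */
static __m256i
cmp_vec (const unsigned char *s1, const unsigned char *s2)
{
  return _mm256_cmpeq_epi8 (_mm256_loadu_si256 ((const __m256i *) s1),
                            _mm256_loadu_si256 ((const __m256i *) s2));
}

/* Rough model of the main loop: check 4 * VEC_SIZE = 128 bytes per
   iteration, folding the four compares into a single test.  Aligning
   the first memory area and handling the tail are omitted; n is
   assumed to be a multiple of 128 here.  */
static int
loop_4x_vec_has_diff (const unsigned char *s1, const unsigned char *s2,
                      size_t n)
{
  for (size_t i = 0; i < n; i += 128)
    {
      __m256i all = _mm256_and_si256
        (_mm256_and_si256 (cmp_vec (s1 + i, s2 + i),
                           cmp_vec (s1 + i + 32, s2 + i + 32)),
         _mm256_and_si256 (cmp_vec (s1 + i + 64, s2 + i + 64),
                           cmp_vec (s1 + i + 96, s2 + i + 96)));
      if ((uint32_t) _mm256_movemask_epi8 (all) != 0xffffffffu)
        return 1;   /* some byte differs within these 128 bytes */
    }
  return 0;
}
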
I have some comments; it's similar to one of my previous patches.

> +     cmpq    $(VEC_SIZE * 2), %rdx
> +     ja      L(more_2x_vec)
> +
This is an unnecessary branch; it's likely that there is a difference in the
first 16 bytes regardless of size. Move the size test...
> +L(last_2x_vec):
> +     /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
> +     vmovdqu (%rsi), %ymm2
> +     VPCMPEQ (%rdi), %ymm2, %ymm2
> +     vpmovmskb %ymm2, %eax
> +     subl    $VEC_MASK, %eax
> +     jnz     L(first_vec)
here.
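
In C terms, the suggested order is roughly the following sketch (assuming AVX2 and BMI1, VEC_SIZE of 32, and 32 <= n <= 64 so that both loads stay in bounds; the function names are illustrative, not from the patch):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Bitmask of differing byte positions within one 32-byte block.  */
static uint32_t
vec_diff_mask (const unsigned char *s1, const unsigned char *s2)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  return ~(uint32_t) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
}

/* Compare the first VEC unconditionally; only when it matches is the
   size needed, for the overlapping compare of the last VEC bytes
   (point 1 of the description).  Valid for 32 <= n <= 64.  */
static int
memcmp_2x_vec_or_less (const unsigned char *s1, const unsigned char *s2,
                       size_t n)
{
  uint32_t m = vec_diff_mask (s1, s2);            /* no size test before this */
  if (m != 0)
    {
      uint32_t pos = _tzcnt_u32 (m);              /* first differing byte */
      return s1[pos] - s2[pos];
    }
  m = vec_diff_mask (s1 + n - 32, s2 + n - 32);   /* overlaps when n < 64 */
  if (m != 0)
    {
      uint32_t pos = _tzcnt_u32 (m);
      return s1[n - 32 + pos] - s2[n - 32 + pos];
    }
  return 0;
}

With that order the common case, a difference within the first VEC_SIZE bytes, never pays for the size branch.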



> +L(first_vec):
> +	/* A byte or int32 is different within 16 or 32 bytes.  */
> +	bsfl	%eax, %ecx
> +# ifdef USE_AS_WMEMCMP
> +	xorl	%eax, %eax
> +	movl	(%rdi, %rcx), %edx
> +	cmpl	(%rsi, %rcx), %edx
> +L(wmemcmp_return):
> +	setl	%al
> +	negl	%eax
> +	orl	$1, %eax
> +# else
> +	movzbl	(%rdi, %rcx), %eax
> +	movzbl	(%rsi, %rcx), %edx
> +	sub	%edx, %eax
> +# endif
> +	VZEROUPPER
> +	ret
> +

Loading bytes depending on the result of bsf is slow; an alternative is to
compute the result from the vector tests themselves. I could avoid it with
something like the following, but I haven't measured or tested it yet.

vmovdqu (%rdi), %ymm3

VPCMPGTQ %ymm2, %ymm3, %ymm4
VPCMPGTQ %ymm3, %ymm2, %ymm5
vpmovmskb %ymm4, %eax
vpmovmskb %ymm5, %edx
neg %eax
neg %edx
lzcnt %eax, %eax
lzcnt %edx, %edx
sub %edx, %eax
ret
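
In scalar terms the idea looks like this (a sketch that uses tzcnt-style counting rather than the neg/lzcnt sequence above, and assumes BMI1; the helper name is illustrative):

#include <immintrin.h>
#include <stdint.h>

/* Given the two movemasks from the greater-than compares (gt has a
   bit set where s1 > s2, lt where s1 < s2), pick the sign from
   whichever difference comes first, without reloading the bytes.
   _tzcnt_u32 (0) == 32, so a missing difference always loses and
   equal inputs give 32 - 32 = 0.  */
static int
result_from_masks (uint32_t gt, uint32_t lt)
{
  return (int) _tzcnt_u32 (lt) - (int) _tzcnt_u32 (gt);
}

Note that for memcmp the byte compare feeding these masks has to be unsigned, which vpcmpgtb is not, so the vector side would need a bias or a min/max trick; for wmemcmp a signed dword compare matches the setl used in the patch.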



> +	.p2align 4
> +L(less_vec):
> +# ifdef USE_AS_WMEMCMP
> +	/* It can only be 0, 4, 8, 12, 16, 20, 24, 28 bytes.  */
> +	cmpb	$4, %dl
> +	je	L(4)
> +	jb	L(zero)
> +# else
> +	cmpb	$1, %dl
> +	je	L(1)
> +	jb	L(zero)
> +	cmpb	$4, %dl
> +	jb	L(between_2_3)
> +	cmpb	$8, %dl
> +	jb	L(between_4_7)
> +# endif
> +	cmpb	$16, %dl
> +	jae	L(between_16_31)

I am not entirely sure about this, as it depends on whether memcmp is called
with fixed sizes in a loop or not. If the size is unpredictable, first test
whether the loads cross a page boundary and treat that as a special case; if
not, do a 32-byte comparison and return 0 when the first differing byte lies
beyond the requested size.
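
As a C sketch of that shape (assuming AVX2 and BMI1; the name is illustrative, and it is only safe after the page-boundary check above, since it always reads 32 bytes from both inputs):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Handle any n < 32 with a single 32-byte compare: find the first
   differing byte and ignore it if it lies at or beyond n.  */
static int
memcmp_less_vec (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  uint32_t neq = ~(uint32_t) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  uint32_t pos = _tzcnt_u32 (neq);   /* 32 when the blocks are equal */
  if (pos >= n)
    return 0;                        /* difference is past the requested size */
  return s1[pos] - s2[pos];
}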

