This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>, Andrew Senkevich <andrew dot n dot senkevich at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 15 Jun 2017 19:15:00 -0700
- Subject: Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- Authentication-results: sourceware.org; auth=none
- References: <20170601154519.GB14526@lucon.org> <20170615123415.GA22917@domone.kolej.mff.cuni.cz>
On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>> Optimize x86-64 memcmp/wmemcmp with AVX2. It uses vector compare as
>> much as possible. It is as fast as SSE4 memcmp for size <= 16 bytes
>> and up to 2X faster for size > 16 bytes on Haswell and Skylake. Select
>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>> AVX unaligned load is fast.
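(A minimal C sketch of that selection idea, not glibc's actual ifunc machinery: glibc's resolver also consults internal feature bits for preferring vzeroupper and for fast unaligned AVX loads, which have no public equivalent, and memcmp_avx2 below is a hypothetical name.)

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical AVX2 variant; the real one is an internal glibc
       symbol picked by the ifunc resolver.  */
    extern int memcmp_avx2 (const void *, const void *, size_t);

    static int
    (*resolve_memcmp (void)) (const void *, const void *, size_t)
    {
      __builtin_cpu_init ();
      /* GCC builtin CPU test; glibc itself uses its own feature bits.  */
      if (__builtin_cpu_supports ("avx2"))
        return memcmp_avx2;
      return memcmp;
    }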
>>
>> Key features:
>>
>> 1. Use overlapping compare to avoid branch (see the C sketch after this list).
>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>> bytes for wmemcmp.
>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
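(To illustrate feature 1 above, a rough C sketch with AVX2 intrinsics; VEC_SIZE is fixed at 32 and the function name is illustrative, not from the patch. For any VEC_SIZE <= n <= 2 * VEC_SIZE, comparing the first 32 bytes and the last 32 bytes covers the whole range, so no branch on the exact size and no scalar tail loop is needed.)

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Feature 1 sketch: two overlapping 32-byte compares cover
       [0, 32) and [n - 32, n) for 32 <= n <= 64.  */
    static bool
    equal_up_to_2x_vec (const unsigned char *s1, const unsigned char *s2,
                        size_t n)
    {
      __m256i a0 = _mm256_loadu_si256 ((const __m256i *) s1);
      __m256i b0 = _mm256_loadu_si256 ((const __m256i *) s2);
      __m256i a1 = _mm256_loadu_si256 ((const __m256i *) (s1 + n - 32));
      __m256i b1 = _mm256_loadu_si256 ((const __m256i *) (s2 + n - 32));
      unsigned m0 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a0, b0));
      unsigned m1 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a1, b1));
      return (m0 & m1) == 0xffffffffu;  /* all covered bytes equal */
    }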
>>
>> Any comments?
>>
> I have some comments; it's similar to one of my previous patches.
>
>> + cmpq $(VEC_SIZE * 2), %rdx
>> + ja L(more_2x_vec)
>> +
> This is an unnecessary branch; it is likely that there is a difference in the
> first 16 bytes regardless of size. Move the size test...
>> +L(last_2x_vec):
>> + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */
>> + vmovdqu (%rsi), %ymm2
>> + VPCMPEQ (%rdi), %ymm2, %ymm2
>> + vpmovmskb %ymm2, %eax
>> + subl $VEC_MASK, %eax
>> + jnz L(first_vec)
> here.
>
If we do that, the size check becomes redundant in
/* Less than 4 * VEC. */
cmpq $VEC_SIZE, %rdx
jbe L(last_vec)
cmpq $(VEC_SIZE * 2), %rdx
jbe L(last_2x_vec)
L(last_4x_vec):
Of course, we can duplicate these blocks to avoid the size check.
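(For clarity, a C rendering of the ordering Ondřej suggests; all names are mine, and it assumes VEC_SIZE <= n <= 2 * VEC_SIZE, since smaller sizes take the less_vec path. The first-vector compare runs before any size branch, and the overlapping last-vector compare absorbs the tail.)

    #include <immintrin.h>
    #include <stddef.h>

    #define VEC_SIZE 32

    /* Equality mask of the first VEC_SIZE bytes: bit i set means
       s1[i] == s2[i].  */
    static unsigned
    first_vec_eq_mask (const unsigned char *s1, const unsigned char *s2)
    {
      __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
      __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
      return _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
    }

    static int
    compare_first_then_dispatch (const unsigned char *s1,
                                 const unsigned char *s2, size_t n)
    {
      /* Compare the first vector before any size branch: a difference
         is likely in the first VEC_SIZE bytes whatever the size is.  */
      unsigned m = first_vec_eq_mask (s1, s2);
      if (m != 0xffffffffu)
        {
          unsigned i = __builtin_ctz (~m);
          return (int) s1[i] - (int) s2[i];
        }
      /* Only the tail remains.  This is the block that would have to be
         duplicated to drop the later size checks.  */
      m = first_vec_eq_mask (s1 + n - VEC_SIZE, s2 + n - VEC_SIZE);
      if (m != 0xffffffffu)
        {
          unsigned i = __builtin_ctz (~m);
          return (int) s1[n - VEC_SIZE + i] - (int) s2[n - VEC_SIZE + i];
        }
      return 0;
    }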
>
>> +L(first_vec):
>> + /* A byte or int32 is different within 16 or 32 bytes. */
>> + bsfl %eax, %ecx
>> +# ifdef USE_AS_WMEMCMP
>> + xorl %eax, %eax
>> + movl (%rdi, %rcx), %edx
>> + cmpl (%rsi, %rcx), %edx
>> +L(wmemcmp_return):
>> + setl %al
>> + negl %eax
>> + orl $1, %eax
>> +# else
>> + movzbl (%rdi, %rcx), %eax
>> + movzbl (%rsi, %rcx), %edx
>> + sub %edx, %eax
>> +# endif
>> + VZEROUPPER
>> + ret
>> +
>
> Loading bytes based on the result of bsf is slow; an alternative is to derive
> the result from the vector tests. I could avoid it using tests like the ones
> below, but I haven't measured or tested them yet.
>
> vmovdqu (%rdi), %ymm3
>
> VPCMPGTQ %ymm2, %ymm3, %ymm4
> VPCMPGTQ %ymm3, %ymm2, %ymm5
> vpmovmskb %ymm4, %eax
> vpmovmskb %ymm5, %edx
> neg %eax
> neg %edx
> lzcnt %eax, %eax
> lzcnt %edx, %edx
> sub %edx, %eax
> ret
Andrew, can you give it a try?
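(For reference, the bsf-based path being compared against, as a C sketch of the patch's first_vec code for the memcmp case; the helper name is mine, and eq_mask stands for the vpmovmskb result of the byte-equality compare.)

    /* ~eq_mask has a bit set for every differing byte, so bsf/tzcnt of
       it gives the offset of the first difference.  */
    static int
    first_vec_result (const unsigned char *s1, const unsigned char *s2,
                      unsigned eq_mask)
    {
      unsigned i = __builtin_ctz (~eq_mask);  /* bsf */
      /* The two loads below depend on the bsf result; this serialized
         load-after-bsf is the latency Ondřej's branch-free alternative
         above tries to avoid.  */
      return (int) s1[i] - (int) s2[i];
    }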
>
>
>> + .p2align 4
>> +L(less_vec):
>> +# ifdef USE_AS_WMEMCMP
>> + /* It can only be 0, 4, 8, 12, 16, 20, 24, 28 bytes. */
>> + cmpb $4, %dl
>> + je L(4)
>> + jb L(zero)
>> +# else
>> + cmpb $1, %dl
>> + je L(1)
>> + jb L(zero)
>> + cmpb $4, %dl
>> + jb L(between_2_3)
>> + cmpb $8, %dl
>> + jb L(between_4_7)
>> +# endif
>> + cmpb $16, %dl
>> + jae L(between_16_31)
>
> I am not entirely sure about this, as it depends on whether one calls memcmp
> with fixed sizes in a loop or not. If the size is unpredictable, first test
> whether the loads cross a page boundary and handle that as a special case;
> if not, do a 32-byte comparison and return 0 if the first differing byte
> lies beyond the size.
There are two loads from two different sources, so we would need two
address checks before using the 32-byte vector comparison; I don't know
whether it would be faster.
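(A rough C sketch of Ondřej's over-read idea with the double check this reply describes; names are mine and 4 KiB pages are assumed. A 32-byte load from p cannot cross a page when ((uintptr_t) p & 4095) <= 4096 - 32, and that must hold for both sources.)

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* For n <= 32: if neither 32-byte load can cross into the next
       page, compare full vectors and ignore differences at offsets
       >= n.  */
    static int
    small_memcmp_overread (const unsigned char *s1,
                           const unsigned char *s2, size_t n)
    {
      if (((uintptr_t) s1 & 4095) > 4096 - 32
          || ((uintptr_t) s2 & 4095) > 4096 - 32)
        {
          /* A load would cross a page; fall back to a byte loop here
             (the patch dispatches on size instead).  */
          for (size_t i = 0; i < n; i++)
            if (s1[i] != s2[i])
              return (int) s1[i] - (int) s2[i];
          return 0;
        }
      __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
      __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
      unsigned diff
        = ~(unsigned) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
      /* Return 0 when the first differing byte is beyond the size.  */
      if (diff == 0 || (size_t) __builtin_ctz (diff) >= n)
        return 0;
      unsigned i = __builtin_ctz (diff);
      return (int) s1[i] - (int) s2[i];
    }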
--
H.J.