This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86_64: memset optimized with AVX512
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Andrew Senkevich <andrew dot n dot senkevich at gmail dot com>
- Cc: libc-alpha <libc-alpha at sourceware dot org>
- Date: Fri, 11 Dec 2015 06:43:27 -0800
- Subject: Re: [PATCH] x86_64: memset optimized with AVX512
- Authentication-results: sourceware.org; auth=none
- References: <CAMXFM3v7K1bO8aUB=rVBzKqAZJuNrMBh5NuyszONtB-sZrXcgA at mail dot gmail dot com> <CAMe9rOrmhNzZTjX9XS2ONhTghCU-UXs3WmWzRWkXhWHCDZi8GQ at mail dot gmail dot com> <CAMXFM3uDEgQyfmJtVzp8QqYCEFJpYMTENhG3RMu26_R6eeAs3g at mail dot gmail dot com> <CAMe9rOrvjXqk6F-fWBy_f4R9Rw0+yzkeb5T63OXkAPiw7OXnGw at mail dot gmail dot com> <CAMXFM3vHyJXK=AWcgBSUWHE7HTHSvLER6DVx2joGdTBFEC528Q at mail dot gmail dot com> <CAMe9rOrx5Hxw8Dxrw6Cmb1WvaDr=LZpsqdZ9ytDXLAgeHTRvGw at mail dot gmail dot com> <CAMXFM3t74ne+uqX+_QFO-EVaU1dBjGsSQFuj3OpVxyav5bHHCg at mail dot gmail dot com> <CAMe9rOqXrjLFycCWcPuVW+3GtVK15cyZdqt4rz5002D2PuiFsQ at mail dot gmail dot com>
On Fri, Dec 11, 2015 at 5:53 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Dec 11, 2015 at 5:45 AM, Andrew Senkevich
> <andrew.n.senkevich@gmail.com> wrote:
>> 2015-12-11 16:39 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>> On Fri, Dec 11, 2015 at 5:26 AM, Andrew Senkevich
>>> <andrew.n.senkevich@gmail.com> wrote:
>>>> 2015-12-10 22:34 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>> On Thu, Dec 10, 2015 at 10:28 AM, Andrew Senkevich
>>>>> <andrew.n.senkevich@gmail.com> wrote:
>>>>>>>> END (MEMSET)
>>>>>>>> +libc_hidden_def (__memset_avx2)
>>>>>>>
>>>>>>> Why is this change needed? If it is needed, please submit
>>>>>>> a separate patch.
>>>>>>
>>>>>> We can avoid this change if we hide the implementation, the test and
>>>>>> the IFUNC branch under HAVE_AVX512_ASM_SUPPORT.
>>>>>>
>>>>>>> Should __memset_chk_avx512 also be provided?
>>>>>>
>>>>>> It will be the same as the AVX2 version; is it really needed?
>>>>>
>>>>> __memset_chk_avx2 calls __memset_avx2. Don't you want
>>>>> __memset_chk to call __memset_avx512, instead of __memset_avx2,
>>>>> on KNL?
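
[Editor's note: a minimal C sketch of what a `__memset_chk`-style wrapper does, to make the point above concrete. The function name and the `abort()` stand-in are illustrative; glibc's real wrapper is assembly and calls `__chk_fail` on overflow.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: a fortified memset checks the destination object size before
   delegating to the real memset.  The IFUNC question in this thread is
   which memset variant this wrapper should delegate to on KNL.  */
static void *memset_chk_sketch(void *dst, int c, size_t len, size_t dstlen)
{
    if (len > dstlen)
        abort();            /* glibc would call __chk_fail here */
    return memset(dst, c, len);
}
```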
>>>>
>>>> Oh yes, surely we need it.
>>>>
>>>> Is patch below Ok for trunk?
>>>>
>>>> 2015-12-11 Andrew Senkevich <andrew.senkevich@intel.com>
>>>>
>>>> * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
>>>> * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
>>>> * sysdeps/x86_64/multiarch/memset-avx512.S: New file.
>>>> * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
>>>> * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
>>>>
>>>
>>>> diff --git a/sysdeps/x86_64/multiarch/memset.S b/sysdeps/x86_64/multiarch/memset.S
>>>> index dbc00d2..9f16b7e 100644
>>>> --- a/sysdeps/x86_64/multiarch/memset.S
>>>> +++ b/sysdeps/x86_64/multiarch/memset.S
>>>> @@ -30,6 +30,13 @@ ENTRY(memset)
>>>>  	HAS_ARCH_FEATURE (AVX2_Usable)
>>>>  	jz	2f
>>>>  	leaq	__memset_avx2(%rip), %rax
>>>> +#ifdef HAVE_AVX512_ASM_SUPPORT
>>>> +	HAS_ARCH_FEATURE (AVX512DQ_Usable)
>>>> +	jnz	2f
>>>> +	HAS_ARCH_FEATURE (AVX512F_Usable)
>>>> +	jz	2f
>>>> +	leaq	__memset_avx512(%rip), %rax
>>>> +#endif
>>>>  2:	ret
>>>>  END(memset)
>>>>  #endif
>>>> diff --git a/sysdeps/x86_64/multiarch/memset_chk.S b/sysdeps/x86_64/multiarch/memset_chk.S
>>>> index e2abb15..5115dfb 100644
>>>> --- a/sysdeps/x86_64/multiarch/memset_chk.S
>>>> +++ b/sysdeps/x86_64/multiarch/memset_chk.S
>>>> @@ -30,6 +30,13 @@ ENTRY(__memset_chk)
>>>>  	HAS_ARCH_FEATURE (AVX2_Usable)
>>>>  	jz	2f
>>>>  	leaq	__memset_chk_avx2(%rip), %rax
>>>> +#ifdef HAVE_AVX512_ASM_SUPPORT
>>>> +	HAS_ARCH_FEATURE (AVX512DQ_Usable)
>>>> +	jnz	2f
>>>> +	HAS_ARCH_FEATURE (AVX512F_Usable)
>>>> +	jz	2f
>>>> +	leaq	__memset_chk_avx512(%rip), %rax
>>>> +#endif
>>>>  2:	ret
>>>>  END(__memset_chk)
>>>>
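
[Editor's note: the selection order implemented by the two hunks above can be sketched in C. The struct and function names here are hypothetical; in glibc the dispatch runs in assembly via `HAS_ARCH_FEATURE` macros, and the SSE2 fallback is loaded before the quoted context begins.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical feature flags standing in for HAS_ARCH_FEATURE checks.  */
struct cpu_features {
    int avx2_usable;
    int avx512f_usable;
    int avx512dq_usable;
};

/* Mirror of the dispatch in the patch: default to SSE2, upgrade to AVX2,
   and take the AVX-512 path only when AVX512F is usable but AVX512DQ is
   NOT (i.e. KNL rather than SKX-class hardware).  */
static const char *select_memset(const struct cpu_features *f)
{
    const char *impl = "__memset_sse2";
    if (!f->avx2_usable)
        return impl;                 /* jz 2f: no AVX2 */
    impl = "__memset_avx2";
    if (f->avx512dq_usable)
        return impl;                 /* jnz 2f: SKX-class CPU, keep AVX2 */
    if (!f->avx512f_usable)
        return impl;                 /* jz 2f: no AVX-512 at all */
    return "__memset_avx512";        /* KNL: AVX512F without AVX512DQ */
}
```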
>>>
>>> What is the purpose of checking AVX512DQ_Usable? To
>>> avoid using it on SKX? Is __memset_avx512 slower than
>>> __memset_avx2 on SKX?
>>
>> This implementation is only for KNL because SKX may require vzeroupper
>> on AVX/SSE transition paths.
>> This version became slower with vzeroupper on those paths, so it is
>> limited to KNL hardware.
>>
>
> Please make the following changes:
>
> 1. Change _avx512 to _avx512_no_vzeroupper.
> 2. Add a feature, Prefer_No_VZEROUPPER, to cpu-features.h, and set
> it for KNL.
> 3. Check Prefer_No_VZEROUPPER instead of AVX512DQ_Usable.
> 4. Don't check AVX512DQ_Usable nor Prefer_No_VZEROUPPER in
> ifunc-impl-list.c.
>
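
[Editor's note: the revised dispatch requested in the list above can be sketched as follows. The function and flag names are illustrative only; `Prefer_No_VZEROUPPER` would be a new bit in glibc's cpu-features, set for KNL.]

```c
#include <assert.h>
#include <string.h>

/* Sketch of the requested change: gate the AVX-512 path on a
   Prefer_No_VZEROUPPER preference bit rather than on AVX512DQ_Usable,
   and rename the variant to make the missing vzeroupper explicit.  */
static const char *select_memset_v2(int avx2_usable, int avx512f_usable,
                                    int prefer_no_vzeroupper)
{
    if (!avx2_usable)
        return "__memset_sse2";
    if (avx512f_usable && prefer_no_vzeroupper)
        return "__memset_avx512_no_vzeroupper";  /* KNL path */
    return "__memset_avx2";
}
```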
I submitted a patch to enable SLM optimization for KNL:
https://sourceware.org/ml/libc-alpha/2015-12/msg00221.html
It is on hjl/32bit/master branch. Please rebase your patch against
mine since it adds KNL optimization.
--
H.J.