This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Core i7 with unaligned avx instruction
- From: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Ling Ma <ling dot ma dot program at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Fri, 12 Jul 2013 10:09:03 +0400
- Subject: Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Core i7 with unaligned avx instruction
- References: <1373547096-8095-1-git-send-email-ling dot ma dot program at gmail dot com> <CAHjhQ91fVakxKNkEniz0AL-Srn3kNtLf+5AaB+VHozy5_z5zeA at mail dot gmail dot com> <20130712032333 dot GA5839 at domone dot PAOCY>
>> We need to check performance on Core i7 with AVX before installing this.
>> As far as I understand, you checked on Haswell only? But AVX works on
>> more architectures than AVX2.
> Using AVX for memcpy before Haswell is pointless: stores and loads are
> 128-bit anyway, and going 256-bit only complicates the scheduler.
But we can't name it the avx2 version and check the avx2 flag if it doesn't use AVX2.
Probably we should introduce a Slow_AVX flag and set it for pre-Haswell CPUs,
if you are sure that using AVX before Haswell is pointless.
--
Liubov
On Fri, Jul 12, 2013 at 7:23 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jul 11, 2013 at 05:59:30PM +0400, Liubov Dmitrieva wrote:
>> We need to check performance on Core i7 with AVX before installing this.
>> As far as I understand, you checked on Haswell only? But AVX works on
>> more architectures than AVX2.
> Using AVX for memcpy before Haswell is pointless: stores and loads are
> 128-bit anyway, and going 256-bit only complicates the scheduler.
>>
>> You forgot to fix the Copyright year: s/2010/2013
>>
>> --
>> Liubov
>>
>> On Thu, Jul 11, 2013 at 4:51 PM, <ling.ma.program@gmail.com> wrote:
>> > From: Ma Ling <ling.ml@alibaba-inc.com>
>> >
>> > We manage to avoid branch instructions and force the destination to be
>> > aligned using AVX instructions. We modified gcc.403 so that it measures
>> > only the memcpy function; the gcc.403 benchmarks indicate this version
>> > improves performance by 4% to 16% across different cases.
>> >
>> > Best Regards
>> > Ling
>> > ---
>> > In this version we did clean-up work, thanks Liubov.
>> >
>> > sysdeps/x86_64/multiarch/Makefile | 5 +-
>> > sysdeps/x86_64/multiarch/ifunc-defines.sym | 2 +
>> > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 11 +
>> > sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S | 409 +++++++++++++++++++++++
>> > sysdeps/x86_64/multiarch/memmove-avx-unaligned.S | 4 +
>> > sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S | 4 +
>> > 6 files changed, 433 insertions(+), 2 deletions(-)
> As there is no ifunc change, it will not be called at all.
>
>> > +ENTRY (MEMCPY)
>> > + vzeroupper
> Not needed.
>
>> > +L(256bytesormore):
>> > +
>> > +#ifdef USE_AS_MEMMOVE
>> > + cmp %rsi, %rdi
>> > + jae L(copy_backward)
>> > +#endif
>
> Test with the following condition:
> (uint64_t)((src - dest) - n) < 2*n
> It makes one predictable branch instead of two unpredictable branches.
>
> Also alias memmove_avx to memcpy_avx. As they differ only when copying 256+
> bytes, the performance penalty of this check can be paid for by halving
> memcpy icache usage alone.
>
>
>
>> > + mov %rdx, %rcx
>> > + rep movsb
>> > + ret
>> > +
> Did Haswell get an optimized movsb? If so, in which size range does it work well?
>