This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
- Cc: Ling Ma <ling dot ma dot program at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Fri, 12 Jul 2013 05:23:33 +0200
- Subject: Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Corei7 with unaligned avx instruction
- References: <1373547096-8095-1-git-send-email-ling dot ma dot program at gmail dot com> <CAHjhQ91fVakxKNkEniz0AL-Srn3kNtLf+5AaB+VHozy5_z5zeA at mail dot gmail dot com>
On Thu, Jul 11, 2013 at 05:59:30PM +0400, Liubov Dmitrieva wrote:
> We need to check performance for Core i7 with AVX before installing this.
> As far as I understood you checked on Haswell only? But AVX works for
> more architectures than AVX2.
Using AVX for memcpy before Haswell is pointless; stores and loads are
128-bit anyway, and by going 256-bit you only complicate the scheduler.
>
> You missed fixing the Copyright: s/2010/2013
>
> --
> Liubov
>
> On Thu, Jul 11, 2013 at 4:51 PM, <ling.ma.program@gmail.com> wrote:
> > From: Ma Ling <ling.ml@alibaba-inc.com>
> >
> > We manage to avoid branch instructions and force the destination to be aligned
> > with AVX instructions. We modified gcc.403 so that we measure only the memcpy
> > function; the gcc.403 benchmarks indicate this version improves performance by
> > 4% to 16% on different cases.
> >
> > Best Regards
> > Ling
> > ---
> > In this version we did clean-up work, thanks Liubov.
> >
> > sysdeps/x86_64/multiarch/Makefile | 5 +-
> > sysdeps/x86_64/multiarch/ifunc-defines.sym | 2 +
> > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 11 +
> > sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S | 409 +++++++++++++++++++++++
> > sysdeps/x86_64/multiarch/memmove-avx-unaligned.S | 4 +
> > sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S | 4 +
> > 6 files changed, 433 insertions(+), 2 deletions(-)
As the ifunc selector is not changed, the new code will not be called at all.
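For context, a minimal C sketch of how an IFUNC resolver picks an implementation
at load time; the names and the __builtin_cpu_supports check are illustrative
stand-ins, not the glibc selector, but they show why a routine the resolver never
returns is simply dead code in the library:

#include <stddef.h>
#include <stdio.h>

/* Two stand-in implementations; in the patch these would be the
   existing SSE2 routine and the new memcpy-avx-unaligned.S.  */
static void *
memcpy_sse2_impl (void *d, const void *s, size_t n)
{
  unsigned char *dp = d;
  const unsigned char *sp = s;
  while (n--)
    *dp++ = *sp++;
  return d;
}

static void *
memcpy_avx_impl (void *d, const void *s, size_t n)
{
  /* Placeholder; the real routine would use 256-bit loads/stores.  */
  return memcpy_sse2_impl (d, s, n);
}

/* The resolver runs once at load time and decides which routine the
   my_memcpy symbol ends up pointing at.  A variant the resolver never
   returns is built but never executed.  */
static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? memcpy_avx_impl : memcpy_sse2_impl;
}

void *my_memcpy (void *dest, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_memcpy")));

int
main (void)
{
  char buf[8];
  my_memcpy (buf, "hello", 6);
  puts (buf);
  return 0;
}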
> > +ENTRY (MEMCPY)
> > + vzeroupper
Not needed.
> > +L(256bytesormore):
> > +
> > +#ifdef USE_AS_MEMMOVE
> > + cmp %rsi, %rdi
> > + jae L(copy_backward)
> > +#endif
Test by the following condition:
(uint64_t)((src - dest)-n) < 2*n
It makes the branch predictable instead of two unpredictable branches.
Also alias memmove_avx to memcpy_avx. As they differ only when you copy 256+
bytes, the performance penalty of this check can be paid for by halving
memcpy icache usage alone.
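Roughly, in C, a hedged sketch of that single-comparison dispatch (all names are
made up; it uses the common form diff + n < 2*n, which detects any overlap of the
two n-byte ranges and which I take to be the intent of the expression above):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Byte-wise helpers standing in for the vectorized loops.  */
static void
copy_forward (unsigned char *d, const unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
}

static void
copy_backward (unsigned char *d, const unsigned char *s, size_t n)
{
  while (n--)
    d[n] = s[n];
}

/* Sketch only, not the glibc code.  */
void *
overlap_aware_copy (void *dstv, const void *srcv, size_t n)
{
  unsigned char *dst = dstv;
  const unsigned char *src = srcv;
  uint64_t diff = (uintptr_t) dst - (uintptr_t) src;

  /* diff + n < 2n  <=>  -n < dst - src < n, i.e. the two n-byte ranges
     overlap.  One unsigned comparison, so the common non-overlapping
     case costs a single well-predicted branch.  */
  if (diff + (uint64_t) n < 2 * (uint64_t) n)
    {
      if (diff < (uint64_t) n)
        copy_backward (dst, src, n);  /* dst inside [src, src+n): go backward.  */
      else
        copy_forward (dst, src, n);   /* dst below src: forward is safe.  */
    }
  else
    copy_forward (dst, src, n);       /* No overlap: the plain memcpy fast path.  */

  return dstv;
}

int
main (void)
{
  char buf[] = "abcdefgh";
  overlap_aware_copy (buf + 2, buf, 6);  /* Overlapping, needs backward copy.  */
  puts (buf);                            /* Prints "ababcdef".  */
  return 0;
}

With that, an aliased memmove/memcpy pays one well-predicted branch in the common
non-overlapping case and still copies backward when it must.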
> > + mov %rdx, %rcx
> > + rep movsb
> > + ret
> > +
Did Haswell get an optimized movsb? If so, in which size range does it work well?
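For concreteness: the "optimized movsb" in question is ERMS (Enhanced REP
MOVSB/STOSB), reported in CPUID leaf 7, sub-leaf 0, EBX bit 9. A small
stand-alone check, not part of the patch:

#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* ERMS: CPUID.(EAX=7, ECX=0):EBX bit 9.  */
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 9)))
    puts ("ERMS present: rep movsb is the optimized path");
  else
    puts ("No ERMS: rep movsb is only worthwhile, if ever, for large copies");
  return 0;
}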