This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC V2] Improve 64-bit memcpy/memmove for Core i7 with unaligned AVX instructions


On Thu, Jul 11, 2013 at 05:59:30PM +0400, Liubov Dmitrieva wrote:
> We need to check performance on Core i7 with AVX before installing this.
> As far as I understand, you only checked on Haswell? But AVX works on
> more architectures than AVX2 does.
Using AVX for memcpy before Haswell is pointless: loads and stores are
split into 128-bit operations anyway, so going 256-bit only complicates
scheduling.
> 
> You forgot to fix the Copyright: s/2010/2013
> 
> --
> Liubov
> 
> On Thu, Jul 11, 2013 at 4:51 PM,  <ling.ma.program@gmail.com> wrote:
> > From: Ma Ling <ling.ml@alibaba-inc.com>
> >
> > We manage to avoid branch instructions and force the destination to be aligned
> > using AVX instructions. We modified gcc.403 so that we measure only the memcpy
> > function; the gcc.403 benchmarks indicate this version improves performance by
> > 4% to 16% across different cases.
> >
> > Best Regards
> > Ling
> > ---
> > In this version we did some clean-up work; thanks, Liubov.
> >
> >  sysdeps/x86_64/multiarch/Makefile                |   5 +-
> >  sysdeps/x86_64/multiarch/ifunc-defines.sym       |   2 +
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c       |  11 +
> >  sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S  | 409 +++++++++++++++++++++++
> >  sysdeps/x86_64/multiarch/memmove-avx-unaligned.S |   4 +
> >  sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S |   4 +
> >  6 files changed, 433 insertions(+), 2 deletions(-)
As the ifunc selector is not changed, this code will not be called at all.
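
To illustrate (a minimal, self-contained sketch of the GNU ifunc mechanism,
not the patch's or glibc's actual selector; the my_memcpy_* names are
hypothetical): whatever the resolver returns is the only implementation
that ever runs, so a new variant the resolver does not know about is dead
code.

    #include <stddef.h>
    #include <string.h>

    /* Stand-ins for the generic and AVX implementations.  */
    static void *
    my_memcpy_generic (void *dst, const void *src, size_t n)
    {
      return memcpy (dst, src, n);
    }

    static void *
    my_memcpy_avx (void *dst, const void *src, size_t n)
    {
      return memcpy (dst, src, n);
    }

    /* The resolver runs once at load time; unless it is taught to
       return the new AVX routine, that routine can never be selected.  */
    static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
    {
      __builtin_cpu_init ();
      return __builtin_cpu_supports ("avx")
             ? my_memcpy_avx : my_memcpy_generic;
    }

    void *my_memcpy (void *dst, const void *src, size_t n)
         __attribute__ ((ifunc ("resolve_my_memcpy")));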

> > +ENTRY (MEMCPY)
> > +       vzeroupper
Not needed.

> > +L(256bytesormore):
> > +
> > +#ifdef USE_AS_MEMMOVE
> > +       cmp     %rsi, %rdi
> > +       jae     L(copy_backward)
> > +#endif

Test with the following condition:

    (uint64_t) ((src - dest) - n) < 2 * n

It makes the branch predictable instead of using two unpredictable branches.
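
(For illustration only: a minimal C sketch of that idea, folding the whole
overlap test into one unsigned comparison.  The helper name and the exact
formulation, written here in terms of dest - src, are mine and not
necessarily the same expression as above.  For ordinary non-overlapping
calls the branch on its result is almost never taken, hence well predicted.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper, assuming n < 2^63.  The n-byte regions at src
       and dest overlap exactly when dest falls in (src - n, src + n),
       which is when (dest - src) + n wraps into [0, 2n); the boundary
       case dest == src - n is merely adjacent and is conservatively
       reported as overlapping.  */
    static int
    regions_overlap (const void *dest, const void *src, size_t n)
    {
      uint64_t delta = (uintptr_t) dest - (uintptr_t) src;
      return delta + (uint64_t) n < 2 * (uint64_t) n;
    }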

Also alias memmove_avx to memcpy_avx. They differ only when copying 256+
bytes, so the performance penalty of this check can be paid for by halving
memcpy's icache usage alone.



> > +       mov     %rdx, %rcx
> > +       rep     movsb
> > +       ret
> > +
Did Haswell get an optimized movsb? If so, in which size range does it work well?
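
(For context, a hedged sketch of the size-threshold dispatch that question
is about; the copy_bytes name and the 2048-byte crossover are placeholders,
and the real number is exactly what would have to be measured.)

    #include <stddef.h>
    #include <string.h>

    /* Placeholder crossover size, to be replaced by a measured value.  */
    #define REP_MOVSB_THRESHOLD 2048

    static void *
    copy_bytes (void *dst, const void *src, size_t n)
    {
      if (n >= REP_MOVSB_THRESHOLD)
        {
          /* rep movsb copies RCX bytes from [RSI] to [RDI]; on CPUs with
             an optimized string implementation this can beat a vector
             loop for large sizes.  */
          void *ret = dst;
          __asm__ __volatile__ ("rep movsb"
                                : "+D" (dst), "+S" (src), "+c" (n)
                                :
                                : "memory");
          return ret;
        }
      return memcpy (dst, src, n);   /* small sizes: the vectorized path */
    }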

