This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC V2] Improve 64bit memcpy/memmove for Core i7 with unaligned avx instruction


>> We need to check performance for Core i7 with AVX before installing this.
>> As far as I understood, you checked on Haswell only? But AVX works on
>> more architectures than AVX2.
>Using avx for memcpy before Haswell is pointless: stores and loads are
>128-bit anyway, and by going 256-bit you only complicate the scheduler.

But we can't name it the avx2 version and check the avx2 flag if it doesn't use avx2.
Probably we should introduce a Slow AVX flag and set it for CPUs before Haswell,
if you are sure that using AVX before Haswell is pointless.
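For illustration, a rough C sketch of the kind of dispatch such a flag would allow is below. The variant functions, the my_memcpy name, and the way "slow AVX" is detected are placeholders invented for this sketch; glibc's real selector is written in assembly and reads its own cpu-features bits. It assumes GCC 4.8+ on an ELF target.

    #include <stddef.h>
    #include <string.h>

    typedef void *(*memcpy_fn) (void *, const void *, size_t);

    /* Stand-ins for the real sse2/avx implementations; both just defer to
       the system memcpy so the sketch links.  */
    static void *
    memcpy_sse2_variant (void *dst, const void *src, size_t n)
    {
      return memcpy (dst, src, n);
    }

    static void *
    memcpy_avx_variant (void *dst, const void *src, size_t n)
    {
      return memcpy (dst, src, n);
    }

    /* "Slow AVX" placeholder: this sketch simply treats any CPU without
       AVX2 as pre-Haswell, i.e. a core whose loads and stores are 128-bit
       internally; glibc would set a dedicated cpu-features bit instead.  */
    static int
    cpu_has_slow_avx (void)
    {
      return !__builtin_cpu_supports ("avx2");
    }

    /* IFUNC resolver: runs once at relocation time and picks a variant.  */
    static memcpy_fn
    select_memcpy (void)
    {
      __builtin_cpu_init ();
      if (__builtin_cpu_supports ("avx") && !cpu_has_slow_avx ())
        return memcpy_avx_variant;
      return memcpy_sse2_variant;
    }

    void *my_memcpy (void *, const void *, size_t)
         __attribute__ ((ifunc ("select_memcpy")));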

--
Liubov

On Fri, Jul 12, 2013 at 7:23 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jul 11, 2013 at 05:59:30PM +0400, Liubov Dmitrieva wrote:
>> We need to check performance for Core i7 with AVX before installing this.
>> As far as I understood, you checked on Haswell only? But AVX works on
>> more architectures than AVX2.
> Using avx for memcpy before Haswell is pointless: stores and loads are
> 128-bit anyway, and by going 256-bit you only complicate the scheduler.
>>
>> You forgot to fix the Copyright: s/2010/2013
>>
>> --
>> Liubov
>>
>> On Thu, Jul 11, 2013 at 4:51 PM,  <ling.ma.program@gmail.com> wrote:
>> > From: Ma Ling <ling.ml@alibaba-inc.com>
>> >
>> > We manage to avoid branch instructions and force the destination to be aligned
>> > with avx instructions. We modified gcc.403 so that we measure only the memcpy function;
>> > the gcc.403 benchmarks indicate this version improves performance by 4% to 16% in different cases.
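The description above (no branches on the main copy path, destination forced to 32-byte alignment while loads stay unaligned) corresponds roughly to the following C sketch with AVX intrinsics. It is a simplified forward-copy illustration for n >= 64, not the code from the patch; compile with -mavx.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Simplified forward copy: unaligned AVX loads, stores forced onto a
       32-byte-aligned destination.  Assumes n >= 64 and a forward-safe
       direction.  */
    static void
    copy_forward_avx (char *dst, const char *src, size_t n)
    {
      /* Load head and tail up front so the unaligned edges need no branches.  */
      __m256i head = _mm256_loadu_si256 ((const __m256i *) src);
      __m256i tail = _mm256_loadu_si256 ((const __m256i *) (src + n - 32));
      _mm256_storeu_si256 ((__m256i *) dst, head);

      /* Skip past the head so that the destination becomes 32-byte aligned.  */
      size_t skew = 32 - ((uintptr_t) dst & 31);
      char *d = dst + skew;
      const char *s = src + skew;
      size_t left = n - skew;

      while (left >= 32)
        {
          _mm256_store_si256 ((__m256i *) d,
                              _mm256_loadu_si256 ((const __m256i *) s));
          d += 32;
          s += 32;
          left -= 32;
        }

      /* The preloaded tail covers whatever the loop left over.  */
      _mm256_storeu_si256 ((__m256i *) (dst + n - 32), tail);
    }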
>> >
>> > Best Regards
>> > Ling
>> > ---
>> > In this version we did clean-up work, thanks Liubov.
>> >
>> >  sysdeps/x86_64/multiarch/Makefile                |   5 +-
>> >  sysdeps/x86_64/multiarch/ifunc-defines.sym       |   2 +
>> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c       |  11 +
>> >  sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S  | 409 +++++++++++++++++++++++
>> >  sysdeps/x86_64/multiarch/memmove-avx-unaligned.S |   4 +
>> >  sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S |   4 +
>> >  6 files changed, 433 insertions(+), 2 deletions(-)
> As the ifunc selector is not changed, the new version will not be called at all.
>
>> > +ENTRY (MEMCPY)
>> > +       vzeroupper
> Not needed.
>
>> > +L(256bytesormore):
>> > +
>> > +#ifdef USE_AS_MEMMOVE
>> > +       cmp     %rsi, %rdi
>> > +       jae     L(copy_backward)
>> > +#endif
>
> Test with the following condition:
> (uint64_t)((src - dest)-n) < 2*n
> It makes the branch predictable instead of two unpredictable branches.
>
> Also alias memmove_avx to memcpy_avx. As they differ only when you copy 256+
> bytes, the performance penalty of this check can be paid for by halving
> memcpy icache usage alone.
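As an aside, here is a rough C sketch of a single-comparison overlap test of the kind suggested here, using one common formulation (not necessarily the exact expression quoted above):

    #include <stddef.h>
    #include <stdint.h>

    /* One unsigned comparison decides the copy direction: dst lies inside
       [src, src + n) exactly when the wrapping difference dst - src is
       smaller than n, so two signed range checks collapse into a single
       well-predicted branch.  For non-overlapping buffers the test is
       simply false and a forward copy is used.  */
    static int
    must_copy_backward (const void *dst, const void *src, size_t n)
    {
      return (uintptr_t) ((const char *) dst - (const char *) src) < n;
    }

With one cheap test like this at the top, memmove_avx and memcpy_avx can share the same body, which is what aliasing them buys: a single copy of the code in the instruction cache.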
>
>
>
>> > +       mov     %rdx, %rcx
>> > +       rep     movsb
>> > +       ret
>> > +
> Did Haswell get an optimized movsb? If so, in which size range does it work well?
>
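For reference, Intel CPUs report Enhanced REP MOVSB/STOSB (ERMSB) through CPUID leaf 7, EBX bit 9; whether rep movsb actually wins, and over which size range, still has to be measured. A minimal check using GCC's <cpuid.h> might look like this (it assumes CPUID leaf 7 is available):

    #include <cpuid.h>
    #include <stdio.h>

    /* CPUID.(EAX=7,ECX=0):EBX bit 9 is the ERMSB ("enhanced rep movsb/stosb")
       feature flag.  This only says the feature exists; the sizes at which
       rep movsb beats the vector path still have to be benchmarked.  */
    int
    main (void)
    {
      unsigned int eax, ebx, ecx, edx;
      __cpuid_count (7, 0, eax, ebx, ecx, edx);
      printf ("ERMSB: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
      return 0;
    }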

