This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction


Hi Ondřej Bílka,

We downloaded "kam.mff.cuni.cz/~ondra/memcpy_profile.tar.bz2" and ran
the benchmark without any modification, as you suggested, but the
results are all zero.

A Haswell processor can issue two loads and one store per cycle, and
each memory operation moves 32 bytes on an L1 hit. memcpy uses loads
and stores only in pairs, so in theory it can reach 32 bytes/cycle
when the data hits in L1.
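
The arithmetic behind that estimate can be sketched in C. This is only a back-of-the-envelope model: the helper name peak_copy_bytes_per_cycle is hypothetical, and it assumes every copied block costs exactly one load and one store and that nothing else limits the core.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on copy throughput, in bytes per cycle,
   given how many loads and stores the core can issue per cycle and how many
   bytes each memory operation moves on an L1 hit. */
static unsigned peak_copy_bytes_per_cycle(unsigned loads_per_cycle,
                                          unsigned stores_per_cycle,
                                          unsigned bytes_per_op)
{
    assert(bytes_per_op > 0);

    /* A copy needs one load and one store per block, so the scarcer of
       the two operations limits how many blocks complete per cycle. */
    unsigned pairs_per_cycle = loads_per_cycle < stores_per_cycle
                                   ? loads_per_cycle
                                   : stores_per_cycle;

    return pairs_per_cycle * bytes_per_op;
}
```

With the figures quoted above, peak_copy_bytes_per_cycle(2, 1, 32) evaluates to 32: the single store per cycle is the limit, and the second load slot goes unused.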

Thanks
Ling

2013/6/5, Ondřej Bílka <neleai@seznam.cz>:
> On Wed, Jun 05, 2013 at 05:23:08AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ling <ling.ml@alibaba-inc.com>
>>
>> This patch includes optimized 64bit memcpy/memmove for Corei7 with avx2
>> instruction.
>> It improves memcpy by up to 2X on Corei7, and memmove by up to 2x as
>> well.
>>
>> Any comments are appreciated.
>>
>> Thanks
>> Ling
>> ---
>
> Hi,
> I wrote an optimized memcpy. As far as avx/avx2 is concerned, I tried to
> improve memset with avx and it did not make a noticeable difference. I do
> not know what is optimal for Haswell.
>
> We need data. I wrote a profiler for memcpy; download it from:
>
> kam.mff.cuni.cz/~ondra/memcpy_profile.tar.bz2
>
> The simplest way to test your version is to copy it into the file
> variant/memcpy_new2.s and rename the function to memcpy_new_u
> (or add it to FILES in the makefile, update the variants.h file, and
> adjust test_sse to detect avx2).
>
> Run the test with the command:
> ./benchmarks
>
> Then browse result* directories like this:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile/results_gcc/result.html
>
> What matters for memcpy is being fastest in real-world usage, so
> I picked results_gcc, which measures time when memcpy is used by gcc.
> The graph shows how important it is to have fast code for small sizes,
> as in gcc they contribute most to the runtime.
> Then you can compare the benchmarks in the results_rand ... directories.
>
> Also, you missed vzeroupper; omitting it gives you a huge 60-cycle
> penalty per call.
>
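
To illustrate the vzeroupper point, here is a minimal C sketch using intrinsics. It is not the code from the patch: the names copy_avx2 and copy_sketch are hypothetical, the sketch assumes GCC or Clang on x86-64, and it falls back to plain memcpy elsewhere or when the CPU lacks avx2. The _mm256_zeroupper() intrinsic emits vzeroupper, which clears the upper halves of the ymm registers before control returns to code that may use legacy SSE instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>

/* Hypothetical example routine: copies non-overlapping buffers
   32 bytes at a time with unaligned 256-bit loads and stores. */
__attribute__((target("avx2")))
static void copy_avx2(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }

    /* Tail shorter than 32 bytes. */
    memcpy(dst + i, src + i, n - i);

    /* Issue vzeroupper before returning, so callers running legacy SSE
       code do not pay the AVX/SSE transition penalty. */
    _mm256_zeroupper();
}
#endif

/* Hypothetical entry point: uses the avx2 path only when the CPU
   reports support for it, otherwise defers to memcpy. */
void *copy_sketch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) {
        copy_avx2(dst, src, n);
        return dst;
    }
#endif
    return memcpy(dst, src, n);
}
```

In hand-written assembly the equivalent is a single vzeroupper instruction before each ret on the paths that touched ymm registers.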

