This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction


2014-05-16 4:22 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
> On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
>> If there are still any issues with the latest memcpy and memset, please
>> let us know.
>>
>> Thanks
>> Ling
>>
>> 2014-04-21 12:52 GMT+08:00, ling.ma.program@gmail.com
>> <ling.ma.program@gmail.com>:
>> > From: Ling Ma <ling.ml@alibaba-inc.com>
>> >
>> > In this patch we take advantage of HSW memory bandwidth: we reduce
>> > branch misprediction by avoiding branch instructions, and we force the
>> > destination to be aligned using AVX instructions.
>> >
>> > The CPU2006 403.gcc benchmark indicates this patch improves performance
>> > by 6% to 14%.
>> >
>> > This version only jumps backward for the memmove overlap case.
>> > Thanks to Ondra for his comments, and to Yuriy, who gave me a C code hint on it.
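
A minimal C sketch of the idea described above: one unaligned AVX store
covers the unaligned head, the main loop then uses destination-aligned
stores, and one overlapping unaligned store finishes the tail. The actual
patch is x86-64 assembly; the function name here is illustrative, and the
sketch assumes n >= 32 and non-overlapping buffers (memcpy semantics):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative only; compile with -mavx.  Assumes n >= 32 and
       non-overlapping src/dst.  */
    static void
    copy_fwd_avx (char *dst, const char *src, size_t n)
    {
      /* Head: one unaligned 32-byte store reaches past the next 32-byte
         boundary of dst, so no branches are needed for alignment.  */
      _mm256_storeu_si256 ((__m256i *) dst,
                           _mm256_loadu_si256 ((const __m256i *) src));
      size_t skew = 32 - ((uintptr_t) dst & 31);
      dst += skew; src += skew; n -= skew;

      /* Body: stores are now 32-byte aligned on the destination.  */
      while (n >= 32)
        {
          _mm256_store_si256 ((__m256i *) dst,
                              _mm256_loadu_si256 ((const __m256i *) src));
          dst += 32; src += 32; n -= 32;
        }

      /* Tail: one final unaligned store ending exactly at dst + n; it may
         rewrite bytes already copied, with the same values.  */
      if (n)
        _mm256_storeu_si256 ((__m256i *) (dst + n - 32),
                             _mm256_loadu_si256 ((const __m256i *) (src + n - 32)));
    }

    int main (void)
    {
      static char src[1000], dst[1000];
      memset (src, 'x', sizeof src);
      copy_fwd_avx (dst + 3, src + 5, 900);   /* deliberately misaligned */
      return (dst[3] == 'x' && dst[902] == 'x') ? 0 : 1;
    }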
>
> As of now it is slower: gcc compilation time becomes around
> 0.12% slower than the pending sse2 version and indistinguishable from the
> current version.
>
> I used a benchmark that measures the total running time of gcc for five
> hours and reports relative time and variance; you can get it here:
>
> http://kam.mff.cuni.cz/~ondra/memcpy_consistency_benchmark.tar.bz2
>
> The results I got on Haswell are:
>
>     memcpy-avx.so        100.25% +- 0.04%
>     memcpy-sse2.so       100.25% +- 0.04%
>     memcpy-sse2_v2.so    100.13% +- 0.07%
>     memcpy_fuse.so       100.00% +- 0.04%
>     memcpy_rep8.so       100.34% +- 0.13%
>     nul.so               100.95% +- 0.07%
>
> where I tried fusion and rep strategies like in memset, which helps.
>
> I also tried to measure it with my benchmark on different functions; it
> claims that the pending sse2 version is best on the gcc+gnuplot load. When I
> looked at the graph, it looks like it loses on heavily branching code until
> it gets down to small sizes, see
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx.html
> with profiler here
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx150514.tar.bz2
>
Ling: we moved less_16bytes to the code entry, so there is no degradation
for small sizes; attached are the code and your profiler:
http://www.yunos.org/tmp/memcpy_profile_avx0520.tar.gz
Meanwhile we also tested the pending memcpy; it is much better than the
original one, but AVX still gives us the best result for large inputs (you
can download and run it):
www.yunos.org/tmp/test.memcpy.memset.zip
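
A rough C rendering of the less_16bytes-at-entry change (an assumption
about the structure; the patch itself is assembly, and my_memcpy is an
illustrative name): sizes below 16 bytes are handled right at the entry,
before any alignment work, using two possibly-overlapping 8-byte moves.

    #include <string.h>
    #include <stddef.h>

    void *
    my_memcpy (void *dst, const void *src, size_t n)
    {
      char *d = dst;
      const char *s = src;

      /* Entry check: the common small sizes take no extra branches and
         skip the alignment logic entirely.  */
      if (n < 16)
        {
          if (n >= 8)
            {
              memcpy (d, s, 8);                  /* first 8 bytes */
              memcpy (d + n - 8, s + n - 8, 8);  /* last 8, may overlap */
            }
          else
            while (n--)
              *d++ = *s++;
          return dst;
        }

      /* Larger sizes fall through to the aligned AVX loop; plain libc
         memcpy stands in for it here.  */
      return memcpy (dst, src, n);
    }

    int main (void)
    {
      char buf[16] = { 0 };
      my_memcpy (buf, "entry fast path", 15);
      return buf[0] == 'e' ? 0 : 1;
    }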

Thanks
Ling

> Longer inputs are faster with avx2, but they do not occur that often.
>
> One reason for the inconsistent results is that memcpy affects stores after
> the call, depending on what pending stores it creates, and that cannot be
> measured with memcpy running time alone.
>
> As with memset, I checked big inputs; the loop is around 20% faster than
> rep movsq (see the sketch of the rep strategy after the test program below):
>
> time LD_PRELOAD=./memcpy-avx.so ./big
> time LD_PRELOAD=./memcpy_rep8.so ./big
>
> with the following program:
>
> #include <stdlib.h>
> #include <string.h>
>
> int main (void)
> {
>   int i;
>   /* Allocate 80 bytes of slack so the random 0-63 byte offset keeps
>      both misaligned pointers inside their buffers.  */
>   char *x = (char *) malloc (100000080) + rand () % 64;
>   char *y = (char *) malloc (100000080) + rand () % 64;
>
>   for (i = 0; i < 10; i++)
>     memcpy (x, y, 100000000);
>   return 0;
> }
>
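
For reference, the rep strategy measured above (memcpy_rep8.so) presumably
looks something like this minimal GNU C sketch; the function name and the
quadword/byte split are assumptions, not the benchmarked code:

    #include <stddef.h>
    #include <string.h>

    /* Copy with x86-64 string instructions: rep movsq moves RCX 8-byte
       words from (RSI) to (RDI), then rep movsb finishes the 0-7 byte
       tail.  */
    static void *
    memcpy_rep8 (void *dst, const void *src, size_t n)
    {
      void *d = dst;
      const void *s = src;
      size_t qwords = n >> 3;
      size_t tail = n & 7;

      asm volatile ("rep movsq"
                    : "+D" (d), "+S" (s), "+c" (qwords) : : "memory");
      asm volatile ("rep movsb"
                    : "+D" (d), "+S" (s), "+c" (tail) : : "memory");
      return dst;
    }

    int main (void)
    {
      char a[64] = "rep movsq then rep movsb", b[64];
      memcpy_rep8 (b, a, sizeof a);
      return memcmp (a, b, sizeof a) == 0 ? 0 : 1;
    }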

