This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction

From: Ondřej Bílka <neleai at seznam dot cz>
To: ling dot ma dot program at gmail dot com
Cc: libc-alpha at sourceware dot org, Ling <ling dot ml at alibaba-inc dot com>
Date: Wed, 5 Jun 2013 14:18:16 +0200
Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
References: <1370424188-4259-1-git-send-email-ling dot ml at alibaba-inc dot com>

On Wed, Jun 05, 2013 at 05:23:08AM -0400, ling.ma.program@gmail.com wrote:
> From: Ling <ling.ml@alibaba-inc.com>
> 
> This patch includes optimized 64bit memcpy/memmove for Corei7 with avx2 instruction.
> It improves memcpy by up to 2X on Corei7, and memmove by up to 2x as well.
> 
> Any comments are appreciated.
> 
> Thanks
> Ling
> ---

Hi,
I wrote an optimized memcpy. As far as avx/avx2 is concerned, I tried to
improve memset with avx and it did not make a noticeable difference. I do
not know what is optimal for Haswell.

We need data. I wrote a profiler for memcpy; you can download it from:

kam.mff.cuni.cz/~ondra/memcpy_profile.tar.bz2

The simplest way to test your version is to copy it into the file
variant/memcpy_new2.s and rename the function to memcpy_new_u
(or add it to FILES in the makefile, update the variants.h file and adjust
test_sse to detect avx2).
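
A minimal sketch of what such an avx2 check could look like in C. The
contents of test_sse are not shown here, so the function name cpu_has_avx2
is hypothetical, and the sketch assumes GCC or Clang on x86, where the
__builtin_cpu_supports builtin is available:

```c
/* Hypothetical helper, not taken from the profiler sources.
   Returns 1 if the CPU reports AVX2 support, 0 otherwise.
   On non-x86 targets or other compilers it conservatively returns 0. */
static int cpu_has_avx2(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                       /* initialise CPU feature data */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}
```

A benchmark driver could call this once at startup and skip the avx2
variant when it returns 0.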

The test is run with the command:
./benchmarks

Then browse the result* directories; the output looks like this:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile/results_gcc/result.html

What matters for memcpy is being fastest in real-world usage, so
I picked results_gcc, which measures the time when running in gcc.
You can see on the graph how important it is to have fast code for small
sizes, as in gcc they contribute most to the runtime.
Then you can compare the benchmarks in the results_rand ... directories.

Also, you missed vzeroupper; omitting it gives you a huge 60-cycle penalty
per call.
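
To illustrate where vzeroupper belongs, here is a rough C sketch using
intrinsics rather than the assembly from the patch. The names copy_sketch
and copy_avx_blocks are hypothetical, and the sketch assumes GCC or Clang
on x86. It clears the upper halves of the ymm registers with
_mm256_zeroupper() after the last 256-bit instruction and before any code
that may use legacy SSE instructions. Compilers may insert vzeroupper on
their own for C code; hand-written assembly has to do it explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define SKETCH_X86 1
#include <immintrin.h>
#else
#define SKETCH_X86 0
#endif

#if SKETCH_X86
/* Copies n bytes using 32-byte unaligned AVX loads and stores.
   Buffers must not overlap. */
static __attribute__((target("avx"))) void
copy_avx_blocks(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    /* Done with 256-bit registers: emit vzeroupper before returning to
       code that may execute legacy SSE instructions. */
    _mm256_zeroupper();
    /* Copy the remaining tail of fewer than 32 bytes. */
    memcpy(dst + i, src + i, n - i);
}
#endif

/* Hypothetical entry point: uses the AVX path when the CPU supports it,
   otherwise falls back to the library memcpy. */
void *copy_sketch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if SKETCH_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) {
        copy_avx_blocks((unsigned char *)dst, (const unsigned char *)src, n);
        return dst;
    }
#endif
    return memcpy(dst, src, n);
}
```

This is not meant to be fast; it only shows the placement of vzeroupper
relative to the 256-bit loads and stores.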
