This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Intel's new rte_memcpy()

From: Ondřej Bílka <neleai at seznam dot cz>
To: "H.J. Lu" <hjl dot tools at gmail dot com>
Cc: Luke Gorrie <luke at snabb dot co>, éå(åå) <ling dot ml at alibaba-inc dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Sat, 31 Jan 2015 19:48:37 +0100
Subject: Re: Intel's new rte_memcpy()
Authentication-results: sourceware.org; auth=none
References: <CAA2XHbendDcfydewf2nrpPQkSsDWPdEH0SMsnqZAFsLF9q4Fzg at mail dot gmail dot com> <CAMe9rOpELuXQLvHQLLAeZitTTcz-xeg=ROoDm0dHe-fg4m-Jew at mail dot gmail dot com>

On Fri, Jan 30, 2015 at 09:03:50AM -0800, H.J. Lu wrote:
> On Fri, Jan 30, 2015 at 5:52 AM, Luke Gorrie <luke@snabb.co> wrote:
> > Howdy!
> >
> > I am hoping for some feedback and advice for me as an application developer.
> >
> > Intel have recently posted a couple of memcpy() implementations and
> > suggested that these have significant advantages for networking
> > applications. There is one for Sandy Bridge and one for Haswell. The
> > proposal is that networking application developers would statically
> > link one or both of these into their applications instead of
> > dynamically linking with glibc. The proposal is part of their Data
> > Plane Development Kit (dpdk.org).
> >
> > They explain it much better than I do:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> >
> > and their code is here:
> > https://gist.github.com/lukego/efc82a15bde5ec83cb1b
> >
> > My question to the list is this:
> >
> > Should networking application developers adopt Intel's custom
> > implementation if (like me) they are absolutely dependent on good and
> > consistent performance of memcpy on all recent hardware (>= Sandy
> > Bridge) and Linux distributions? (and then -- what to do about
> > memmove?)
> >
> > I have done some cursory benchmarks with cachebench:
> > http://dpdk.org/ml/archives/dev/2015-January/011574.html
> >
> > ... with a correction to the rte_memcpy on Haswell results:
> > http://dpdk.org/ml/archives/dev/2015-January/011691.html
> >
> 
Definitely not. You would need a much more sophisticated memcpy that does
runtime profiling per call site to get a consistent speedup. There are
several alternatives; one could be 50% faster than the others, but you
need runtime data to know which one.

As stated in the original post, cachebench is a pretty bad benchmark. If
you randomize sizes and alignment, as my profiler does, Intel's
implementation is around 10% slower in the 1-1000 byte range.
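
To make "randomize sizes and alignment" concrete, here is a minimal sketch
of such a loop. It is not the profiler linked further down; the names
copy_fn, rng_next and bench_random are hypothetical, and the 64-byte offset
range and the use of clock() are assumptions made to keep it short. It only
illustrates the idea that each call sees a different size and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type for the copy routine under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Keeps the copies observable so the compiler cannot drop them. */
static volatile unsigned char sink;

/* Small xorshift PRNG so that runs are reproducible for a given seed. */
static uint32_t rng_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Runs `iters` copies with a size drawn from [1, max_size] and source and
   destination offsets drawn from [0, 63]. Returns the average CPU time per
   copy in nanoseconds, or -1.0 on bad arguments or allocation failure. */
static double bench_random(copy_fn fn, size_t max_size, size_t iters,
                           uint32_t seed)
{
    if (max_size == 0 || iters == 0)
        return -1.0;
    if (seed == 0)
        seed = 1;                 /* xorshift must not start at zero */

    size_t span = max_size + 64;  /* room for the largest offset */
    unsigned char *src = malloc(span);
    unsigned char *dst = malloc(span);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, span);
    memset(dst, 0, span);

    clock_t t0 = clock();
    for (size_t i = 0; i < iters; i++) {
        size_t n = 1 + rng_next(&seed) % max_size;
        size_t src_off = rng_next(&seed) % 64;
        size_t dst_off = rng_next(&seed) % 64;
        fn(dst + dst_off, src + src_off, n);
    }
    clock_t t1 = clock();

    unsigned char sum = 0;
    for (size_t i = 0; i < span; i++)
        sum ^= dst[i];
    sink = sum;

    free(src);
    free(dst);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling bench_random(memcpy, 1000, 1000000, 12345) and then the same with
another implementation gives two numbers to compare. Both buffers stay in
L1 cache here, so this still has the cache flaw discussed next.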

For bigger sizes the benchmark speedup is questionable for a similar
reason. It assumes that the data is in L1 cache, which in reality does not
happen that often for larger sizes, as you can fit only four 8 KB buffers
in a 32 KB L1 cache.

While the new AVX2 implementation with an 8 KB block is around 10% faster
in L1 cache, it is also around 10% slower when the data is in L2 cache and
beyond; see the graphs below (switch to 16 block mode). It looks like
rep movsb is best for copying L2+ data, so you need to know where in your
application the threshold is at which that happens.

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L2/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L3/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_nocache/result.html
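
As an illustration of the threshold idea above, here is a minimal sketch of
a dispatcher that switches to rep movsb above a cutoff. The names
copy_rep_movsb and copy_with_threshold are hypothetical, and using the copy
size as a proxy for "the data is in L2 or beyond" is an assumption; the
threshold value has to come from measuring your own application. The inline
assembly assumes GCC or Clang on x86 and falls back to memcpy elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies n bytes with the x86 string instruction where available. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *ret = dst;
    /* rep movsb copies the count in (r/e)cx from [(r/e)si] to [(r/e)di];
       the calling convention leaves the direction flag clear on entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
#else
    return memcpy(dst, src, n);   /* portable fallback */
#endif
}

/* Hypothetical dispatcher: `threshold` is a measured, per-application
   value, not a constant taken from this thread. */
static void *copy_with_threshold(void *dst, const void *src, size_t n,
                                 size_t threshold)
{
    if (n >= threshold)
        return copy_rep_movsb(dst, src, n);
    return memcpy(dst, src, n);
}
```

Like memcpy, this requires that the two buffers do not overlap.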

While my benchmarks are more accurate, they are still flawed in several
ways, as they do not measure real workloads. I looked at the code and it
is badly optimized for small sizes. It would cause a performance
regression when compiling with gcc; see the following profiles.

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_gcc/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile/results_gcc/result.html

The final problem is always inlining memcpy. It is already a problem in
gcc, which does too much inline expansion of memcpy with suboptimal code.
While this may benefit the few hottest memcpy callers, it harms
performance for the others. One copy of rte_memcpy takes 8 KB, and when it
is not in the instruction cache you pay a 300 cycle performance penalty;
see the following benchmark, which simulates that situation.

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile/results_rand_noicache/result.html

The full profiler is here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile310115.tar.bz2

My suggestion is simple: test it. Take your application, profile it to
identify the most frequent memcpy caller, replace that call with
rte_memcpy, run your application to see whether that gives a performance
gain, and repeat as necessary. I cannot know which implementation is best
for your workload until I see what workload you use.
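
One way to find the most frequent memcpy caller, if a sampling profiler is
not at hand, is to count calls per call site. The following is a minimal
sketch under stated assumptions: PROFILED_MEMCPY, profiled_memcpy,
find_site and dump_sites are hypothetical names, the table is fixed-size
and not thread-safe, and sizes are bucketed by powers of two. You would
substitute the macro for memcpy in the code you want to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_SITES 64
#define SIZE_BUCKETS 16  /* bucket b counts sizes in [2^b, 2^(b+1)) */

struct copy_site {
    const char *file;
    int line;
    unsigned long calls;
    unsigned long bytes;
    unsigned long hist[SIZE_BUCKETS];
};

static struct copy_site sites[MAX_SITES];
static int n_sites;

/* Floor of log2(n), clamped to the last bucket; sizes 0 and 1 map to 0. */
static int size_bucket(size_t n)
{
    int b = 0;
    while (n > 1 && b < SIZE_BUCKETS - 1) {
        n >>= 1;
        b++;
    }
    return b;
}

/* Linear search is fine for a sketch; returns NULL when the table is full. */
static struct copy_site *find_site(const char *file, int line)
{
    for (int i = 0; i < n_sites; i++)
        if (sites[i].line == line && strcmp(sites[i].file, file) == 0)
            return &sites[i];
    if (n_sites == MAX_SITES)
        return NULL;
    sites[n_sites].file = file;
    sites[n_sites].line = line;
    return &sites[n_sites++];
}

static void *profiled_memcpy(void *dst, const void *src, size_t n,
                             const char *file, int line)
{
    struct copy_site *s = find_site(file, line);
    if (s != NULL) {
        s->calls++;
        s->bytes += n;
        s->hist[size_bucket(n)]++;
    }
    return memcpy(dst, src, n);
}

#define PROFILED_MEMCPY(d, s, n) \
    profiled_memcpy((d), (s), (n), __FILE__, __LINE__)

/* Prints one line per call site: location, call count, total bytes. */
static void dump_sites(FILE *out)
{
    for (int i = 0; i < n_sites; i++)
        fprintf(out, "%s:%d calls=%lu bytes=%lu\n", sites[i].file,
                sites[i].line, sites[i].calls, sites[i].bytes);
}
```

Calling dump_sites(stderr) at exit shows which sites dominate and at which
sizes; those are the candidates to try with another implementation.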

> I import it to hjl/memcpy branch at
> 
> https://sourceware.org/git/?p=glibc.git;a=summary
> 
> Here is the bench-memcpy comparison against __memcpy_avx_unaligned
> on Haswell:
> 
No, these benchmarks are junk, as I mentioned in several previous
threads.

References:
- Intel's new rte_memcpy()
  - From: Luke Gorrie
- Re: Intel's new rte_memcpy()
  - From: H.J. Lu
