This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 0/2] Multiarch hooks for memcpy variants

From: Patrick McGehearty <patrick dot mcgehearty at oracle dot com>
To: libc-alpha at sourceware dot org
Date: Tue, 15 Aug 2017 15:11:26 -0500
Subject: Re: [PATCH 0/2] Multiarch hooks for memcpy variants
Authentication-results: sourceware.org; auth=none
References: <DB6PR0801MB20534ED1010DDF1B033821EE83890@DB6PR0801MB2053.eurprd08.prod.outlook.com> <18d2fdf8-ca55-1ded-fa66-3509b3bcf8fe@gotplt.org> <598DF02B.8010607@arm.com> <CAKCAbMg27DXDe=5vCCtBAW-g5BUkHKPb=_VTV7kr6cq_U91-Cg@mail.gmail.com> <4072a19f-eecb-8cdd-889f-46b4c8b968b4@gotplt.org> <CAKCAbMh8=u27ZcS9La4SdQ3UiHi76TZdv_KSCpX0pkY8WMohOQ@mail.gmail.com> <DB6PR0801MB20538D64F211A965ED3E806D838C0@DB6PR0801MB2053.eurprd08.prod.outlook.com> <8ce803fd-37d2-d249-9953-1ad60be34518@gotplt.org> <DB6PR0801MB20534DB60FC61454853BE557838C0@DB6PR0801MB2053.eurprd08.prod.outlook.com>

On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:

Siddhesh Poyarekar wrote:

The first part is not true for falkor since its implementation is a good
10-15% faster on the falkor chip due to its design differences.  glibc
makes pretty extensive use of memcpy throughout, but I don't have data
on how much difference a core-specific memcpy will make there, so I
don't have enough grounds for a generic change.

66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
for these small sizes (there is very little you can do different), that's at most 1
cycle faster, so the PLT indirection is going to be more expensive.

It is important to be careful about overemphasizing the frequency of short memcpy calls. Even though a high percentage of memcpy calls are short, my experience is that a high percentage of the time spent in memcpy is on longer copies.

The following is just that, an example, not an expression of any specific real application behavior: if 66% of calls are <=16 bytes (average length 8, say) but the average length of the remaining 1/3 of calls is 1K bytes (i.e. > 100 times as long), then the vast majority of time in memcpy would be in the longer copies.
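The arithmetic behind that example can be sketched in a few lines of C. This is only an illustration under a deliberately crude cost model that is an assumption here, not something measured: each call costs a fixed overhead plus one unit per byte copied. The function name and the overhead parameter are hypothetical.

```c
/* Fraction of total memcpy time spent in the "long" copies, under an
 * assumed cost model: cost(len) = per_call_overhead + len.
 * short_frac is the fraction of calls that are short; the rest are long.
 * All names and the cost model itself are illustrative assumptions. */
double long_copy_time_share(double short_frac, double short_len,
                            double long_len, double per_call_overhead)
{
    double long_frac  = 1.0 - short_frac;
    double short_time = short_frac * (per_call_overhead + short_len);
    double long_time  = long_frac  * (per_call_overhead + long_len);
    return long_time / (short_time + long_time);
}
```

Calling `long_copy_time_share(0.66, 8.0, 1024.0, 0.0)` with the numbers from the example gives roughly 98.5%, and the share stays above 90% even if a per-call overhead equivalent to 16 bytes is assumed.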

My experience with tuning libc memcpy, off and on, on multiple platforms is that copies of length > 256 bytes are the ones that affect overall application performance. Really short copies where the length and/or alignment might be known at compile time are best handled by inlining the copy.
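As a sketch of what a compile-time-known length looks like in practice: when the size argument is a small constant, compilers such as GCC and Clang commonly expand the memcpy call inline as plain loads and stores when optimizing, so no library call (and no PLT indirection) is involved. Whether that happens depends on the compiler and flags; the helper names below are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a possibly unaligned buffer.
 * sizeof v is a compile-time constant, so an optimizing compiler
 * is free to replace the memcpy with a single load. */
static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 64-bit value to a possibly unaligned buffer. */
static inline void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Inspecting the generated assembly (for example with `-O2 -S`) is the way to confirm whether a given compiler actually inlined the copy.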

I've produced platform-specific optimizations for memcpy many times over the years. By platform specific, I mean different code for different generations/platforms of the same architecture. These versions have shown improvements from as little as 10% to as much as 250%, depending on how close the memory architecture of the latest platform is to the prior platform.

Typical factors that can influence best memcpy performance on a specific platform for a given architecture include:

  ideal prefetch distance ... depends on processor speed, cache/memory latency,
    depth of memory subsystem queues, details of memory subsystem priorities
    for prefetch vs demand fetch, and more
  number of alu operations that can be issued per cycle
  number of memory operations that can be issued per cycle
  number of total instructions that can be issued per cycle
  branch misprediction latency; branch predictor behavior; other branch issues
  and many other architectural features which can make occasional differences
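To make the first factor concrete, here is a minimal sketch of a copy loop with a tunable prefetch distance. It relies on `__builtin_prefetch`, which is a GCC/Clang builtin, so it assumes one of those compilers. The function name, the 256-byte default distance, and the 64-byte cache-line size are placeholders chosen for illustration, not recommended values; finding the right distance for a given platform is exactly the tuning work described above.

```c
#include <stddef.h>

/* Hypothetical tuning knobs; real values are platform dependent. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 256   /* bytes ahead of the current position */
#endif
#define CACHE_LINE 64           /* assumed cache-line size in bytes */

/* Byte-wise copy that issues one read prefetch per assumed cache line,
 * PREFETCH_DISTANCE bytes ahead, as long as that address is still
 * inside the source buffer. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        if (i % CACHE_LINE == 0 && n - i > PREFETCH_DISTANCE)
            __builtin_prefetch(s + i + PREFETCH_DISTANCE, 0, 0);
        d[i] = s[i];
    }
    return dst;
}
```

A production routine would copy in much wider units and unroll the loop; the byte loop here only keeps the prefetch logic easy to see.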

I find it hard to imagine a single generic memcpy library routine that can match the performance of a platform-specific tuned routine over a typical range of copy lengths, assuming the architecture has been around long enough to go through several semiconductor process redesigns. With dynamic linking, the overhead of using platform-specific code for something frequently called should be relatively minimal.
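The idea of choosing a platform-specific routine once and then calling it through an indirection can be sketched portably with a function pointer. Note that this is a stand-in for the mechanism, not how glibc does it: glibc uses IFUNC symbols resolved by the dynamic linker. The `cpu_kind` enum, the selector, and the "tuned" variant below are all hypothetical, and a real selector would query the CPU rather than take an argument.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical platform identifiers. */
enum cpu_kind { CPU_GENERIC, CPU_TUNED };

/* Generic fallback: simple byte loop. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Stand-in for a platform-tuned variant: 8 bytes at a time, then the tail. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);
        memcpy(d, &w, 8);
        d += 8; s += 8; n -= 8;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Pick an implementation once; callers then pay only an indirect call. */
copy_fn select_copy(enum cpu_kind kind)
{
    return kind == CPU_TUNED ? copy_tuned : copy_generic;
}
```

A caller would store the result of `select_copy` once at startup and call through it afterwards, so the selection cost is not paid per copy; the remaining per-call cost is the indirect call itself.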

I do agree a good generic version should be available, as the effort of finding the best tuning for a particular platform can take weeks and not all architecture/platform combinations will get that intense attention.

- patrick mcgehearty

Your last point about hurting everything else is very valid though; it's
very likely that adding an extra indirection in cases where
__memcpy_generic is going to be called anyway is going to be expensive
given that a bulk of the memcpy calls will be for small sizes of less
than 1k.

Note that the falkor version does quite well in memcpy-random across several
micro architectures so I think parts of it could be moved into the generic code.

Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
waiver in check_localplt and that would become a blanket OK for PLT
usage for memcpy, which we don't want.  Hence my patch is probably the
best compromise, especially since there is precedent for the approach in
x86.

I still can't see any reason to even support these entry points in GLIBC, let
alone optimize them using ifuncs. The _chk functions should obviously be
inlined to avoid all the target specific complexity for no benefit. I think this
could trivially be done via the GLIBC headers already. (That's assuming they
are in any way performance critical.)

Wilco
