
Re: [PATCH 0/2] Multiarch hooks for memcpy variants


On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
> Siddhesh Poyarekar wrote:
>> The first part is not true for falkor since its implementation is a good
>> 10-15% faster on the falkor chip due to its design differences.  glibc
>> makes pretty extensive use of memcpy throughout, but I don't have data
>> on how much difference a core-specific memcpy will make there, so I
>> don't have enough grounds for a generic change.
> 66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
> for these small sizes (there is very little you can do different), that's at most 1
> cycle faster, so the PLT indirection is going to be more expensive.
Be careful not to overemphasize the frequency of short memcpy calls. Even though
a high percentage of memcpy calls are short, my experience is that a high
percentage of the time spent in memcpy is on longer copies.

The following example is just that, an example, not a description of any specific
real application's behavior: if 66% of calls are <=16 bytes (average length 8, say)
but the remaining third of calls average 1K bytes (i.e. more than 100 times as long),
then the vast majority of the time in memcpy is spent in the longer copies.
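
To put rough numbers on that hypothetical (all figures here are illustrative,
not measurements), the byte-weighted share works out as follows:

/* Back-of-the-envelope for the hypothetical mix above: 2/3 of calls
   average 8 bytes, 1/3 average 1024 bytes.  All numbers illustrative.  */
#include <stdio.h>

int
main (void)
{
  double short_bytes = (2.0 / 3.0) * 8.0;    /* avg bytes/call, short copies */
  double long_bytes = (1.0 / 3.0) * 1024.0;  /* avg bytes/call, long copies */
  /* Prints ~0.985: roughly 98% of all bytes copied come from the long
     calls, so that is where memcpy spends nearly all of its time.  */
  printf ("long-copy share of bytes: %.3f\n",
          long_bytes / (short_bytes + long_bytes));
  return 0;
}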

My experience from tuning libc memcpy off and on across multiple platforms is that
copies longer than 256 bytes are the ones that affect overall application performance.
Really short copies, where the length and/or alignment may be known at compile time,
are best handled by inlining the copy, as in the sketch below.
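
A minimal illustration of that inlining (nothing platform-specific is assumed
here, and copy_hdr is a made-up name):

#include <string.h>
#include <stdint.h>

struct hdr { uint64_t a, b; };

/* With the size (16 bytes) and alignment known at compile time, GCC and
   Clang at -O2 lower this memcpy to two 8-byte load/store pairs, so no
   call -- and hence no PLT or ifunc dispatch -- happens at all.  */
void
copy_hdr (struct hdr *dst, const struct hdr *src)
{
  memcpy (dst, src, sizeof *dst);
}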

I've produced platform-specific optimizations for memcpy many times over the years. By platform-specific, I mean different code for different generations/platforms of the same architecture. These versions have shown improvements from as little as 10% to as much as 250%, depending on how close the memory architecture of the latest platform is to that of the prior one.

Typical factors that influence the best memcpy implementation for a specific platform of a given architecture include:
- ideal prefetch distance, which depends on processor speed, cache/memory latency, depth of the memory subsystem queues, details of memory subsystem priorities for prefetch vs. demand fetch, and more (see the sketch after this list)
- number of ALU operations that can be issued per cycle
- number of memory operations that can be issued per cycle
- total number of instructions that can be issued per cycle
- branch misprediction latency, branch predictor behavior, and other branch issues
- many other architectural features that can make occasional differences
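
As a sketch of the prefetch-distance point (the 256-byte distance below is a
placeholder, not a recommendation for any particular core):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knob: the right value depends on clock speed,
   memory latency and queue depths, which is exactly why one setting
   rarely suits every platform of an architecture.  */
#define PREFETCH_DISTANCE 256

void
copy_prefetched (uint8_t *dst, const uint8_t *src, size_t n)
{
  size_t i;
  /* Copy a cache line at a time, prefetching ahead of the loads.  */
  for (i = 0; i + 64 <= n; i += 64)
    {
      __builtin_prefetch (src + i + PREFETCH_DISTANCE, 0, 0);
      memcpy (dst + i, src + i, 64);   /* constant size: inlined */
    }
  for (; i < n; i++)   /* byte-wise tail */
    dst[i] = src[i];
}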

I find it hard to imagine a single generic memcpy library routine that can match the performance of a platform-specific tuned routine over a typical range of copy lengths, assuming the architecture has been around long enough to go through several semiconductor process redesigns. With dynamic linking, the overhead of selecting platform-specific code for something so frequently called should be relatively minimal.
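
For reference, a minimal sketch of the ifunc mechanism glibc uses for that
selection; is_falkor() stands in for whatever CPU detection a real port would
do, and all the names here are invented:

#include <stddef.h>

extern void *memcpy_generic (void *, const void *, size_t);
extern void *memcpy_falkor (void *, const void *, size_t);
extern int is_falkor (void);   /* hypothetical CPU check */

/* The resolver runs once, during relocation; afterwards every call goes
   straight to the chosen variant, so the steady-state cost is just the
   usual indirect branch.  */
static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
{
  return is_falkor () ? memcpy_falkor : memcpy_generic;
}

void *my_memcpy (void *dst, const void *src, size_t n)
     __attribute__ ((ifunc ("resolve_memcpy")));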

I do agree that a good generic version should be available, as finding the best tuning
for a particular platform can take weeks, and not every architecture/platform combination
will get that intense attention.

- patrick mcgehearty


>> Your last point about hurting everything else is very valid though; it's
>> very likely that adding an extra indirection in cases where
>> __memcpy_generic is going to be called anyway is going to be expensive
>> given that a bulk of the memcpy calls will be for small sizes of less
>> than 1k.
> Note that the falkor version does quite well in memcpy-random across several
> micro architectures so I think parts of it could be moved into the generic code.
>
>> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
>> waiver in check_localplt and that would become a blanket OK for PLT
>> usage for memcpy, which we don't want.  Hence my patch is probably the
>> best compromise, especially since there is precedent for the approach in
>> x86.
> I still can't see any reason to even support these entry points in GLIBC, let
> alone optimize them using ifuncs. The _chk functions should obviously be
> inlined to avoid all the target specific complexity for no benefit. I think this
> could trivially be done via the GLIBC headers already. (That's assuming they
> are in any way performance critical.)
>
> Wilco

