This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 0/2] Multiarch hooks for memcpy variants
On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
Siddhesh Poyarekar wrote:
The first part is not true for falkor since its implementation is a good
10-15% faster on the falkor chip due to its design differences. glibc
makes pretty extensive use of memcpy throughout, but I don't have data
on how much difference a core-specific memcpy will make there, so I
don't have enough grounds for a generic change.
66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
for these small sizes (there is very little you can do differently),
that's at most 1 cycle faster, so the PLT indirection is going to be
more expensive.
It is important to be careful about overemphasizing the frequency of
short memcpy calls.
Even though a high percentage of memcpy calls are short, my experience
is that a high percentage of time spent in memcpy is on longer copies.
The following example is just that, an example, not an expression of
any specific real application behavior:
If 66% of calls are <=16 bytes (average length=8, say) but the average
length of the remaining 1/3 of calls is 1K bytes (i.e. > 100 times as
long), then the vast majority of time in memcpy would be in the longer
copies.
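
To put a number on the illustrative case above, here is a minimal C
sketch. It uses bytes copied as a rough proxy for time spent, which is
an assumption (short copies pay proportionally more fixed overhead per
call), and the function name is hypothetical:

```c
#include <assert.h>

/* Share of all copied bytes that belongs to the "long" class of calls,
   given the fraction of calls that are short and the average length of
   each class.  Bytes copied is only a rough stand-in for time spent. */
double long_copy_byte_share(double short_call_fraction,
                            double avg_short_len, double avg_long_len)
{
    assert(short_call_fraction >= 0.0 && short_call_fraction <= 1.0);

    double short_bytes = short_call_fraction * avg_short_len;
    double long_bytes = (1.0 - short_call_fraction) * avg_long_len;

    assert(short_bytes + long_bytes > 0.0);
    return long_bytes / (short_bytes + long_bytes);
}
```

With the numbers from the example (66% short calls, average lengths of 8
and 1024 bytes), the long copies account for a little over 98% of the
bytes moved.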
My experience with tuning libc memcpy on and off across multiple
platforms is that copies of length > 256 bytes are the ones that affect
overall application performance. Really short copies where the length
and/or alignment might be known at compile time are best handled by
inlining the copy.
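
As an illustration of the inlining point, here is a small C sketch. The
helper name is hypothetical. Because the length is a compile-time
constant, GCC and Clang will typically expand the memcpy inline when
optimizing instead of emitting a library call, though that is up to the
compiler and the optimization level:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit value from a possibly unaligned
   buffer.  The length is the constant sizeof v, so an optimizing
   compiler can usually replace the memcpy call with a plain load. */
static inline uint64_t load_u64(const void *p)
{
    assert(p != NULL);

    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```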
I've produced platform specific optimizations for memcpy many times
over the years. By platform specific, I mean different code for
different generations/platforms of the same architecture.
These versions have shown improvements from as little as 10% to as much
as 250% depending on how close the memory architecture of the latest
platform is to the prior platform.
Typical factors that can influence the best memcpy performance on a
specific platform for a given architecture include:
  - ideal prefetch distance ... depends on processor speed, cache/memory
    latency, depth of memory subsystem queues, details of memory
    subsystem priorities for prefetch vs demand fetch, and more
  - number of ALU operations that can be issued per cycle
  - number of memory operations that can be issued per cycle
  - number of total instructions that can be issued per cycle
  - branch misprediction latency; branch predictor behavior; other
    branch issues
  - and many other architectural features which can make occasional
    differences
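
To make the prefetch distance factor concrete, here is a rough C sketch
of a copy loop with a tunable prefetch distance. CACHE_LINE and
PREFETCH_DISTANCE are assumed values chosen only for illustration, the
function name is hypothetical, and __builtin_prefetch is a GCC/Clang
builtin. A real tuned memcpy would be written quite differently; this
only shows where the platform-dependent constant enters:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed values for illustration; the best numbers differ per platform. */
#define CACHE_LINE 64
#define PREFETCH_DISTANCE (8 * CACHE_LINE)

void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + CACHE_LINE <= n; i += CACHE_LINE) {
        /* Request the line PREFETCH_DISTANCE bytes ahead of the current
           position (read access, low temporal locality), but only
           while that address is still inside the source buffer. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(s + i + PREFETCH_DISTANCE, 0, 0);
        memcpy(d + i, s + i, CACHE_LINE);
    }

    /* Copy the remaining tail of fewer than CACHE_LINE bytes. */
    memcpy(d + i, s + i, n - i);
    return dst;
}
```

Too short a distance and the data has not arrived when the loop reaches
it; too long and the lines may be evicted before use, which is why this
one constant tends to need retuning from platform to platform.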
I find it hard to imagine a single generic memcpy library routine that
can match the performance of a platform specific tuned routine over a
typical range of copy lengths, assuming the architecture has been
around long enough to go through several semiconductor process
redesigns.
With dynamic linking, the overhead of using platform specific code for
something frequently called should be relatively minimal.
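
As a rough picture of what selecting platform specific code at run time
involves, here is a portable C sketch using a function pointer chosen on
first use. This is a simplified stand-in, not how glibc does it: glibc
multiarch uses GNU indirect functions (IFUNC), where the choice is made
once by the dynamic linker and there is no per-call check. All names
here are hypothetical, and both variants simply forward to memcpy:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Stand-ins for a generic routine and a platform-tuned routine. */
static void *copy_generic(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

static void *copy_tuned(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Hypothetical probe; a real resolver would inspect the CPU model or
   hardware capability bits. */
static int platform_has_tuned_copy(void)
{
    return 0;
}

static copy_fn select_copy(void)
{
    return platform_has_tuned_copy() ? copy_tuned : copy_generic;
}

static copy_fn active_copy;

void *dispatch_copy(void *d, const void *s, size_t n)
{
    /* Resolve once, then pay only an indirect call on later uses. */
    if (active_copy == NULL)
        active_copy = select_copy();
    assert(active_copy != NULL);
    return active_copy(d, s, n);
}
```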
I do agree a good generic version should be available as the effort of
finding the best tuning for a particular platform can take weeks and
not all architecture/platform combinations will get that intense
attention.
- patrick mcgehearty
Your last point about hurting everything else is very valid though; it's
very likely that adding an extra indirection in cases where
__memcpy_generic is going to be called anyway is going to be expensive
given that the bulk of the memcpy calls will be for small sizes of less
than 1k.
Note that the falkor version does quite well in memcpy-random across several
micro architectures so I think parts of it could be moved into the generic code.
Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
waiver in check_localplt and that would become a blanket OK for PLT
usage for memcpy, which we don't want. Hence my patch is probably the
best compromise, especially since there is precedent for the approach in
x86.
I still can't see any reason to even support these entry points in GLIBC, let
alone optimize them using ifuncs. The _chk functions should obviously be
inlined to avoid all the target specific complexity for no benefit. I think this
could trivially be done via the GLIBC headers already. (That's assuming they
are in any way performance critical.)
Wilco
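
For readers unfamiliar with the idea of handling the _chk variants
inline from a header, here is a minimal C sketch. It is not glibc's
actual fortify implementation; checked_memcpy and CHECKED_MEMCPY are
hypothetical names, and __builtin_object_size is a GCC/Clang builtin:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header-style wrapper: do the bounds check inline, then
   call plain memcpy, so no separate checked entry point is needed. */
static inline void *checked_memcpy(void *dst, const void *src, size_t n,
                                   size_t dst_size)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* dst_size is (size_t)-1 when the compiler cannot determine the
       destination object size, in which case the check never fires. */
    if (n > dst_size)
        abort();
    return memcpy(dst, src, n);
}

#define CHECKED_MEMCPY(dst, src, n) \
    checked_memcpy((dst), (src), (n), __builtin_object_size((dst), 0))
```

When the sizes are known at compile time, the check can usually be
folded away entirely, leaving an ordinary memcpy call.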