This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Add x86-64 memmove with unaligned load/store and rep movsb
- From: Carlos O'Donell <carlos at redhat dot com>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 29 Mar 2016 13:41:55 -0400
- Subject: Re: [PATCH] Add x86-64 memmove with unaligned load/store and rep movsb
- Authentication-results: sourceware.org; auth=none
- References: <CAMe9rOopQ5rUGgH2vu9Xwe02Qw0UNrVNCNOAakiV7h0ukciMtQ at mail dot gmail dot com>
On 03/29/2016 12:58 PM, H.J. Lu wrote:
> The goal of this patch is to replace SSE2 memcpy.S,
> memcpy-avx-unaligned.S and memmove-avx-unaligned.S as well as
> provide SSE2 memmove with faster alternatives. bench-memcpy and
> bench-memmove data on various Intel and AMD processors are at
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=19776
>
> Any comments, feedbacks?
I assume this is a WIP? I don't see how this code replaces the memcpy@GLIBC_2.14
IFUNC we're currently using, or redirects the IFUNC to use your new functions
under certain conditions.
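For context on the dispatch being discussed: the load-time selection glibc performs through an IFUNC can be sketched with a plain function-pointer resolver. The variant names here (memcpy_avx_unaligned, memcpy_sse2_unaligned) are hypothetical placeholders whose bodies just defer to the system memcpy; glibc's real resolver consults its internal cpu-features data rather than __builtin_cpu_supports.

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Hypothetical variant names; the bodies are placeholders that defer
   to the system memcpy so the sketch is runnable.  */
static void *
memcpy_avx_unaligned (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
memcpy_sse2_unaligned (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Resolver: pick one implementation based on CPU features.  glibc
   runs the equivalent once at load time via an IFUNC, so every later
   call goes straight to the chosen variant.  */
static memcpy_fn
resolve_memcpy (void)
{
#if defined (__x86_64__) || defined (__i386__)
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx"))
    return memcpy_avx_unaligned;
#endif
  return memcpy_sse2_unaligned;
}
```

The point of the questions above is that whatever the resolver's selection logic is, the patch has to hook into it; new implementations that the IFUNC never returns are dead code.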
For memcpy:
* On ivybridge the new code regresses 9% mean performance versus AVX usage?
* On penryn the new code regresses 18% mean performance versus SSE2 usage?
* On bulldozer the new code regresses 18% mean performance versus AVX usage,
and 3% versus SSE2 usage?
This means that out of 11 hardware configurations the patch regresses 4
and improves 7. That works out to a mean improvement of 14% across the
cases that improved, and a mean degradation of 12% across the cases that
got worse. If all devices are of equal value, then this change is of
mixed benefit.
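The 12% figure can be reproduced from the four regressions quoted above (9%, 18%, 18%, 3%) with a plain arithmetic mean; only the regression inputs come from this thread, the helper itself is just illustration.

```c
#include <assert.h>

/* Arithmetic mean of percent changes; negative values are
   regressions.  The four regressions quoted above (9%, 18%, 18%, 3%)
   average to the 12% mean degradation cited.  */
static double
mean (const double *v, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += v[i];
  return sum / n;
}
```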
This seems like a bad change for Ivybridge, Penryn, and Bulldozer.
Can you explain the loss of performance in terms of the hardware that is
impacted? Why did it do worse there?
Is it possible to limit the change to those key architectures where the
optimizations make a difference? Are you trying to avoid the maintenance
burden of yet another set of optimized routines?
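One way the rep-movsb path could be limited to hardware that benefits is a feature-bit check: Intel advertises fast `rep movsb` via the ERMS flag (CPUID leaf 7, subleaf 0, EBX bit 9). This is only a sketch of the gating idea; whether the selection should key off ERMS, a model list, or something else is exactly the open question here.

```c
#include <stdbool.h>
#include <assert.h>
#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>
#endif

/* Report whether the CPU advertises ERMS (Enhanced REP MOVSB/STOSB):
   CPUID.(EAX=07H,ECX=0):EBX bit 9.  On non-x86 targets, or if leaf 7
   is unavailable, conservatively report false.  */
static bool
has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return (ebx >> 9) & 1;
#endif
  return false;
}
```

A resolver could then return the rep-movsb variant only when `has_erms ()` is true, leaving the remaining configurations on their current fast paths.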
--
Cheers,
Carlos.