This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 0/3] sparc M7 optimized memcpy/memset
On 19/10/2017 14:46, Patrick McGehearty wrote:
> On 10/5/2017 7:19 AM, Adhemerval Zanella wrote:
>>
>> On 04/10/2017 18:07, Patrick McGehearty wrote:
>>> On 9/27/2017 11:41 PM, David Miller wrote:
>>>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>>>> Date: Wed, 27 Sep 2017 16:09:56 -0400
>>>>
>>>>> The following patch set provides optimized versions of
>>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later).
>>>>> Support for recognizing M7 is also provided.
>>>>> An assembly version of memmove for ultra1+ is provided.
>>>>>
>>>>> Jose E. Marchesi (2):
>>>>> sparc: support the ADP hw capability.
>>>>> sparc: assembly version of memmove for ultra1+
>>>>>
>>>>> Patrick McGehearty (1):
>>>>> sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
>>>> Looks good to me.
>>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset
>>> patches to the repository?
>>>
>>> As a new submitter, I don't have permissions.
>> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ?
>> I am finishing up the ifunc C implementation and I plan to send it
>> to review today.
> In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html,
> I see three issues which I will cover here.
>
> Performance:
>
> I reported:
>>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>>> Long word aligned data
>>>> 0-127 bytes - minimal changes
>>>> 128-1023 bytes - 7-30% gain
>>>> 1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>>> Word aligned data
>>>> 0-127 bytes - 50%+ gain
>>>> 128-1023 bytes - 10-200% gain
>>>> 1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>>> Unaligned data
>>>> 0-127 bytes - 0-70%+ gain
>>>> 128-447 bytes - 40-80%+ gain
>>>> 448-511 bytes - 1-3% loss
>>>> 512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>> 4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>>
>>>> For memset/bzero, performance comparison with niagara4 code:
>>>> For memset nonzero data,
>>>> 256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>> 1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>>> For memset zero data (and bzero),
>>>> 256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>> 1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)
>
> Adhemerval asked:
>>Which benchmark did you use to get theses values? If it where obtained
>>with benchtests one please attach the resulting files. If not, please
>>either indicate how to reproduce the data or work to improve benchtests
>>with required datapoints.
>
> A fair question. As noted, the performance comparisions are with the
> niagara4 code for memcpy and memset, the current default for m7/t7/s7
> sparc platforms.
>
> I used a personal perf test suite that I developed over a period of
> years while tuning memcpy/memset for Sparc/Solaris for most Sparc
> platforms going back to UltraSparc IV. The tests are run in isolation
> and not quickly adaptable to the existing Linux bench tests
> structure. I also ran the Linux bench tests. The results from
> the Linux bench tests are generally similar to the numbers shown
> above, with some individual differences due to different ordering
> of test sizes which causes different branch prediction and cache
> warmth behavior. Because these tests give similar results, I see
> no particular need to add my tests to the Linux bench tests.
> I've continued to use my tests because of my long familiarity them.
I presume by 'Linux bench tests' you referring to glibc benchtests,
right? Usually with performance enhancement is usual to send the
benchmark data from improved symbols along with the patch submission.
>
> As possible interest for those designing "out of cache" tests, the
> approach used in my personal tests is to allocate a "cache clearing
> buffer" at least 4 times the size of the largest cache on the system
> under test. In these tests, the L3 cache size is 64 MBytes and my
> buffer size was 256 MBytes. Before each set of timing runs,
> my program wrote new data to each long word address of the
> cache clearing buffer. Then, during the timing test for each
> combination of src and destination alignments and test lengths,
> successive copies were executed with each successive src and dest
> started at least 512 bytes after the end of the previous copy.
> The innermost loop did 20 copies. Each timing test was repeated
> 7 times with the median being used to minimize noise effects
> of other system activity.
If you could improve the mem* benchtests with this suggestion by
sending patches to either benchmark changes or new benchmarks
it will be welcome.
>
>> Adhemerval asked that memcpy and memset be in separate patches
>> within the patch set.
>
> If it is necessary to resubmit the patch set for other reasons,
> I will split memcpy and memset. Without the need to resubmit,
> it feels like 'make work' in my opinion. The optimizations to both
> files are logically similar and driven by the architectural
> changes in the Block Init Store (BIS) instruction.
>
>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>> He has prepared changes to replace all assembly uses of ifunc with C.
>
> The existing ifunc code in that directory uses assembly.
> Adhemerval's changes were not available when I submitted the patch.
> As far as I understand, they are still not available in the main branch.
> Making those changes increases risk for my memcpy/memset optimizations.
> I consider such work, while worthy in its own right, to be out of
> scope for my patch set.
>
> Respectively,
> - Patrick McGehearty
I just sent a patchset to refactor all remaining IFUNC resolver still
in assembly to C, including all the missing sparc ones.
Also, I adjusted your patches to my refactor in personal branch [2] and
I also split the patch in memcpy/memmove and memset/bzero. It simplifies
a lot new ifunc inclusions, for instance the memcpy part is just:
---
diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
index 46f3795..dbdad2d 100644 (file)
--- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
+++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
@@ -19,6 +19,7 @@
#include <ifunc-init.h>
+extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
@@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
static inline void *
IFUNC_SELECTOR (int hwcap)
{
+ if (hwcap & HWCAP_SPARC_ADP)
+ return OPTIMIZE (niagara7);
if (hwcap & HWCAP_SPARC_CRYPTO)
return OPTIMIZE (niagara4);
if (hwcap & HWCAP_SPARC_N2)
---
So if you could help with any review I will be thankful. I would expect
the memcpy/memmove and memset/bzero refactor to be straightforward.
[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7