This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 0/3] sparc M7 optimized memcpy/memset



On 19/10/2017 14:46, Patrick McGehearty wrote:
> On 10/5/2017 7:19 AM, Adhemerval Zanella wrote:
>>
>> On 04/10/2017 18:07, Patrick McGehearty wrote:
>>> On 9/27/2017 11:41 PM, David Miller wrote:
>>>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>>>> Date: Wed, 27 Sep 2017 16:09:56 -0400
>>>>
>>>>> The following patch set provides optimized versions of
>>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later).
>>>>> Support for recognizing M7 is also provided.
>>>>> An assembly version of memmove for ultra1+ is provided.
>>>>>
>>>>> Jose E. Marchesi (2):
>>>>>     sparc: support the ADP hw capability.
>>>>>     sparc: assembly version of memmove for ultra1+
>>>>>
>>>>> Patrick McGehearty (1):
>>>>>     sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
>>>> Looks good to me.
>>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset
>>> patches to the repository?
>>>
>>> As a new submitter, I don't have permissions.
>> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ?
>> I am finishing up the ifunc C implementation and I plan to send it
>> to review today.
> In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html,
> I see three issues which I will cover here.
> 
> Performance:
> 
> I reported:
>>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>>> Long word aligned data
>>>>    0-127 bytes - minimal changes
>>>>    128-1023 bytes - 7-30% gain
>>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>>> Word aligned data
>>>>    0-127 bytes - 50%+ gain
>>>>    128-1023 bytes - 10-200% gain
>>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>>> Unaligned data
>>>>    0-127 bytes - 0-70%+ gain
>>>>    128-447 bytes - 40-80%+ gain
>>>>    448-511 bytes - 1-3% loss
>>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>>
>>>> For memset/bzero, performance comparison with niagara4 code:
>>>> For memset nonzero data,
>>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>>> For memset zero data (and bzero),
>>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)
> 
> Adhemerval asked:
>>Which benchmark did you use to get these values? If they were obtained
>>with benchtests, please attach the resulting files.  If not, please
>>either indicate how to reproduce the data or work to improve benchtests
>>with the required datapoints.
> 
> A fair question. As noted, the performance comparisons are with the
> niagara4 code for memcpy and memset, the current default for m7/t7/s7
> sparc platforms.
> 
> I used a personal perf test suite that I developed over a period of
> years while tuning memcpy/memset for Sparc/Solaris for most Sparc
> platforms going back to UltraSparc IV. The tests are run in isolation
> and not quickly adaptable to the existing Linux bench tests
> structure. I also ran the Linux bench tests. The results from
> the Linux bench tests are generally similar to the numbers shown
> above, with some individual differences due to different ordering
> of test sizes which causes different branch prediction and cache
> warmth behavior. Because these tests give similar results, I see
> no particular need to add my tests to the Linux bench tests.
> I've continued to use my tests because of my long familiarity with them.

I presume by 'Linux bench tests' you are referring to the glibc
benchtests, right? With a performance enhancement it is usual to send
the benchmark data for the improved symbols along with the patch
submission.

> 
> As a possible point of interest for those designing "out of cache" tests, the
> approach used in my personal tests is to allocate a "cache clearing
> buffer" at least 4 times the size of the largest cache on the system
> under test. In these tests, the L3 cache size is 64 MBytes and my
> buffer size was 256 MBytes. Before each set of timing runs,
> my program wrote new data to each long word address of the
> cache clearing buffer. Then, during the timing test for each
> combination of src and destination alignments and test lengths,
> successive copies were executed with each successive src and dest
> started at least 512 bytes after the end of the previous copy.
> The innermost loop did 20 copies. Each timing test was repeated
> 7 times with the median being used to minimize noise effects
> of other system activity.

If you could improve the mem* benchtests along these lines by sending
patches with either benchmark changes or new benchmarks, it would be
welcome.

> 
>> Adhemerval asked that memcpy and memset be in separate patches
>> within the patch set.
> 
> If it is necessary to resubmit the patch set for other reasons,
> I will split memcpy and memset. Without the need to resubmit,
> it feels like 'make work' in my opinion. The optimizations to both
> files are logically similar and driven by the architectural
> changes in the Block Init Store (BIS) instruction.
> 
>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>> He has prepared changes to replace all assembly uses of ifunc with C.
> 
> The existing ifunc code in that directory uses assembly.
> Adhemerval's changes were not available when I submitted the patch.
> As far as I understand, they are still not available in the main branch.
> Making those changes increases risk for my memcpy/memset optimizations.
> I consider such work, while worthy in its own right, to be out of
> scope for my patch set.
> 
> Respectfully,
> - Patrick McGehearty

I just sent a patchset to refactor all remaining IFUNC resolvers still
in assembly to C, including all the missing sparc ones.

Also, I adjusted your patches to my refactor in a personal branch [1] and
split the patch into memcpy/memmove and memset/bzero parts. It simplifies
new ifunc inclusions a lot; for instance, the memcpy part is just:

---
diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
index 46f3795..dbdad2d 100644
--- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
+++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
@@ -19,6 +19,7 @@
 
 #include <ifunc-init.h>
 
+extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
@@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
 static inline void *
 IFUNC_SELECTOR (int hwcap)
 {
+  if (hwcap & HWCAP_SPARC_ADP)
+    return OPTIMIZE (niagara7);
   if (hwcap & HWCAP_SPARC_CRYPTO)
     return OPTIMIZE (niagara4);
   if (hwcap & HWCAP_SPARC_N2)
---

So if you could help with any review I will be thankful. I would expect
the memcpy/memmove and memset/bzero refactor to be straightforward.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7

