This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 3/3] sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.



On 28/09/2017 16:43, Patrick McGehearty wrote:
> Responses after the comments. Apologies for the earlier incomplete message.
> 
> On 9/28/2017 11:17 AM, Adhemerval Zanella wrote:
>>
>> On 27/09/2017 13:09, Patrick McGehearty wrote:
>>> Tested in sparcv9-*-* and sparc64-*-* targets in both multi and
>>> non-multi arch configurations.
>>>
>>> Support added to identify Sparc M7/T7/S7/M8/T8 processor capability.
>>> Usual "make check" correctness tests run with no regressions.
>>> Performance tests run on Sparc S7 using new code and old niagara4 code.
>>>
>>> Optimizations for memcpy also apply to mempcpy and memmove
>>> where they share code. Optimizations for memset also apply
>>> to bzero as they share code.
>>>
>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>> Long word aligned data
>>>    0-127 bytes - minimal changes
>>>    128-1023 bytes - 7-30% gain
>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>> Word aligned data
>>>    0-127 bytes - 50%+ gain
>>>    128-1023 bytes - 10-200% gain
>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>> Unaligned data
>>>    0-127 bytes - 0-70%+ gain
>>>    128-447 bytes - 40-80%+ gain
>>>    448-511 bytes - 1-3% loss
>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>
>>> For memset/bzero, performance comparison with niagara4 code:
>>> For memset nonzero data,
>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>> For memset zero data (and bzero),
>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)

Which benchmark did you use to get theses values? If it where obtained
with benchtests one please attach the resulting files.  If not, please
either indicate how to reproduce the data or work to improve benchtests
with required datapoints.

>>> ---
>>>   ChangeLog                                          |   20 +
>>>   sysdeps/sparc/sparc32/sparcv9/multiarch/Makefile   |    3 +-
>>>   .../sparcv9/multiarch/memcpy-memmove-niagara7.S    |    2 +
>>>   sysdeps/sparc/sparc32/sparcv9/multiarch/memmove.S  |    2 +
>>>   .../sparc32/sparcv9/multiarch/memset-niagara7.S    |    2 +
>>>   .../sparc/sparc32/sparcv9/multiarch/rtld-memmove.c |    1 +
>>>   sysdeps/sparc/sparc64/multiarch/Makefile           |    3 +-
>>>   sysdeps/sparc/sparc64/multiarch/ifunc-impl-list.c  |   13 +
>>>   .../sparc64/multiarch/memcpy-memmove-niagara7.S    |  979 ++++++++++++++++++++
>>>   sysdeps/sparc/sparc64/multiarch/memcpy.S           |   28 +-
>>>   sysdeps/sparc/sparc64/multiarch/memmove.S          |   72 ++
>>>   sysdeps/sparc/sparc64/multiarch/memset-niagara7.S  |  334 +++++++
>>>   sysdeps/sparc/sparc64/multiarch/memset.S           |   28 +-
>>>   sysdeps/sparc/sparc64/multiarch/rtld-memmove.c     |    1 +
>>>   14 files changed, 1482 insertions(+), 6 deletions(-)
>>>   create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memcpy-memmove-niagara7.S
>>>   create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memmove.S
>>>   create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/memset-niagara7.S
>>>   create mode 100644 sysdeps/sparc/sparc32/sparcv9/multiarch/rtld-memmove.c
>>>   create mode 100644 sysdeps/sparc/sparc64/multiarch/memcpy-memmove-niagara7.S
>>>   create mode 100644 sysdeps/sparc/sparc64/multiarch/memmove.S
>>>   create mode 100644 sysdeps/sparc/sparc64/multiarch/memset-niagara7.S
>>>   create mode 100644 sysdeps/sparc/sparc64/multiarch/rtld-memmove.c
>> I would prefer if you split this patch in a memcpy/mempcpy/memmove and
>> a memset/bzero one since they are separated implementations.
> I can see your point, and if I had started that way, it would not have been a problem.  At this point,
> it would take some effort to redo the patch structure at this point given our internal development
> process and might significantly delay the completion of this patch due to requirements to make
> progress on other tasks. From a user's point of view, the optimizations for memcpy on M7/T7/S7
> and memset on M7/T7/S7 both are driven by the use of block initializing store as a 64 byte operation
> instead of a 32 byte operation.  (side note: block initializing store on M8/T8 is also 64 bytes, so these
> optimizations are also appropriate for the new M8/T8 systems). Thus, a user should be equally
> willing to use both or neither of the new memcpy/memset optimizations.

This is not really a good reason to *not* structure better the 
pachset, and the change does not relly require extensive changes
(rather to move the memset bits not another patch). Please split
the patch in future submission.

> 
> 
>>
>> I also noticed this whitespace issues, so you might want to check this out:
>>
>> ---
>>
>> patch03:886: trailing whitespace.
>>   *      lines from memory. Use ST_CHUNK stores to first element of each cache
>> patch03:890: space before tab in indent.
>>          andn    %o2, 0x3f, %o5          /* %o5 is multiple of block size  */
>> patch03:891: space before tab in indent.
>>          and     %o2, 0x3f, %o2          /* residue bytes in %o2  */
>> patch03:892: trailing whitespace.
>>   patch03:1216: new blank line at EOF.
>> +
>> warning: 5 lines add whitespace errors.
>> ---
> I've cleaned up the whitespace issues. Thank you for catching them.
> 
>>
>>> diff --git a/sysdeps/sparc/sparc64/multiarch/memcpy.S b/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> index b6396ee..d72f4b1 100644
>>> --- a/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> +++ b/sysdeps/sparc/sparc64/multiarch/memcpy.S
>>> @@ -27,7 +27,19 @@ ENTRY(memcpy)
>>>   # ifdef SHARED
>>>       SETUP_PIC_REG_LEAF(o3, o5)
>>>   # endif
>>> -    set    HWCAP_SPARC_CRYPTO, %o1
>>> +    set    HWCAP_SPARC_ADP, %o1
>>> +    andcc    %o0, %o1, %g0
>>> +    be    1f
>>> +     nop
>>> +# ifdef SHARED
>>> +    sethi    %gdop_hix22(__memcpy_niagara7), %o1
>>> +    xor    %o1, %gdop_lox10(__memcpy_niagara7), %o1
>>> +# else
>>> +    set    __memcpy_niagara7, %o1
>>> +# endif
>>> +    ba    10f
>>> +     nop
>>> +1:      set    HWCAP_SPARC_CRYPTO, %o1
>>>       andcc    %o0, %o1, %g0
>>>       be    1f
>>>        andcc    %o0, HWCAP_SPARC_N2, %g0
>>> @@ -89,7 +101,19 @@ ENTRY(__mempcpy)
>>>   # ifdef SHARED
>>>       SETUP_PIC_REG_LEAF(o3, o5)
>>>   # endif
>>> -    set    HWCAP_SPARC_CRYPTO, %o1
>>> +    set     HWCAP_SPARC_ADP, %o1
>>> +    andcc   %o0, %o1, %g0
>>> +        be      1f
>>> +     nop
>>> +# ifdef SHARED
>>> +    sethi    %gdop_hix22(__mempcpy_niagara7), %o1
>>> +    xor    %o1, %gdop_lox10(__mempcpy_niagara7), %o1
>>> +# else
>>> +    set    __mempcpy_niagara7, %o1
>>> +# endif
>>> +    ba      10f
>>> +     nop
>>> +1:    set    HWCAP_SPARC_CRYPTO, %o1
>>>       andcc    %o0, %o1, %g0
>>>       be    1f
>>>        andcc    %o0, HWCAP_SPARC_N2, %g0
>> Since you are touching this file, I think it would be better to change the
>> ifunc selector to a C implementation by using the __ifunc/sparc_libc_ifunc
>> macros.
> Again, I agree philosophically, but the actual code changes for ADP were made many months ago.
> Revising those changes now would require more work with little perceived benefit.

SPARC is the only one still using ifunc resolvers coded in assembly and 
currently there is no gain in continuing doing so. You can rebase your next
version against azanella/sparc64-ifunc [1] which I intend to send upstream
shortly.

It refactors sparc64 memcpy and memset ifunc selection current implementation
to C. I think I get the selector logic correct and I checked on a UltraSparc T5, 
so please tell if I got something wrong.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/sparc64-ifunc


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]