This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 3/4] sparc: Use default memcpy for rtld objects



On 05/10/2017 13:49, David Miller wrote:
> From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> Date: Thu,  5 Oct 2017 10:51:11 -0300
> 
>> Both SPARC support multiarch platforms (sparcv9 and sparc64) have the
>> a default assembly implemented memcpy.  Since it should not be any
>> restriction about it them on the loader object and assuming they are
>> faster than generic ones this patch uses them for rtld objects.
>>
>> Also, there is no indication neither on original patch [1] or in commit
>> message why the generic one where used instead of the sparc optimized
>> ones.
> 
> The ultra1 memcpy is really an extremely non-ideal variant to use as
> the default for anything.
> 
> It's much slower on newer cpus, as the block loads and stores used in
> the ultra1 version aren't optimized the same way they were in those
> older chips.
> 
> The C version is faster on newer cpus and definitely a better choice
> as a default, especially because it doesn't use any cpu specific
> instructions like the ultra1 variant does.
> 
> In the Linux kernel we have an assembler version we use as the default
> which doesn't use any special instructions.

Thanks for the explanation, although it does not explain why the ultra1
variant is currently the default for sparc64
(sysdeps/sparc/sparc64/memcpy.S) and also the default selection for
multiarch.  The C version is currently used solely for the loader.

I compared the performance of the C implementation against the ultra1
one on a niagara5, and the results are:

  - on bench-memcpy the C version is slightly slower for sizes up to
    32 (about 4% faster for sizes up to 16, 40% from 16 to 32 and
    50% up to 32).  It is definitely faster for sizes larger than
    64 (62% faster for sizes from 64 to 128 and 85% for sizes
    larger than 128).

  - bench-memcpy-random shows no performance difference; however,
    bench-memcpy-large shows the C implementation is indeed faster
    for all inputs.

So I think that instead of using the default memcpy for rtld, the best
strategy would be to use the C implementation as the default and
add ultra1 as another option for ifunc resolution.
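
To illustrate the selection scheme proposed above, here is a minimal
sketch in plain C.  It is not glibc's actual ifunc machinery: the
capability bit HYPOTHETICAL_CAP_ULTRA1, the function names, and the
condition under which ultra1 would be picked are all assumptions made
for the example, and both variants are stand-ins rather than the real
implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capability bit, invented for this sketch; it is not a
   real glibc hwcap name.  */
#define HYPOTHETICAL_CAP_ULTRA1 0x1u

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-in for the generic C implementation: a plain byte loop.  */
static void *
memcpy_generic_c (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}

/* Stand-in for the ultra1 assembly variant; here it only forwards to
   the library memcpy.  */
static void *
memcpy_ultra1_standin (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Resolver-style selection: the C version is the default, and the
   ultra1 variant is returned only when the (assumed) capability bit
   is set.  */
static memcpy_fn
select_memcpy (unsigned int caps)
{
  if (caps & HYPOTHETICAL_CAP_ULTRA1)
    return memcpy_ultra1_standin;
  return memcpy_generic_c;
}

/* Convenience wrapper: resolve, then copy.  */
static void *
copy_with_caps (unsigned int caps, void *dst, const void *src, size_t n)
{
  memcpy_fn fn = select_memcpy (caps);
  assert (fn != NULL);
  return fn (dst, src, n);
}
```

With no capability bits set, select_memcpy (0) returns the C stand-in,
so a context that cannot rely on CPU detection would fall back to the
generic version.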


Attachment: bench-memcpy-random-sparc64.out
Description: Text document

Attachment: bench-memcpy-sparc64.out
Description: Text document

Attachment: bench-memcpy-large-sparc64.out
Description: Text document

