
Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.


On 09/11/2013 12:35 PM, Ondřej Bílka wrote:
> On Mon, Sep 09, 2013 at 04:59:21PM -0400, Carlos O'Donell wrote:
>> On 09/09/2013 12:11 PM, Ondřej Bílka wrote:
>>> This is the actual implementation. We use an optimized header that makes
>>> calls around 50 cycles faster on Nehalem and Ivy Bridge.
>>>
>>> Currently this improves strcpy, stpcpy and strcat; I keep the old
>>> implementation of strncpy/strncat.
>>>
>>> The header that I use improves speed by 10% on most processors for a gcc
>>> workload. Separate loops that use ssse3/shifts are needed, as this
>>> implementation is slower on large sizes for processors without fast
>>> unaligned loads.
>>>
>>> Results were obtained by following benchmark:
>>>
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile90913.tar.bz2
>>  
>> The benchmark numbers are great. I appreciate you running the various
>> tests against the old, new, and ssse3 implementations.
>>
>> Does the glibc microbenchmark show a performance increase also or are
>> we still lacking the requisite framework to measure these changes?
>>
> There are several areas lacking; one is that calling the function in a
> tight loop does not take the effect of branch prediction into account.
> Numbers from benchtests tend to be off by a large amount due to the lack
> of randomization.
> 
> For example, strcpy-ssse3.S handles the first 16 bytes with the following
> code:
> 
>   cmpb  $0, (%rcx)
>   jz  L(Exit1)
>   cmpb  $0, 1(%rcx)
>   jz  L(Exit2)
>   cmpb  $0, 2(%rcx)
>   jz  L(Exit3)
>   cmpb  $0, 3(%rcx)
>   jz  L(Exit4)
>   cmpb  $0, 4(%rcx)
>   jz  L(Exit5)
>   cmpb  $0, 5(%rcx)
>   jz  L(Exit6)
>   cmpb  $0, 6(%rcx)
>   jz  L(Exit7)
>   cmpb  $0, 7(%rcx)
>   jz  L(Exit8)
>  ...
> 
> When the size varies this degrades performance, but the benchmarks do not
> catch it.

This looks like it shouldn't be too hard to fix in the current benchmarks.
Do you have any suggestions or patches?
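
To make sure I understand the effect, here is a rough sketch of the kind of
randomization I think you mean (purely illustrative, not the existing
benchtests code; the iteration counts, buffer sizes and timing helper are
all made up for the example):

  /* Illustrative sketch only, NOT the glibc benchtests harness: it
     contrasts timing strcpy on one fixed string, where the branch
     predictor learns where the terminating NUL is, with timing it on
     strings of randomized length.  */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define ITERS (1 << 20)
  #define NSTR 256
  #define MAXLEN 64

  static char srcs[NSTR][MAXLEN + 1];
  static unsigned char which[ITERS];
  static char dst[MAXLEN + 1];

  static double
  now (void)
  {
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int
  main (void)
  {
    volatile char sink = 0;

    /* Precompute source strings of random length and a random schedule,
       so the randomization itself is not part of the timed loops.  */
    srand (1);
    for (int i = 0; i < NSTR; i++)
      {
        size_t len = 1 + (size_t) (rand () % MAXLEN);
        memset (srcs[i], 'x', len);
        srcs[i][len] = '\0';
      }
    for (int i = 0; i < ITERS; i++)
      which[i] = (unsigned char) (rand () % NSTR);

    /* Fixed input: the terminating NUL is always in the same place, so
       the early-exit branches become perfectly predicted.  */
    double t0 = now ();
    for (int i = 0; i < ITERS; i++)
      {
        strcpy (dst, srcs[0]);
        sink ^= dst[0];
      }
    double fixed = now () - t0;

    /* Randomized input: the NUL position changes from call to call, so
       the early-exit branches are no longer trivially predictable.  */
    t0 = now ();
    for (int i = 0; i < ITERS; i++)
      {
        strcpy (dst, srcs[which[i]]);
        sink ^= dst[0];
      }
    double randomized = now () - t0;

    printf ("fixed: %.3fs  randomized: %.3fs\n", fixed, randomized);
    return 0;
  }

With the fixed input the early-exit branches in the header you quoted are
perfectly predicted after a few iterations, which is exactly what a tight
benchmark loop hides.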

> There is another problem: how to compare old and new implementations. You
> have two tables of results, before and after, and now you want to compare
> them.
> 
> This is a design problem. The main use case of the benchmarks is comparing
> implementations, and that should be easy to do without tedious tasks like
> adding functions to the makefile and ifunc-impl-list, renaming them,
> recompiling libc, and running the benchmarks.

I don't disagree. We need automation to compare two runs of the benchmarks;
having that in place is important for doing these comparisons at all.
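
Just to be concrete about what I mean by automation (this is a hypothetical
sketch; it assumes each run has been dumped to a plain "function-name
mean-time" text file, which is not necessarily the real benchtests output
format), a comparison tool could start as small as:

  /* Hypothetical sketch: diff two benchmark runs saved as lines of
     "function-name mean-time" and print the relative change for every
     function present in both runs.  */
  #include <stdio.h>
  #include <string.h>

  #define MAXFUNCS 256

  struct result { char name[64]; double time; };

  static int
  read_results (const char *path, struct result *res, int max)
  {
    FILE *f = fopen (path, "r");
    if (f == NULL)
      return -1;
    int n = 0;
    while (n < max && fscanf (f, "%63s %lf", res[n].name, &res[n].time) == 2)
      n++;
    fclose (f);
    return n;
  }

  int
  main (int argc, char **argv)
  {
    static struct result before[MAXFUNCS], after[MAXFUNCS];

    if (argc != 3)
      {
        fprintf (stderr, "usage: %s before.txt after.txt\n", argv[0]);
        return 1;
      }
    int nb = read_results (argv[1], before, MAXFUNCS);
    int na = read_results (argv[2], after, MAXFUNCS);
    if (nb < 0 || na < 0)
      {
        perror ("read_results");
        return 1;
      }

    for (int i = 0; i < nb; i++)
      for (int j = 0; j < na; j++)
        if (strcmp (before[i].name, after[j].name) == 0)
          printf ("%-24s %10.2f -> %10.2f  (%+.1f%%)\n",
                  before[i].name, before[i].time, after[j].time,
                  100.0 * (after[j].time - before[i].time) / before[i].time);
    return 0;
  }

Something along those lines, driven from the benchtests makefile, would let
us diff a before/after pair without any manual editing.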

Cheers,
Carlos.

