This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>, Siddhesh Poyarekar <siddhesh at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Sat, 28 Sep 2013 15:40:10 -0400
- Subject: Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
- Authentication-results: sourceware.org; auth=none
- References: <20130909153051 dot GA23047 at domone dot kolej dot mff dot cuni dot cz> <20130909161112 dot GB23047 at domone dot kolej dot mff dot cuni dot cz> <522E36A9 dot 8040100 at redhat dot com> <20130911163551 dot GB7675 at domone dot kolej dot mff dot cuni dot cz>
On 09/11/2013 12:35 PM, Ondřej Bílka wrote:
> On Mon, Sep 09, 2013 at 04:59:21PM -0400, Carlos O'Donell wrote:
>> On 09/09/2013 12:11 PM, Ondřej Bílka wrote:
>>> This is the actual implementation. We use an optimized header that makes
>>> calls around 50 cycles faster on Nehalem and Ivy Bridge.
>>>
>>> Currently this improves strcpy, stpcpy, and strcat; I keep the old
>>> implementation of strncpy/strncat.
>>>
>>> The header that I use improves speed by 10% on most processors for a gcc
>>> workload. Separate loops that use ssse3/shifts are needed, as this
>>> implementation is slower at large sizes on processors without fast
>>> unaligned loads.
>>>
>>> Results were obtained by following benchmark:
>>>
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile90913.tar.bz2
>>
>> The benchmark numbers are great. I appreciate you running the various
>> tests against the old, new, and sse3 implementations.
>>
>> Does the glibc microbenchmark show a performance increase also or are
>> we still lacking the requisite framework to measure these changes?
>>
> There are several areas lacking; one is that calling the function in a
> tight loop does not take the effect of branch prediction into account.
> Numbers from benchtests tend to be off by a large amount due to the lack
> of randomization.
>
> For example, strcpy-ssse3.S handles the first 16 bytes with the following code:
>
> cmpb $0, (%rcx)
> jz L(Exit1)
> cmpb $0, 1(%rcx)
> jz L(Exit2)
> cmpb $0, 2(%rcx)
> jz L(Exit3)
> cmpb $0, 3(%rcx)
> jz L(Exit4)
> cmpb $0, 4(%rcx)
> jz L(Exit5)
> cmpb $0, 5(%rcx)
> jz L(Exit6)
> cmpb $0, 6(%rcx)
> jz L(Exit7)
> cmpb $0, 7(%rcx)
> jz L(Exit8)
> ...
>
> When the size varies, this will degrade performance, but the benchmarks
> do not catch it.
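The failure mode described above can be sketched in C: a run with one fixed length keeps the prologue's compare-and-branch sequence perfectly predicted, while random lengths defeat the predictor. This is an illustrative harness, not the actual benchtests code; the 64-byte cap, the fixed length of 8, and the use of rand() are arbitrary choices for the sketch:

```c
#include <stdlib.h>
#include <string.h>

/* Copy a string of the given length through strcpy and return the
   length actually copied, so the call cannot be optimized away.
   Lengths up to 62 keep us inside the short-string prologue being
   discussed.  */
size_t copy_once(char *dst, char *src, size_t len)
{
    memset(src, 'x', len);
    src[len] = '\0';
    strcpy(dst, src);
    return strlen(dst);
}

/* Sum of copied lengths over ITERS calls.  With fixed != 0 every call
   uses the same length, which is branch-predictor friendly; otherwise
   the length is drawn from rand(), closer to what a real workload
   such as gcc does.  Timing the two runs (e.g. with clock_gettime)
   exposes the difference the benchtests currently miss.  */
size_t run_bench(int fixed, unsigned iters)
{
    char src[64], dst[64];
    size_t total = 0;
    for (unsigned i = 0; i < iters; i++) {
        size_t len = fixed ? 8 : (size_t)(rand() % 63);
        total += copy_once(dst, src, len);
    }
    return total;
}
```

On an implementation with a byte-by-byte prologue like the one quoted, the randomized run should be measurably slower per call than the fixed-length run, even though both copy short strings.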
This looks like it shouldn't be too hard to fix in the current benchmarks;
do you have any suggestions or patches?
> There is another problem: how to compare the old and new implementations.
> You have two tables of results, before and after, and now you want to
> compare them.
>
> This is a design problem: the main use case of the benchmarks is comparing
> implementations, so it should be easy to do without tedious tasks like
> adding functions to the makefile and ifunc-impl-list, renaming them,
> recompiling libc, and running the benchmarks.
I don't disagree. We need automation to be able to compare two runs
of the benchmarks. It's important to have this in order to do comparisons.
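As a strawman for that automation, a comparator could pair up corresponding lines from a before-run and an after-run and report the ratio. The two-column "name time" format below is an assumption for the sketch, not the actual benchtests output format:

```c
#include <stdio.h>
#include <string.h>

/* Parse one "name time" line from each of two benchmark runs and
   compute the after/before ratio (< 1.0 means the new implementation
   is faster).  Returns 0 on success, -1 if a line is malformed or the
   function names do not match.  */
int compare_lines(const char *before, const char *after,
                  char *name, size_t name_size, double *ratio)
{
    char n1[64], n2[64];
    double t1, t2;

    if (sscanf(before, "%63s %lf", n1, &t1) != 2
        || sscanf(after, "%63s %lf", n2, &t2) != 2
        || strcmp(n1, n2) != 0
        || t1 <= 0.0)
        return -1;

    snprintf(name, name_size, "%s", n1);
    *ratio = t2 / t1;
    return 0;
}
```

Looping this over two result files and flagging ratios above 1.0 would give a quick regression report without touching the makefile or ifunc-impl-list at all.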
Cheers,
Carlos.