This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>, Siddhesh Poyarekar <siddhesh at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Sat, 28 Sep 2013 15:40:10 -0400
- Subject: Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
- Authentication-results: sourceware.org; auth=none
- References: <20130909153051 dot GA23047 at domone dot kolej dot mff dot cuni dot cz> <20130909161112 dot GB23047 at domone dot kolej dot mff dot cuni dot cz> <522E36A9 dot 8040100 at redhat dot com> <20130911163551 dot GB7675 at domone dot kolej dot mff dot cuni dot cz>
On 09/11/2013 12:35 PM, Ondřej Bílka wrote:
> On Mon, Sep 09, 2013 at 04:59:21PM -0400, Carlos O'Donell wrote:
>> On 09/09/2013 12:11 PM, Ondřej Bílka wrote:
>>> This is the actual implementation. We use an optimized header that makes
>>> calls around 50 cycles faster on Nehalem and Ivy Bridge.
>>>
>>> Currently this improves strcpy, stpcpy, and strcat; I keep the old
>>> implementation of strncpy/strncat.
>>>
>>> The header that I use improves speed by 10% on most processors for a gcc
>>> workload. Separate loops that use ssse3/shifts are needed, as this
>>> implementation is slower at large sizes on processors without fast
>>> unaligned loads.
>>>
>>> Results were obtained by following benchmark:
>>>
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
>>> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile90913.tar.bz2
>>
>> The benchmark numbers are great. I appreciate you running the various
>> tests against the old, new, and sse3 implementations.
>>
>> Does the glibc microbenchmark show a performance increase also or are
>> we still lacking the requisite framework to measure these changes?
>>
> There are several areas lacking; one is that calling the function in a
> tight loop does not take the effect of branch prediction into account.
> Numbers from benchtests tend to be off by a large amount due to the lack
> of randomization.
>
> For example, strcpy-ssse3.S handles the first 16 bytes with the following code:
>
> cmpb $0, (%rcx)
> jz L(Exit1)
> cmpb $0, 1(%rcx)
> jz L(Exit2)
> cmpb $0, 2(%rcx)
> jz L(Exit3)
> cmpb $0, 3(%rcx)
> jz L(Exit4)
> cmpb $0, 4(%rcx)
> jz L(Exit5)
> cmpb $0, 5(%rcx)
> jz L(Exit6)
> cmpb $0, 6(%rcx)
> jz L(Exit7)
> cmpb $0, 7(%rcx)
> jz L(Exit8)
> ...
>
> When the size varies, this will degrade performance, but the benchmarks
> do not catch it.
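The failure mode described above can be sketched in C: a run with one fixed length keeps the prologue's compare-and-branch sequence perfectly predicted, while random lengths defeat the predictor. This is an illustrative harness, not the actual benchtests code; the 64-byte cap, the fixed length of 8, and the use of rand() are arbitrary choices for the sketch:

```c
#include <stdlib.h>
#include <string.h>

/* Copy a string of the given length through strcpy and return the
   length actually copied, so the call cannot be optimized away.
   Lengths up to 62 keep us inside the short-string prologue being
   discussed.  */
size_t copy_once(char *dst, char *src, size_t len)
{
    memset(src, 'x', len);
    src[len] = '\0';
    strcpy(dst, src);
    return strlen(dst);
}

/* Sum of copied lengths over ITERS calls.  With fixed != 0 every call
   uses the same length, which is branch-predictor friendly; otherwise
   the length is drawn from rand(), closer to what a real workload
   such as gcc does.  Timing the two runs (e.g. with clock_gettime)
   exposes the difference the benchtests currently miss.  */
size_t run_bench(int fixed, unsigned iters)
{
    char src[64], dst[64];
    size_t total = 0;
    for (unsigned i = 0; i < iters; i++) {
        size_t len = fixed ? 8 : (size_t)(rand() % 63);
        total += copy_once(dst, src, len);
    }
    return total;
}
```

On an implementation with a byte-by-byte prologue like the one quoted, the randomized run should be measurably slower per call than the fixed-length run, even though both copy short strings.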
This looks like it shouldn't be too hard to fix in the current benchmarks;
do you have any suggestions or patches?
> There is another problem: how to compare the old and new implementations.
> You have two tables of results, before and after, and now you want to
> compare them.
>
> This is a design problem: the main use case of the benchmarks is comparing
> implementations, so it should be easy to do without tedious tasks like
> adding functions to the makefile and ifunc-impl-list, renaming them,
> recompiling libc, and running the benchmarks.
I don't disagree. We need automation to be able to compare two runs
of the benchmarks. It's important to have this in order to do comparisons.
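As a strawman for that automation, a comparator could pair up corresponding lines from a before-run and an after-run and report the ratio. The two-column "name time" format below is an assumption for the sketch, not the actual benchtests output format:

```c
#include <stdio.h>
#include <string.h>

/* Parse one "name time" line from each of two benchmark runs and
   compute the after/before ratio (< 1.0 means the new implementation
   is faster).  Returns 0 on success, -1 if a line is malformed or the
   function names do not match.  */
int compare_lines(const char *before, const char *after,
                  char *name, size_t name_size, double *ratio)
{
    char n1[64], n2[64];
    double t1, t2;

    if (sscanf(before, "%63s %lf", n1, &t1) != 2
        || sscanf(after, "%63s %lf", n2, &t2) != 2
        || strcmp(n1, n2) != 0
        || t1 <= 0.0)
        return -1;

    snprintf(name, name_size, "%s", n1);
    *ratio = t2 / t1;
    return 0;
}
```

Looping this over two result files and flagging ratios above 1.0 would give a quick regression report without touching the makefile or ifunc-impl-list at all.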
Cheers,
Carlos.