This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [RFC] Improve strcpy: Faster ssse3 version.

From: "Carlos O'Donell" <carlos at redhat dot com>
To: Ondřej Bílka <neleai at seznam dot cz>
Cc: Andreas Schwab <schwab at linux-m68k dot org>, libc-alpha at sourceware dot org
Date: Tue, 10 Sep 2013 11:36:31 -0400
Subject: Re: [RFC] Improve strcpy: Faster ssse3 version.
Authentication-results: sourceware.org; auth=none
References: <20130909153051 dot GA23047 at domone dot kolej dot mff dot cuni dot cz> <20130909161112 dot GB23047 at domone dot kolej dot mff dot cuni dot cz> <mvmbo42dkiq dot fsf at hawking dot suse dot de> <20130909171703 dot GA32141 at domone dot kolej dot mff dot cuni dot cz> <87ob81c1yk dot fsf at igel dot home> <20130909191829 dot GA997 at domone dot kolej dot mff dot cuni dot cz> <522E28E9 dot 5000709 at redhat dot com> <20130910142117 dot GB6536 at domone dot kolej dot mff dot cuni dot cz> <20130910151948 dot GA5337 at domone dot kolej dot mff dot cuni dot cz>

On 09/10/2013 11:19 AM, Ondřej Bílka wrote:
> Hi, 
> 
> I also wrote an ssse3 version with the same optimized header.
> 
> On core2 and xeon it has similar performance to unaligned loads for small inputs
> and is slightly faster than the current ssse3 version on large inputs.
> http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strcpy_profile/results_rand/result.html
> These factors cause this implementation to be 20% faster when profiling in
> block mode. These inputs are a bit atypical, as most time is spent in bash and
> it copies quite large strings, which makes the ssse3 version faster than the
> unaligned one.
> http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strcpy_profile/results_gcc/result.html
> 
> The change is in the loop and in the code to set up and clean up the loop, so
> what is the best way to add this?

I don't understand. What tradeoff are you looking at?

> Also, there would be a third implementation that avoids ssse3 by mechanically
> replacing palignr with shifts. How should I incorporate that?

That's a good question, and I don't have a good answer.

> Currently I use separate files which are almost identical, diff is below.
> 
> Comments?

We have two options as I see it.

Macroize everything and use templates with different macro implementations.

or

Use assembler functions.

My preference, from working in the Linux kernel, is to use assembler
functions (which are used extensively in several ports). I've always felt
that assembler functions are easier to write and maintain than long,
rambling, backslash-delimited macros that follow the often annoying cpp
macro rules. You still have template files, but you use macros for big
functions, and single-instruction replacements can still use a cpp macro.
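
To make the template idea concrete, here is a minimal sketch in plain C
rather than assembly. It is not taken from the patch under discussion: the
names combine_bytewise, combine_shifts, DEFINE_GATHER, gather_bytewise and
gather_shifts are hypothetical, and 64-bit words stand in for 16-byte SSE
registers. One loop body is written once and stamped out twice. The only
per-variant piece is the "combine" step, which plays the role that palignr
(or its shift-based replacement) plays in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variant 1 (hypothetical): stand-in for a single "align" instruction.
   Returns the 8 bytes starting at byte offset n (1..7) of the 16-byte
   little-endian pair hi:lo, picking the bytes one at a time.  */
static inline uint64_t
combine_bytewise (uint64_t lo, uint64_t hi, unsigned n)
{
  uint64_t out = 0;
  for (unsigned i = 0; i < 8; i++)
    {
      unsigned pos = n + i;	/* Byte index within the 16-byte pair.  */
      uint64_t byte = pos < 8 ? (lo >> (8 * pos)) & 0xff
			      : (hi >> (8 * (pos - 8))) & 0xff;
      out |= byte << (8 * i);
    }
  return out;
}

/* Variant 2 (hypothetical): the same result from two shifts and an OR.
   Only valid for n in 1..7, since a shift by 64 is undefined in C.  */
static inline uint64_t
combine_shifts (uint64_t lo, uint64_t hi, unsigned n)
{
  return (lo >> (8 * n)) | (hi << (64 - 8 * n));
}

/* Template: the loop is written once; COMBINE is the per-variant part.
   SRC must hold NWORDS + 1 words because each step reads src[i + 1].  */
#define DEFINE_GATHER(name, COMBINE)					\
  void									\
  name (uint64_t *dst, const uint64_t *src, size_t nwords,		\
	unsigned offset)						\
  {									\
    assert (offset >= 1 && offset <= 7);				\
    for (size_t i = 0; i < nwords; i++)					\
      dst[i] = COMBINE (src[i], src[i + 1], offset);			\
  }

DEFINE_GATHER (gather_bytewise, combine_bytewise)
DEFINE_GATHER (gather_shifts, combine_shifts)
```

Both generated functions should produce identical output for any offset
from 1 to 7. In an assembly source the same split would apply: the large
shared body lives in one template, and only the small replaceable step is
a cpp macro.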

Comments?

Cheers,
Carlos.

References:
- [PATCH 1/2] Improve strcpy: Rename strcpy-sse2-unaligned.S.
  - From: Ondřej Bílka
- [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Ondřej Bílka
- Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Andreas Schwab
- Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Ondřej Bílka
- Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Andreas Schwab
- Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Ondřej Bílka
- Re: [PATCH 2/2] Improve strcpy: Faster unaligned loads.
  - From: Carlos O'Donell
- Re: [PATCH v2] Improve strcpy: Faster unaligned loads.
  - From: Ondřej Bílka
- [RFC] Improve strcpy: Faster ssse3 version.
  - From: Ondřej Bílka
