This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [Patch, AArch64] Optimized strcpy


On Thu, Dec 18, 2014 at 10:55:25AM +0000, Richard Earnshaw wrote:
> On 18/12/14 01:05, Ondřej Bílka wrote:
> > On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> >> This patch contains an optimized implementation of strcpy for AArch64
> >> systems.  Benchmarking shows that it is approximately 20-25% faster than
> >> the generic implementation across the board.
> >>
> > I looked quickly at the patch; I found two microoptimizations below and
> > a probable performance problem.
> > 
> 
> Ondrej,
> 
> Thanks for looking at this.  Unfortunately, you've looked at the wrong
> version -- the version I accidentally posted first.  The correct version
> was a near complete rewrite following my own benchmarking: I posted that
> as a follow-up.
>
Yes, in the new version it's fixed, and it's probably the best possible
approach for short strings. Now I see only a few possible microoptimizations.
 
> > Handling sizes 1-8 is definitely not a slow path, it's the hot path. My
> > profiler shows that 88.36% of calls use less than 16 bytes, and the 1-8
> > byte range is more likely than 9-16 bytes, so you should optimize that
> > case well.
> 
> I'd expect that.  But the key question is 'how much more likely'?

From the raw data used to create the graphs, it's around 600000 calls for 1-8 bytes versus 260000 for 8-16 bytes; the call counts are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_10_3
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_100_3

> Having an inner loop dealing with 16 bytes at a time probably only costs
> about 5-10% more time *per iteration* than an inner loop dealing with 8
> bytes, since much of the cost is in waiting for the load instruction to
> return data from the cache or memory and most of the other instructions
> will dual-issue on any reasonable implementation of the architecture.
> So to win in the overall game of performance we'd need to show that
> short (<8 byte strings) were significantly more likely than 8-16 byte
> strings.  That seems quite unlikely to me, though I admit I don't have
> hard numbers; looking at your data suggests that (approximately) each
> larger block size is ~20% of the size of the previous block (I'm
> guessing these are 8-byte blocks, but I can't immediately see a
> definition), which suggests to me that we have a net win with preferring
> 16-bytes over 8 bytes per iteration.
> 
These are 16-byte blocks that are aligned to 16 bytes; there is short documentation, which I should expand:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/doc/properties.html

My objection was only about handling the first 8 bytes.
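
For context, here is a minimal C sketch of how the first 8 bytes can be tested for a terminator with a single load, using the well-known word-at-a-time zero-byte test. It is an illustration only, not the code from the patch: `has_zero_byte` and `short_len8` are hypothetical names, and the caller is assumed to guarantee that 8 bytes are readable at `s`.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of w is 0x00. */
static inline uint64_t has_zero_byte(uint64_t w)
{
    return (w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL;
}

/* Hypothetical helper: returns the string length if the terminator
   lies within the first 8 bytes of s, or -1 otherwise.
   Assumption: at least 8 bytes are readable at s. */
static int short_len8(const char *s)
{
    uint64_t w;

    memcpy(&w, s, sizeof w);      /* one 8-byte load, alignment-safe */
    if (!has_zero_byte(w))
        return -1;                /* string continues past 8 bytes */

    for (int i = 0; i < 8; i++)   /* locate the terminator byte */
        if (s[i] == '\0')
            return i;
    return -1;                    /* not reached */
}
```

A real implementation would find the terminator position with bit operations rather than a byte loop; the loop here only keeps the sketch independent of endianness.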

As far as loops are concerned, a 16-byte one is fine. I could even try what
you gain with a 32-byte one, but then code size could be a problem.

I am not that worried about loop overhead; for each implementation you
could construct a workload where it's slow. You can only shift that weakness
so that it's unlikely to be encountered. To hurt a 16-byte loop you would
need a lot of 32-64 byte strings but few 1-32 byte ones. OTOH, an 8-byte loop
would have problems mainly with the 16-32 byte range, which would be quite
fast with a 16-byte loop as it's handled by the header.
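
A toy cost model can make this shifting of the weak spot concrete. The sketch below is hypothetical throughout: the structure `copy_model`, the functions `copy_cost` and `mean_cost`, and any cost values fed into them are illustrative assumptions, not measurements of the patch.

```c
#include <stddef.h>

/* Toy cost model, NOT measured data: a header copies up to `header`
   bytes for a fixed cost, then a loop copies `step` bytes per
   iteration at `iter_cost` each.  All values are hypothetical. */
struct copy_model {
    size_t header;       /* bytes handled before entering the loop */
    double header_cost;  /* fixed cost of the header, arbitrary units */
    size_t step;         /* bytes per loop iteration (e.g. 8 or 16) */
    double iter_cost;    /* cost of one loop iteration */
};

/* Modelled cost of copying `len` characters plus the terminating NUL. */
static double copy_cost(const struct copy_model *m, size_t len)
{
    size_t total = len + 1;
    if (total <= m->header)
        return m->header_cost;

    size_t rest = total - m->header;
    size_t iters = (rest + m->step - 1) / m->step;  /* round up */
    return m->header_cost + (double)iters * m->iter_cost;
}

/* Average modelled cost per call over a histogram of string lengths. */
static double mean_cost(const struct copy_model *m, const size_t *len,
                        const double *calls, size_t n)
{
    double sum = 0.0, total_calls = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += calls[i] * copy_cost(m, len[i]);
        total_calls += calls[i];
    }
    return total_calls > 0.0 ? sum / total_calls : 0.0;
}
```

For example, one could compare `{16, 1.0, 8, 1.0}` against `{16, 1.0, 16, 1.1}`, taking the 10% per-iteration penalty from the estimate quoted above as an assumption, and feed `mean_cost` the call counts from the profile to see which length ranges decide the outcome.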



> Note that starting off with 8-byte blocks and later switching to 16-byte
> blocks would most likely involve another unpredictable branch as we
> inserted the late realignment check.
> 
Yes, these do not work.

