This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [Patch, AArch64] Optimized strcpy
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Earnshaw <rearnsha at arm dot com>
- Cc: Glibc Development List <libc-alpha at sourceware dot org>
- Date: Thu, 18 Dec 2014 14:45:30 +0100
- Subject: Re: [Patch, AArch64] Optimized strcpy
- Authentication-results: sourceware.org; auth=none
- References: <54917329 dot 4090601 at arm dot com> <20141218010555 dot GA914 at domone> <5492B29D dot 4010303 at arm dot com>
On Thu, Dec 18, 2014 at 10:55:25AM +0000, Richard Earnshaw wrote:
> On 18/12/14 01:05, Ondřej Bílka wrote:
> > On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> >> This patch contains an optimized implementation of strcpy for AArch64
> >> systems. Benchmarking shows that it is approximately 20-25% faster than
> >> the generic implementation across the board.
> >>
> > I looked quickly at the patch; I found two microoptimizations below and
> > a probable performance problem.
> >
>
> Ondrej,
>
> Thanks for looking at this. Unfortunately, you've looked at the wrong
> version -- the version I accidentally posted first. The correct version
> was a near complete rewrite following my own benchmarking: I posted that
> as a follow-up.
>
Yes, in the new version it's fixed, and it's probably the best possible
approach for short strings. Now I see only a few possible microoptimizations.
> > Handling sizes 1-8 is definitely not a slow path, it's the hot path. My profiler
> > shows that 88.36% of calls use less than 16 bytes, and the 1-8 byte range is
> > more likely than 9-16 bytes, so you should optimize that case well.
>
> I'd expect that. But the key question is 'how much more likely'?
From the raw data used to create the graphs, it's around 600000 calls for 1-8 bytes versus 260000 for 8-16; the call counts are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_10_3
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_100_3
> Having an inner loop dealing with 16 bytes at a time probably only costs
> about 5-10% more time *per iteration* than an inner loop dealing with 8
> bytes, since much of the cost is in waiting for the load instruction to
> return data from the cache or memory and most of the other instructions
> will dual-issue on any reasonable implementation of the architecture.
> So to win in the overall game of performance we'd need to show that
> short (<8 byte strings) were significantly more likely than 8-16 byte
> strings. That seems quite unlikely to me, though I admit I don't have
> hard numbers; looking at your data suggests that (approximately) each
> larger block size is ~20% of the size of the previous block (I'm
> guessing these are 8-byte blocks, but I can't immediately see a
> definition), which suggests to me that we have a net win with preferring
> 16-bytes over 8 bytes per iteration.
>
These are 16-byte blocks aligned to 16 bytes; there is short documentation that I should expand:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/doc/properties.html
My objection was only about handling the first 8 bytes.
As far as loops are concerned, a 16-byte one is fine; I could even try what you
gain with a 32-byte one, but then code size could become a problem.
I am not that worried about loop overhead: for each implementation you
could construct a workload where it is slow. You can only shift that weakness
so it is unlikely to be encountered. A 16-byte loop would need a lot of
32-64 byte strings but few 1-32 byte ones to look bad. OTOH an 8-byte loop would
have problems mainly in the 16-32 byte range, but would be quite fast up to 16 bytes,
as that is handled by the header.
> Note that starting off with 8-byte blocks and later switching to 16-byte
> blocks would most likely involve another unpredictable branch as we
> inserted the late realignment check.
>
Yes, these do not work.