This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [Patch, AArch64] Optimized strcpy


On Wed, Dec 17, 2014 at 12:22:51PM +0000, Richard Earnshaw wrote:
> On 17/12/14 12:12, Richard Earnshaw wrote:
> > This patch contains an optimized implementation of strcpy for AArch64
> > systems.  Benchmarking shows that it is approximately 20-25% faster than
> > the generic implementation across the board.
> > 
> > R.
> > 
> > <date>  Richard Earnshaw  <rearnsha@arm.com>
> > 
> > 	* sysdeps/aarch64/strcpy.S: New file.
> > 
> > 
> 
> Er, sorry.  That's the wrong version of the patch.
> 
> Here's the correct one.
> 
> R.
Here are the micro-optimizations I promised:

> +	ldp	data1, data2, [srcin]
> +	add	src, srcin, #16
> +	sub	tmp1, data1, zeroones
> +	orr	tmp2, data1, #REP8_7f
> +	sub	tmp3, data2, zeroones
> +	orr	tmp4, data2, #REP8_7f
> +	bic	has_nul1, tmp1, tmp2
> +	bics	has_nul2, tmp3, tmp4
> +	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
> +	b.ne	L(early_end_found)

Flip the branch and move a copy of L(early_end_found) here: short strings
are the likely case, so keeping that path inline would reduce the
instruction cache footprint.

> +	ldp	data1a, data2a, [srcin]
> +	stp	data1a, data2a, [dst], #16
> +	sub	dst, dst, to_align
> +	/* Everything is now set up, so we can just fall into the bulk
> +	   copy loop.  */
> +	/* The inner loop deals with two Dwords at a time.  This has a
> +	   slightly higher start-up cost, but we should win quite quickly,
> +	   especially on cores with a high number of issue slots per
> +	   cycle, as we get much better parallelism out of the operations.  */
> +L(main_loop):

Again, try whether aligning the loop (e.g. a .p2align directive before
L(main_loop)) helps.  Sometimes it makes no difference, but sometimes a
loop runs twice as slowly just because of misalignment.


> +       ldp     data1, data2, [src], #16
> +       sub     tmp1, data1, zeroones
> +       orr     tmp2, data1, #REP8_7f
> +       sub     tmp3, data2, zeroones
> +       orr     tmp4, data2, #REP8_7f
> +       bic     has_nul1, tmp1, tmp2
> +       bics    has_nul2, tmp3, tmp4
> +       ccmp    has_nul1, #0, #0, eq    /* NZCV = 0000  */
> +       b.ne    L(early_end_found)

This check is unnecessary; it is better to jump back and resume the main
check, for example by replacing it with

b L(could_read_crosspage)

where the label goes here:

       tbnz    tmp2, #MIN_PAGE_P2, L(page_cross)
#endif
L(could_read_crosspage): 

You will check some bytes twice, which is fine since this branch is
almost never executed.


> +
> +	/* The string is short (<32 bytes).  We don't know exactly how
> +	   short though, yet.  Work out the exact length so that we can
> +	   quickly select the optimal copy strategy.  */
> +L(early_end_found):
> +	cmp	has_nul1, #0
> +#ifdef __AARCH64EB__
> +	/* For big-endian, carry propagation (if the final byte in the
> +	   string is 0x01) means we cannot use has_nul directly.  The
> +	   easiest way to get the correct byte is to byte-swap the data
> +	   and calculate the syndrome a second time.  */
> +	csel	data1, data1, data2, ne
> +	rev	data1, data1
> +	sub	tmp1, data1, zeroones
> +	orr	tmp2, data1, #REP8_7f
> +	bic	has_nul1, tmp1, tmp2
> +#else
> +	csel	has_nul1, has_nul1, has_nul2, ne
> +#endif

Just use a branch.  You need to decide whether the string is at least
8 bytes long anyway, so there is no additional misprediction (unless
you are optimizing for size).


> +L(lt16):
> +	/* 8->15 bytes to copy.  */
> +	ldr	data1, [srcin]

These loads are unnecessary in the likely case where there is no page
crossing; you already read this data at the start.

> +	ldr	data2, [src, #-8]
> +	str	data1, [dstin]
> +	str	data2, [dst, #-8]
> +	ret
> +L(lt8):
> +	cmp	len, #4
> +	b.lt	L(lt4)
> +	/* 4->7 bytes to copy.  */
> +	ldr	data1w, [srcin]
> +	ldr	data2w, [src, #-4]

Same comment as before.  You could also derive data2w from data1 by a
bit-shift; test whether on ARM that is faster than the load.

> +	str	data1w, [dstin]
> +	str	data2w, [dst, #-4]
> +	ret
> +L(lt4):
> +	cmp	len, #2
> +	b.lt	L(lt2)
> +	/* 2->3 bytes to copy.  */
> +	ldrh	data1w, [srcin]
> +	strh	data1w, [dstin]
> +	/* Fall-through, one byte (max) to go.  */
> +L(lt2):
> +	/* Null-terminated string.  Last character must be zero!  */
> +	strb	wzr, [dst, #-1]
> +	ret
> +END (strcpy)
> +libc_hidden_builtin_def (strcpy)

