This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH, AArch64] Add optimized strchrnul


On Fri, Jun 13, 2014 at 11:23:26AM +0100, Richard Earnshaw wrote:
> Here is an optimized implementation of __strchrnul.  The simplification
> that we don't have to track precisely why the loop terminates (match or
> end-of-string) means we have to do less work in both setup and the core
> inner loop.  That means this should never be slower than strchr.
> 
> As with strchr, the use of LD1 means we do not need different versions
> for big-/little-endian.
> 
> <date>  Richard Earnshaw  <rearnsha@arm.com>
> 
> 	* sysdeps/aarch64/strchrnul.S: New file.
> 
> OK?

A few comments: the hot path in strchrnul is the first 64 bytes, so
you should focus on those.

First, get a profile. The link below is a simple program that collects
the sizes of strchrnul calls and then runs them again; it is a good
first approximation of real performance.

http://kam.mff.cuni.cz/~ondra/dryrun_strchrnul.tar.bz2
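
If you cannot run that tarball, a rough substitute is an LD_PRELOAD
interposer that logs how many bytes each strchrnul call scans. This is
only my own sketch, not the dryrun tool, the file name is made up, and
it only catches calls that go through the PLT:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Log the number of bytes scanned by each strchrnul call; replaying
   the logged sizes approximates the call distribution seen in practice.
   Build: gcc -shared -fPIC trace_strchrnul.c -o trace_strchrnul.so -ldl
   Use:   LD_PRELOAD=./trace_strchrnul.so <program>  */

char *
strchrnul (const char *s, int c)
{
  static char *(*real) (const char *, int);
  if (real == NULL)
    real = (char *(*) (const char *, int)) dlsym (RTLD_NEXT, "strchrnul");
  char *r = real (s, c);
  fprintf (stderr, "strchrnul %zu\n", (size_t) (r - s));
  return r;
}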

After you collect calls from the programs that interest you, try to
compare these. Comparing the old and new implementations is the
minimum, but I have several questions.

First, what is the latency of unaligned loads? One performance problem
on x64 was small strings that cross a 64-byte boundary. It turned out
to be faster to first check that we do not cross a page and then do an
unaligned comparison on the first 64 bytes. That needs to be checked
here as well.
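
Something like this untested C sketch of the header, where the
4096-byte page size, little-endian byte order and all names are my
assumptions, and the scalar bit trick only stands in for the SIMD
compares:

#include <stdint.h>
#include <string.h>

/* Set the high bit of each byte of V that equals the byte REP
   (classic "has zero byte" trick applied to V ^ REP).  */
static inline uint64_t
has_byte (uint64_t v, uint64_t rep)
{
  uint64_t x = v ^ rep;
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

char *
strchrnul_header (const char *s, int c)
{
  uint64_t rep = 0x0101010101010101ULL * (unsigned char) c;

  /* If the first 64 bytes cannot cross a page, it is safe at the
     machine level to read them with unaligned 8-byte loads, even
     past the terminating NUL.  */
  if (((uintptr_t) s & 4095) <= 4096 - 64)
    for (int i = 0; i < 64; i += 8)
      {
        uint64_t v;
        memcpy (&v, s + i, 8);		/* unaligned 8-byte load */
        uint64_t m = has_byte (v, rep) | has_byte (v, 0);
        if (m != 0)			/* lowest set bit = first match */
          return (char *) s + i + (__builtin_ctzll (m) >> 3);
      }

  /* Otherwise, or if nothing matched in 64 bytes, fall back to a
     byte loop standing in for the aligned SIMD loop.  */
  for (;; s++)
    if (*s == (char) c || *s == '\0')
      return (char *) s;
}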

A second trick is to first check for page crossing, then align down to
16 bytes and do a 32-byte compare, so the header always compares at
least 16 valid bytes.
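
Again only an untested sketch with assumed names and page size; the two
16-byte SIMD compares are modelled with a per-byte helper so that
masking the padding bits of the result is safe, as in your patch:

#include <stdint.h>

/* Emulate a 16-byte SIMD compare: bit i of the result is set when
   block[i] equals C or is the terminating NUL.  */
static inline uint32_t
cmp16 (const char *block, int c)
{
  uint32_t m = 0;
  for (int i = 0; i < 16; i++)
    if (block[i] == (char) c || block[i] == '\0')
      m |= (uint32_t) 1 << i;
  return m;
}

char *
strchrnul_header16 (const char *s, int c)
{
  const char *base = (const char *) ((uintptr_t) s & ~(uintptr_t) 15);

  /* The aligned 32-byte block never crosses a page here, so reading
     the 0..15 padding bytes before S cannot fault.  */
  if (((uintptr_t) base & 4095) <= 4096 - 32)
    {
      uint32_t m = cmp16 (base, c) | (cmp16 (base + 16, c) << 16);
      m &= (uint32_t) -1 << (s - base);	/* drop matches in the padding */
      if (m != 0)
        return (char *) base + __builtin_ctz (m);
    }

  /* Fall back to a byte loop standing in for the aligned loop (for
     simplicity it rescans from S).  */
  for (;; s++)
    if (*s == (char) c || *s == '\0')
      return (char *) s;
}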



> +
> +ENTRY (__strchrnul)
> +	/* Magic constant 0x40100401 to allow us to identify which lane
> +	   matches the termination condition.  */
> +	mov	wtmp2, #0x0401
> +	movk	wtmp2, #0x4010, lsl #16
> +	dup	vrepchr.16b, chrin
> +	bic	src, srcin, #31		/* Work with aligned 32-byte hunks.  */
> +	dup	vrepmask.4s, wtmp2
> +	ands	tmp1, srcin, #31
> +	b.eq	L(loop)
> +
Omitting this branch would improve performance if the code below still
works on an aligned hunk. In the loop there is a risk of a branch
misprediction, which is slower than staying in the header. The
condition here could also cause a misprediction, but when I looked at
the data, 32-byte-aligned strings are rare.

> +	/* Input string is not 32-byte aligned.  Rather than forcing
> +	   the padding bytes to a safe value, we calculate the syndrome
> +	   for all the bytes, but then mask off those bits of the
> +	   syndrome that are related to the padding.  */
> +	ld1	{vdata1.16b, vdata2.16b}, [src], #32
> +	neg	tmp1, tmp1
> +	cmeq	vhas_nul1.16b, vdata1.16b, #0
> +	cmeq	vhas_chr1.16b, vdata1.16b, vrepchr.16b
> +	cmeq	vhas_nul2.16b, vdata2.16b, #0
> +	cmeq	vhas_chr2.16b, vdata2.16b, vrepchr.16b
> +	orr	vhas_chr1.16b, vhas_chr1.16b, vhas_nul1.16b
> +	orr	vhas_chr2.16b, vhas_chr2.16b, vhas_nul2.16b
> +	and	vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
> +	and	vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
> +	lsl	tmp1, tmp1, #1
> +	addp	vend1.16b, vhas_chr1.16b, vhas_chr2.16b	// 256->128
> +	mov	tmp3, #~0
> +	addp	vend1.16b, vend1.16b, vend1.16b		// 128->64
> +	lsr	tmp1, tmp3, tmp1
> +
> +	mov	tmp3, vend1.2d[0]
> +	bic	tmp1, tmp3, tmp1	// Mask padding bits.
> +	cbnz	tmp1, L(tail)
> +
> +L(loop):
> +	ld1	{vdata1.16b, vdata2.16b}, [src], #32
> +	cmeq	vhas_nul1.16b, vdata1.16b, #0
> +	cmeq	vhas_chr1.16b, vdata1.16b, vrepchr.16b
> +	cmeq	vhas_nul2.16b, vdata2.16b, #0
> +	cmeq	vhas_chr2.16b, vdata2.16b, vrepchr.16b
> +	/* Use a fast check for the termination condition.  */
> +	orr	vhas_chr1.16b, vhas_nul1.16b, vhas_chr1.16b
> +	orr	vhas_chr2.16b, vhas_nul2.16b, vhas_chr2.16b
> +	orr	vend1.16b, vhas_chr1.16b, vhas_chr2.16b
> +	addp	vend1.2d, vend1.2d, vend1.2d
> +	mov	tmp1, vend1.2d[0]
> +	cbz	tmp1, L(loop)
> +
> +	/* Termination condition found.  Now need to establish exactly why
> +	   we terminated.  */
> +	and	vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
> +	and	vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
> +	addp	vend1.16b, vhas_chr1.16b, vhas_chr2.16b		// 256->128
> +	addp	vend1.16b, vend1.16b, vend1.16b		// 128->64
> +
> +	mov	tmp1, vend1.2d[0]
> +L(tail):
> +	/* Count the trailing zeros, by bit reversing...  */
> +	rbit	tmp1, tmp1
> +	/* Re-bias source.  */
> +	sub	src, src, #32
> +	clz	tmp1, tmp1	/* ... and counting the leading zeros.  */
> +	/* tmp1 is twice the offset into the fragment.  */
> +	add	result, src, tmp1, lsr #1
> +	ret
> +
> +END(__strchrnul)
> +weak_alias (__strchrnul, strchrnul)


