Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Sebastian Pop <s dot pop at samsung dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>, "maxim dot kuvyrkov at linaro dot org" <maxim dot kuvyrkov at linaro dot org>, Ramana Radhakrishnan <Ramana dot Radhakrishnan at arm dot com>, "ryan dot arnold at linaro dot org" <ryan dot arnold at linaro dot org>, "adhemerval dot zanella at linaro dot org" <adhemerval dot zanella at linaro dot org>, "sebpop at gmail dot com" <sebpop at gmail dot com>, nd <nd at arm dot com>
- Date: Fri, 23 Jun 2017 21:28:10 +0000
- Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- References: <CGME20170622233226uscas1p213aefedba5fe47e520aac1226a731162@uscas1p2.samsung.com> <1498174226-16525-1-git-send-email-s.pop@samsung.com>,<637cf51c-160d-172f-6520-bba51058f85e@samsung.com>
Sebastian Pop wrote:
> If I remove all the alignment code, I get less performance on the HiKey
> A53 board.
> With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)
.p2align 6
.Lmisaligned8:
+
+ cmp limit, #8
+ b.lo .LmisalignedLt8
+
+ .p2align 5
+.Lloop_part_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ subs limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ cbnz diff, .Lnot_limit
+ b.ne .Lloop_part_aligned
+
+.LmisalignedLt8:
sub limit, limit, #1
1:
- /* Perhaps we can do better than this. */
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
subs limit, limit, #1
Where are limit_wd and limit set up?
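For reference, the aligned path in memcmp.S derives the word count from limit
before entering its loop, roughly like this (a sketch of the existing
convention, not code from this patch):

    sub     limit_wd, limit, #1     /* limit != 0, so no underflow.  */
    lsr     limit_wd, limit_wd, #3  /* Convert the byte count to 8-byte words.  */

Something equivalent presumably needs to run before .Lloop_part_aligned, and
limit itself still has to hold the residual byte count when control falls
through to .LmisalignedLt8.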
I would expect the small cases to be faster since you avoid around 10 cycles of mostly
ALU ops that make very little progress. So it should take several iterations with an extra
unaligned access before you're worse off. In memcpy (which is similar in that it also
processes two streams) I align after 96 bytes.
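Applied here, that would look something like this (a sketch only: the 96-byte
threshold is the memcpy value, and .Lmisaligned_small is a made-up label for
the simple byte loop):

    cmp     limit, #96              /* below this the realignment setup costs more than it saves */
    b.lo    .Lmisaligned_small      /* hypothetical label: stay on the simple path */
    /* Otherwise pay the one-off realignment cost and enter the word loop.  */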
> With the extra patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
/* Sources are not mutually aligned: align one of the sources and find
   the max offset from the aligned boundary.  */
- and tmp1, src1, #0x7
+ and tmp1, src2, #0x7
orr tmp3, xzr, #0x8
sub pos, tmp3, tmp1
Note it's more readable to write mov tmp3, 8. However it's even better to use a
writeback of 8 in the unaligned loads and then subtract tmp1 from src1 and src2;
this saves two instructions.
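Something like this, with the register names from the patch (a sketch; the
compare of the first words is elided):

    /* tmp1 = src2 & 0x7 from the and above.  */
    ldr     data1, [src1], #8       /* unaligned first words, post-increment by 8 */
    ldr     data2, [src2], #8
    /* ... compare data1/data2 for the first chunk here ... */
    sub     src1, src1, tmp1        /* net advance is 8 - tmp1 per source, */
    sub     src2, src2, tmp1        /* leaving src2 8-byte aligned for the word loop */

The post-increment plus the final subtraction advances each pointer by
8 - tmp1, which is exactly the pos the orr/sub pair used to materialize.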
Wilco