Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Sebastian Pop <s dot pop at samsung dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>, "maxim dot kuvyrkov at linaro dot org" <maxim dot kuvyrkov at linaro dot org>, Ramana Radhakrishnan <Ramana dot Radhakrishnan at arm dot com>, "ryan dot arnold at linaro dot org" <ryan dot arnold at linaro dot org>, "adhemerval dot zanella at linaro dot org" <adhemerval dot zanella at linaro dot org>, "sebpop at gmail dot com" <sebpop at gmail dot com>, nd <nd at arm dot com>
- Date: Fri, 23 Jun 2017 21:28:10 +0000
- Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
- References: <CGME20170622233226uscas1p213aefedba5fe47e520aac1226a731162@uscas1p2.samsung.com> <1498174226-16525-1-git-send-email-s.pop@samsung.com>,<637cf51c-160d-172f-6520-bba51058f85e@samsung.com>
Sebastian Pop wrote:
> If I remove all the alignment code, I get less performance on the HiKey
> A53 board.
> With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)
.p2align 6
.Lmisaligned8:
+
+ cmp limit, #8
+ b.lo .LmisalignedLt8
+
+ .p2align 5
+.Lloop_part_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ subs limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ cbnz diff, .Lnot_limit
+ b.ne .Lloop_part_aligned
+
+.LmisalignedLt8:
sub limit, limit, #1
1:
- /* Perhaps we can do better than this. */
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
subs limit, limit, #1
Where are limit_wd and limit set up?
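For reference, the aligned path in memcmp.S derives the word count from limit
before entering its loop, roughly like this (a sketch of the existing
convention, not code from this patch):

    sub     limit_wd, limit, #1     /* limit != 0, so no underflow.  */
    lsr     limit_wd, limit_wd, #3  /* Convert the byte count to 8-byte words.  */

Something equivalent presumably needs to run before .Lloop_part_aligned, and
limit itself still has to hold the residual byte count when control falls
through to .LmisalignedLt8.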
I would expect the small cases to be faster since you avoid around 10 cycles of mostly
ALU ops that make very little progress. So it should take several iterations with an extra
unaligned access before you're worse off. In memcpy (which is similar in that it also
processes two streams) I align after 96 bytes.
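Applied here, that would look something like this (a sketch only: the 96-byte
threshold is the memcpy value, and .Lmisaligned_small is a made-up label for
the simple byte loop):

    cmp     limit, #96              /* below this the realignment setup costs more than it saves */
    b.lo    .Lmisaligned_small      /* hypothetical label: stay on the simple path */
    /* Otherwise pay the one-off realignment cost and enter the word loop.  */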
> With the extra patch:
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
/* Sources are not mutually aligned: align one of the sources and find
   the max offset from the aligned boundary.  */
- and tmp1, src1, #0x7
+ and tmp1, src2, #0x7
orr tmp3, xzr, #0x8
sub pos, tmp3, tmp1
Note it's more readable to write mov tmp3, 8. However it's even better to use a
writeback of 8 in the unaligned loads and then subtract tmp1 from src1 and src2;
this saves two instructions.
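Something like this, with the register names from the patch (a sketch; the
compare of the first words is elided):

    /* tmp1 = src2 & 0x7 from the and above.  */
    ldr     data1, [src1], #8       /* unaligned first words, post-increment by 8 */
    ldr     data2, [src2], #8
    /* ... compare data1/data2 for the first chunk here ... */
    sub     src1, src1, tmp1        /* net advance is 8 - tmp1 per source, */
    sub     src2, src2, tmp1        /* leaving src2 8-byte aligned for the word loop */

The post-increment plus the final subtraction advances each pointer by
8 - tmp1, which is exactly the pos the orr/sub pair used to materialize.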
Wilco