This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 27/27] S390: Optimize memrchr.
- From: OndÅej BÃlka <neleai at seznam dot cz>
- To: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Thu, 2 Jul 2015 01:11:38 +0200
- Subject: Re: [PATCH 27/27] S390: Optimize memrchr.
- Authentication-results: sourceware.org; auth=none
- References: <1435319512-22245-1-git-send-email-stli at linux dot vnet dot ibm dot com> <1435319512-22245-28-git-send-email-stli at linux dot vnet dot ibm dot com>
On Fri, Jun 26, 2015 at 01:51:52PM +0200, Stefan Liebler wrote:
> This patch provides optimized version of memrchr with the z13 vector
> instructions.
>> +
> +.Lfound_g16:
> + vlgvb %r1,%v17,7 /* Index of c or 16 if not found. */
> + lgr %r2,%r5
> + lghi %r4,15 /* Highest index of a full vr. */
> + clr %r1,%r4
> +.Lfound_lt16:
> + la %r2,0(%r1,%r2) /* Store current pointer to found c. */
> + ber %r14 /* Return if found c is last loaded byte. */
> +
> + /* Shift vector elements left and start search again. */
> + aghi %r1,1 /* Start search after current index. */
> + slr %r4,%r1 /* New highest index. */
> + sll %r1,3 /* Calculate byte count for vector shift
> + left. */
> + vlvgg %v17,%r1,0
> + vslb %v16,%v16,%v17 /* Vector shift left by byte by number of bytes
> + specified in bits 1-4 of byte 7 in v17. */
> + vfeeb %v17,%v16,%v18 /* Find c. */
> + la %r5,1(%r2) /* Save start-address of shifted v16. */
> + vlgvb %r1,%v17,7 /* Index of c or 16 if not found */
> + clr %r1,%r4
> + locgrle %r2,%r5 /* Use stored address as base if c found. */
> + jle .Lfound_lt16 /* Found c within loaded bytes. */
> + br %r14 /* No further c found, return last stored c. */
> +
> +.Lfound_end:
> + la %r2,0(%r4,%r2) /* Return pointer to found c */
> + br %r14
> +
This sequence looks slow. Both here and in strrchr you should use
something better than loop. One possibility is to use shuffle to do
16-byte byteswap but it needs load constant.
A second way is to use same trick as in calculating clz with ctz. If
instruction produces nonzero for match an 0 for nonmatch then to find
highest nonzero byte you first need or all lower bytes with it for
example by
x |= x >> 8;
x |= x >> 4;
x |= x >> 2;
x |= x >> 1;
Then you look for index of first zero byte which is one more than
highest nonzero byte. When roles of 1 and 0 are reversed and all nonzero
bytes have bit in common you could use &.
For strrchr idea is same. When handling trailing zero you could make
similar trick with zero mask
x |= x << 8;
x |= x << 4;
x |= x << 2;
x |= x << 1;
x = ~x;
to construct vector that is 255 before and at first zero byte and 0 after
that. Its important that first zero will get 255 to handle strrchr(x,0)
without separate check if c is zero.