This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 07/27] S390: Optimize strlen and wcslen.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Fri, 26 Jun 2015 15:00:58 +0200
- Subject: Re: [PATCH 07/27] S390: Optimize strlen and wcslen.
- Authentication-results: sourceware.org; auth=none
- References: <1435319512-22245-1-git-send-email-stli at linux dot vnet dot ibm dot com> <1435319512-22245-8-git-send-email-stli at linux dot vnet dot ibm dot com>
On Fri, Jun 26, 2015 at 01:51:32PM +0200, Stefan Liebler wrote:
> This patch provides optimized versions of strlen and wcslen with the z13 vector
I haven't read the details about z13, so I will ask. These questions apply
to all of these functions.
> + lghi %r5,0 /* current_len = 0. */
> +
> + /* Align s to 16 byte. */
This way of masking tends to be slow because of inputs where, due to
alignment, the first check reads only a few bytes.
In my experience the fastest approach is to first check for a page cross
and, if there is none, do an unaligned 16-byte load. That looks possible
with vll, unless it is limited to a cache line.
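The first-iteration strategy I mean can be sketched in C as follows. This is only an illustration of the idea, not the S390 code: `PAGE_SIZE`, `VEC_SIZE`, and the byte loops standing in for vector loads/compares are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size */
#define VEC_SIZE  16    /* vector register width in bytes */

/* An unaligned 16-byte load starting at s is safe unless it would
   cross into the next (possibly unmapped) page.  */
static int load_may_cross_page(const char *s)
{
    return ((uintptr_t)s & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

size_t strlen_sketch(const char *s)
{
    size_t i = 0;

    if (!load_may_cross_page(s)) {
        /* Fast path: read the first 16 bytes unaligned and scan them.
           (A real implementation would use one vector load + compare;
           this byte loop just stands in for that.)  */
        for (i = 0; i < VEC_SIZE; i++)
            if (s[i] == '\0')
                return i;
        /* No NUL in the first 16 bytes: continue from the next
           16-byte boundary with aligned loads.  */
        i = VEC_SIZE - ((uintptr_t)s & (VEC_SIZE - 1));
    }
    /* Slow path near a page boundary, and the continuation: a plain
       byte loop stands in for the aligned vector loop.  */
    while (s[i] != '\0')
        i++;
    return i;
}
```

The point is that the branchy "first check covers only `16 - (s % 16)` bytes" pattern disappears: most calls get a full 16-byte first check regardless of alignment.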
> + risbg %r3,%r2,60,128+63,0 /* Test if s is aligned and
> + %r3 = bits 60-63 'and' 15. */
> + je .Lloop1 /* If s is aligned, loop aligned. */
This is a performance problem, as it is a relatively unpredictable branch
(29.8% of calls are aligned to 16 bytes), so you save a few cycles on the
aligned path but lose more to mispredictions.
> + lghi %r4,15
> + slr %r4,%r3 /* Compute highest index to load (15-x). */
> + vll %v16,%r4,0(%r2) /* Load up to 16 byte boundary. (vll needs
> + highest index, remaining bytes are 0.) */
> + ahi %r4,1 /* Work with loaded byte count. */
> + vfenezb %v16,%v16,%v16 /* Find element not equal with zero search. */
> + vlgvb %r5,%v16,7 /* Load zero index or 16 if not found. */
> + clr %r5,%r4 /* If found zero within loaded bytes? */
> + locgrl %r2,%r5 /* Then copy return value. */
> + blr %r14 /* And return. */
> + lgr %r5,%r4 /* No zero within loaded bytes,
> + process further bytes aligned. */
> + /* Find zero in 16 byte aligned loop. */
> +.Lloop1:
> + vl %v16,0(%r5,%r2) /* Load s. */
> + aghi %r5,16
> + vfenezbs %v16,%v16,%v16 /* Find element not equal with zero search. */
> + je .Lfound /* Jump away if zero was found. */
> + vl %v16,0(%r5,%r2)
> + aghi %r5,16
> + vfenezbs %v16,%v16,%v16
> + je .Lfound
> + vl %v16,0(%r5,%r2)
What addressing is allowed? If you can use immediate offsets, then the
following looks faster:
vl %v16,0(%r2)
vfenezbs %v16,%v16,%v16
je .Lfound0
vl %v16,16(%r2)
vfenezbs %v16,%v16,%v16
je .Lfound1
> + aghi %r5,16
> + vfenezbs %v16,%v16,%v16
> + je .Lfound
> + vl %v16,0(%r5,%r2)
> + aghi %r5,16
> + vfenezbs %v16,%v16,%v16
> + jne .Lloop1 /* No zero found -> loop. */
> +
> +.Lfound:
> + vlgvb %r2,%v16,7 /* Load byte index of zero. */
> + slgfi %r5,16 /* current_len -=16 */
> + algr %r2,%r5
> + br %r14
> +END(__strlen_vx)