This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- From: Richard Henderson <rth at twiddle dot net>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Wed, 18 Sep 2013 11:24:22 -0700
- Subject: Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- Authentication-results: sourceware.org; auth=none
- References: <523715EE dot 9070408 at linux dot vnet dot ibm dot com> <20130917061516 dot GA30130 at bubble dot grove dot modra dot org> <5239B9FE dot 4090506 at linux dot vnet dot ibm dot com> <5239CC7B dot 5010804 at twiddle dot net> <5239E5B2 dot 9050604 at linux dot vnet dot ibm dot com>
On 09/18/2013 10:41 AM, Adhemerval Zanella wrote:
>> However good Power7's branch predictor is, I bet its out-of-order insn
>> scheduler is better. Issue the 6 insns, return from subroutine, surely.
>
> You might be right, but regardless the fact is, for POWER7, by removing the branch hints
> of the instructions I observed an improvement in the latency. You can check the result
> in the first email I sent compared to the POWER4 (default) implementation.
This I can easily believe.
>> You've got the location of the zero in rMASK from cmpb:
>>
>> cntlzd rMASK, rMASK // extract bit offset of nul byte
>> srdi rMASK, rMASK, 3 // convert bit offset to byte offset
>> addi rALT, rMASK, -7 // include the previous 7 bytes plus the nul
>> ldx rTMP, rSRC, rALT // perform one last unaligned copy
>> stdx rTMP, rRTN, rALT
>> add rRTN, rRTN, rMASK // adjust the return value
>> blr
>>
>> For little-endian one needs 2-3 more insns, since there's no corresponding
>> count trailing zeros insn.
>
> This is wrong: there is cases where rRTN may be aligned and rALT is not result in
> a unaligned stdx that ends accessing invalid memory (I tested your suggestion).
Of course rTN+rALT is not aligned, that's the whole point.
Perhaps I did make some mistake in the arithmetic above,
but it can't be that far off correct...
Actually, testing myself it seems absolutely correct.
---
#include <string.h>
#include <assert.h>
char x[16] __attribute__((aligned(8))) = { "abcd" "efgh" "ij\0l" "mnop" };
char y[16] __attribute__((aligned(8))) = { "zzzz" "zzzz" "zzzz" "zzzz" };
int main()
{
long rmask, ralt, rtmp;
char *rsrc = x + 8;
char *rdst = y + 8;
asm("ld %[rtmp], 0(%[rsrc])\n\t"
"cmpb %[rmask],%[rtmp],%[rzero]\n\t"
"cntlzd %[rmask], %[rmask]\n\t"
"srdi %[rmask], %[rmask],3\n\t"
"addi %[ralt], %[rmask],-7\n\t"
"ldx %[rtmp], %[rsrc], %[ralt]\n\t"
"stdx %[rtmp], %[rdst], %[ralt]\n\t"
"add %[rdst], %[rdst], %[rmask]"
: [rdst] "+b" (rdst),
[rtmp] "=r" (rtmp),
[ralt] "=r" (ralt),
[rmask] "=r" (rmask)
: [rzero] "r" (0),
[rsrc] "b" (rsrc)
: "memory");
assert(*rdst == '\0');
assert(memcmp(y, "zzzd" "efgh" "ij\0z" "zzzz", 16) == 0);
return 0;
}
---
The first two insns of the asm, before the blank, set up the register
conditions from the body of your loop. The assertions check that we copied the
last 8 bytes and set up the result pointer for stpcpy correctly.
r~