This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7


On 09/18/2013 10:41 AM, Adhemerval Zanella wrote:
>> However good Power7's branch predictor is, I bet its out-of-order insn
>> scheduler is better.  Issue the 6 insns, return from subroutine, surely.
> 
> You might be right, but regardless the fact is, for POWER7, by removing the branch hints
> of the instructions I observed an improvement in the latency. You can check the result
> in the first email I sent compared to the POWER4 (default) implementation.

This I can easily believe.

>> You've got the location of the zero in rMASK from cmpb:
>>
>>   cntlzd   rMASK, rMASK      // extract bit offset of nul byte
>>   srdi     rMASK, rMASK, 3   // convert bit offset to byte offset
>>   addi     rALT, rMASK, -7   // include the previous 7 bytes plus the nul
>>   ldx      rTMP, rSRC, rALT  // perform one last unaligned copy
>>   stdx     rTMP, rRTN, rALT
>>   add      rRTN, rRTN, rMASK // adjust the return value
>>   blr
>>
>> For little-endian one needs 2-3 more insns, since there's no corresponding
>> count trailing zeros insn.
> 
> This is wrong: there is cases where rRTN may be aligned and rALT is not result in
> a unaligned stdx that ends accessing invalid memory (I tested your suggestion).

Of course rTN+rALT is not aligned, that's the whole point.

Perhaps I did make some mistake in the arithmetic above,
but it can't be that far off correct...

Actually, testing myself it seems absolutely correct.

---
#include <string.h>
#include <assert.h>

char x[16] __attribute__((aligned(8))) = { "abcd" "efgh" "ij\0l" "mnop" };
char y[16] __attribute__((aligned(8))) = { "zzzz" "zzzz" "zzzz"  "zzzz" };

int main()
{
    long rmask, ralt, rtmp;
    char *rsrc = x + 8;
    char *rdst = y + 8;

    asm("ld	%[rtmp], 0(%[rsrc])\n\t"
	"cmpb	%[rmask],%[rtmp],%[rzero]\n\t"

        "cntlzd	%[rmask], %[rmask]\n\t"
	"srdi	%[rmask], %[rmask],3\n\t"
	"addi	%[ralt], %[rmask],-7\n\t"
	"ldx	%[rtmp], %[rsrc], %[ralt]\n\t"
	"stdx	%[rtmp], %[rdst], %[ralt]\n\t"
	"add	%[rdst], %[rdst], %[rmask]"

	: [rdst] "+b" (rdst),
          [rtmp] "=r" (rtmp),
          [ralt] "=r" (ralt),
          [rmask] "=r" (rmask)
        : [rzero] "r" (0),
          [rsrc] "b" (rsrc)
        : "memory");

    assert(*rdst == '\0');
    assert(memcmp(y, "zzzd" "efgh" "ij\0z" "zzzz", 16) == 0);
    return 0;
}
---

The first two insns of the asm, before the blank, set up the register
conditions from the body of your loop.  The assertions check that we copied the
last 8 bytes and set up the result pointer for stpcpy correctly.


r~


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]