Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Wed, 18 Sep 2013 14:41:06 -0300
- Subject: Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7
- References: <523715EE dot 9070408 at linux dot vnet dot ibm dot com> <20130917061516 dot GA30130 at bubble dot grove dot modra dot org> <5239B9FE dot 4090506 at linux dot vnet dot ibm dot com> <5239CC7B dot 5010804 at twiddle dot net>
On 18-09-2013 12:53, Richard Henderson wrote:
> On 09/18/2013 07:34 AM, Adhemerval Zanella wrote:
>> + extrdi. rTMP, rALT, 8, 0
>> + stbu rTMP, 8(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 8
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 16
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 24
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 32
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 40
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + extrdi. rTMP, rALT, 8, 48
>> + stbu rTMP, 1(rRTN)
>> + beqlr
>> + stbu rALT, 1(rRTN)
> I, like Ondrej, have trouble believing that 4 arithmetic insns + 1 unaligned
> load + 1 unaligned store is slower than this compare-branch ladder.
>
> However good Power7's branch predictor is, I bet its out-of-order insn
> scheduler is better. Issue the 6 insns, return from subroutine, surely.
You might be right, but regardless, the fact is that on POWER7 I observed an improvement in
latency by removing the branch hints from these instructions. You can check the results in
the first email I sent, compared against the POWER4 (default) implementation.
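For reference, the ladder above is roughly equivalent to the following C (big-endian
byte layout assumed; the helper name is mine, not something from the patch):

    #include <stdint.h>

    /* Rough C model of the extrdi./stbu/beqlr ladder: 'word' is the
       doubleword (rALT) known to contain the nul, and 'dst' is rRTN,
       still pointing at the last byte written by the doubleword loop.  */
    static char *
    store_tail_bytes (char *dst, uint64_t word)
    {
      dst += 8;                     /* first stbu uses an 8(rRTN) update */
      for (int i = 0; ; i++, dst++)
        {
          unsigned char b = (word >> (56 - 8 * i)) & 0xff; /* extrdi. 8, 8*i */
          *dst = b;                                        /* stbu */
          if (b == 0 || i == 7)                            /* beqlr / last stbu */
            return dst;                                    /* points at the nul */
        }
    }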
>
> You've got the location of the zero in rMASK from cmpb:
>
> cntlzd rMASK, rMASK // extract bit offset of nul byte
> srdi rMASK, rMASK, 3 // convert bit offset to byte offset
> addi rALT, rMASK, -7 // include the previous 7 bytes plus the nul
> ldx rTMP, rSRC, rALT // perform one last unaligned copy
> stdx rTMP, rRTN, rALT
> add rRTN, rRTN, rMASK // adjust the return value
> blr
>
> For little-endian one needs 2-3 more insns, since there's no corresponding
> count trailing zeros insn.
This is wrong: there are cases where rRTN may be aligned while rALT is not, resulting in
an unaligned stdx that ends up accessing invalid memory (I tested your suggestion).
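To make the failure mode concrete, here is a rough C model of that final copy (the names
are mine, and the buffer-start scenario below is just one illustration of the out-of-range
access):

    #include <stdint.h>
    #include <string.h>

    /* Rough C model of the suggested tail: 'dst'/'src' point at the
       doubleword in which cmpb found the nul and 'nul_off' (0..7) is its
       byte offset within that doubleword.  */
    static char *
    tail_copy (char *dst, const char *src, unsigned nul_off)
    {
      long back = (long) nul_off - 7;   /* addi rALT, rMASK, -7  -> in [-7, 0] */
      uint64_t tmp;
      memcpy (&tmp, src + back, 8);     /* ldx  rTMP, rSRC, rALT */
      memcpy (dst + back, &tmp, 8);     /* stdx rTMP, rRTN, rALT */
      /* If fewer than 7 bytes of src/dst precede this doubleword (a short
         string whose nul is found early), src + back and dst + back point
         before the start of the buffers.  */
      return dst + nul_off;             /* add  rRTN, rRTN, rMASK */
    }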