This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7


On 18-09-2013 12:53, Richard Henderson wrote:
> On 09/18/2013 07:34 AM, Adhemerval Zanella wrote:
>> +	extrdi.	rTMP, rALT, 8, 0
>> +	stbu	rTMP, 8(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 8
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 16
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 24
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 32
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 40
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	extrdi.	rTMP, rALT, 8, 48
>> +	stbu	rTMP, 1(rRTN)
>> +	beqlr
>> +	stbu	rALT, 1(rRTN)
> I, like Ondrej, have trouble believing that 4 arithmetic insns + 1 unaligned
> load + 1 unaligned store is slower than this compare-branch ladder.
>
> However good Power7's branch predictor is, I bet its out-of-order insn
> scheduler is better.  Issue the 6 insns, return from subroutine, surely.

You might be right, but the fact remains that on POWER7 I observed lower latency
after removing the branch hints from the instructions. You can check the results
in the first email I sent, which compares against the POWER4 (default) implementation.

>
> You've got the location of the zero in rMASK from cmpb:
>
>   cntlzd   rMASK, rMASK      // extract bit offset of nul byte
>   srdi     rMASK, rMASK, 3   // convert bit offset to byte offset
>   addi     rALT, rMASK, -7   // include the previous 7 bytes plus the nul
>   ldx      rTMP, rSRC, rALT  // perform one last unaligned copy
>   stdx     rTMP, rRTN, rALT
>   add      rRTN, rRTN, rMASK // adjust the return value
>   blr
>
> For little-endian one needs 2-3 more insns, since there's no corresponding
> count trailing zeros insn.

This is wrong: there are cases where rRTN is aligned and rALT is not, which results in
an unaligned stdx that ends up accessing invalid memory (I tested your suggestion).
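
To make the offset arithmetic in the suggested sequence easier to follow, here is a
minimal C sketch that models cmpb against zero, cntlzd, srdi and the "addi ..., -7"
step on a doubleword loaded in big-endian order. The function names are hypothetical
and exist only for this illustration; they are not part of the patch or of glibc.
The sketch only shows that the final 8-byte access starts at a non-positive offset
from the current doubleword. Whether the bytes it reaches are valid to read or write
depends on how much of the string precedes that doubleword, which is an assumption
about one way the failure could arise and is not spelled out in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Model of cmpb against a zero register: each byte of the result is 0xFF
   where the corresponding byte of `word` is zero, and 0x00 otherwise. */
static uint64_t cmpb_zero(uint64_t word)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if (((word >> (8 * i)) & 0xFF) == 0)
            mask |= (uint64_t)0xFF << (8 * i);
    }
    return mask;
}

/* Model of cntlzd for a nonzero value: number of leading zero bits. */
static unsigned count_leading_zeros64(uint64_t x)
{
    unsigned n = 0;
    assert(x != 0);
    while ((x & ((uint64_t)1 << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Byte offset (0..7) of the first nul in a doubleword loaded big-endian,
   i.e. cntlzd followed by srdi by 3.  The doubleword must contain at
   least one zero byte, otherwise the assert above fires. */
static unsigned first_nul_offset(uint64_t word)
{
    return count_leading_zeros64(cmpb_zero(word)) >> 3;
}

/* Model of "addi rALT, rMASK, -7": offset, relative to the current
   doubleword, at which the final 8-byte load/store would begin. */
static int tail_copy_start(unsigned nul_offset)
{
    return (int)nul_offset - 7;
}
```

For a doubleword holding "ABC" followed by nul bytes, the nul is at byte offset 3, so
the final 8-byte access would begin 4 bytes before the current doubleword. Only when
the nul is the last byte (offset 7) does the access coincide with the doubleword.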

