This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB



On 09-07-2015 16:02, Ondřej Bílka wrote:
> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>>> But these could be done without much of our help. We need to keep these
>>> writable to support this hack. I don't know the exact assembly for powerpc,
>>> but it should be similar to how we do it on x64:
>>>
>>> int x;
>>>
>>> int *foo()
>>> {
>>> #ifdef SHARED
>>> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
>>> #else
>>> asm ("lea x(%rip), %rax; movb $32, (%rax)");
>>> #endif
>>> return &x;
>>> }
>>>
>>
>> Not so simple on PowerISA as we don't have PC-relative addressing.
>>
>> 1) The global entry requires two instructions to establish the TOC/GOT.
>> 2) The medium model requires two instructions (fused) to load a pointer
>> from the GOT.
>> 3) Finally we can load the cached hwcap.
>>
>> None of this is required for the TP+offset.
>>
> And why didn't you write that when it was first suggested? When you don't
> answer, it looks like you don't want to answer because that suggestion is
> better.
>
> The problem here isn't the lack of relative addressing but that you don't
> start with the GOT in a register.
>
> You certainly could do a similar hack to the one you do with the TCB and
> place the hwcap bits just after it, so you would need just one load.
>
> That you require so many instructions on powerpc is a gcc bug rather than
> a rule. You don't need that many instructions when you place frequent
> symbols in the -32768..32767 range. For example, here you could save one
> addition.
> 
> int x, y;
> int foo()
> {
>   return x + y;
> }
> 
> original
> 
> 00000000000007d0 <foo>:
>  7d0:	02 00 4c 3c 	addis   r2,r12,2
>  7d4:	30 78 42 38 	addi    r2,r2,30768
>  7d8:	00 00 00 60 	nop
>  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
>  7e0:	00 00 00 60 	nop
>  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
>  7e8:	00 00 6a 80 	lwz     r3,0(r10)
>  7ec:	00 00 29 81 	lwz     r9,0(r9)
>  7f0:	14 4a 63 7c 	add     r3,r3,r9
>  7f4:	b4 07 63 7c 	extsw   r3,r3
>  7f8:	20 00 80 4e 	blr
> 
> new
> 
>  	addis   r2,r12,2
> 	ld      r10,-1952(r2)
> 	ld      r9,-1944(r2)
> 	lwz     r3,0(r10)
> 	lwz     r9,0(r9)
> 	add     r3,r3,r9
> 	extsw   r3,r3
> 	blr

No, you can't. You need to take into consideration that the powerpc64le
ELFv2 ABI has two entry points for every function, global and local. The
former is used when the TOC has to be materialized, while the latter is used
when the caller already shares the same TOC. The compiler has no information
about which one will be used; that has to be decided by the linker.

For the example you posted, the assembly is:

foo:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry	foo,.-foo
	addis 10,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 10,.LC0@toc@l(10)
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	lwz 3,0(10)
	lwz 9,0(9)
	add 3,3,9
	extsw 3,3
	blr

Even if you place the symbol in the -32768..32767 range, you still need
to take into consideration that the function can be entered either at '0:'
or at the '.localentry', and in both cases you need the proper TOC. And
on POWER8 the addis+ld pair should be fused, resulting in latency similar
to that of a single load instruction.


> 
>  
>> Telling me how x86 does things is not much help.
> 
> That's why we need to know how that would work on powerpc.
> 
>>>
>>>> Without a concrete implementation I can't comment on one or the other.
>>>> It is in my opinion overly harsh to force IBM to go implement this new
>>>> feature. They have space in the TCB per the ABI and may use it for their
>>>> needs. I think the community should investigate symbol address munging
>>>> as a method for storing data in addresses and make a generic API from it,
>>>> likewise I think the community should investigate standardizing tp+offset
>>>> data access behind a set of accessor macros and normalizing the usage
>>>> across the 5 or 6 architectures that use it.
>>>>
>>> I would like this, as with access to that I could improve the
>>> performance of several inlines.
>>>
>>>
>>>>> Also, I now have an additional comment on the API: if you want faster
>>>>> checks, wouldn't it be faster to save each bit of hwcap into a byte
>>>>> field so you could avoid using a mask at each check?
>>>>
>>>> That is an *excellent* suggestion, and exactly the type of technical
>>>> feedback that we should be giving IBM, and Carlos can confirm if they've
>>>> tried such "unpacking" of the bits into byte fields. Such unpacking is
>>>> common in other machine implementations.
>>>>
>> This does not help on Power. Any aligned load (byte, halfword, word,
>> doubleword, quadword) has the same performance. Splitting our bits into
>> bytes just slows things down. Consider:
>>
>> if (__builtin_cpu_supports(ARCH_2_07) &&   
>>     __builtin_cpu_supports(VEC_CRYPTO))
>>
>> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
>> byte Booleans.
>>
>> Again, value judgements about what is fast or slow can vary by platform.
> 
> Instruction count means nothing if you don't have good intuition about
> the powerpc platform. If you measure them, your three instructions are a
> lot slower than byte Booleans.
>
> Use the following benchmark. You need separate compilation to simulate
> many calls of a function that uses hwcap and is not optimized away by
> gcc. I added a computation before the hwcap selection, because without it
> there wouldn't be much difference: with OoO execution it would mostly
> measure the latency of the loads. It would still be slower, but it is
> 1.90s vs 1.92s.
>
> Adding a third check makes no difference, and the case of one check is
> obviously faster.
>
> Also, how are you sure that checking more flags happens often enough to
> justify any potential savings with more checks, if there were any savings?
>
> The benchmark is the following:
> 
> [neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
> cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
> c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y
> 
> c.c:
> volatile int v, w;
> volatile int u;
> int foo(int x, int y); /* defined in x.c or y.c */
> int main()
> {
>   u= -1;
>   v = 1; w = 1;
>   long i;
>   unsigned long sum = 0;
>   for (i=0;i<500000000;i++)
>     sum += foo(sum, 42);
>   return sum;
> 
> }
> x.c:
> extern int v,w;
> int __attribute__((noinline))foo(int x, int y){
>  x= 3 * x - 32 + y;
>  y = 4 * x + 5;
>  if (v & w)
>    return 3 * x;
>  return 5 * y;
> }
> 
> y.c:
> extern int u;
> int __attribute__((noinline))foo(int x, int y){
>  x= 3 * x - 32 + y;
>  y = 4 * x + 5;
>  if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
>    return 3 * x;
>  return 5 * y;
> }
> 
> 
> real	0m2.390s
> user	0m2.389s
> sys	0m0.001s
> 
> real	0m2.531s
> user	0m2.529s
> sys	0m0.001s
> 
> real	0m2.390s
> user	0m2.389s
> sys	0m0.001s
> 
> real	0m2.532s
> user	0m2.530s
> sys	0m0.001s
> 
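For reference, the two representations being compared in the quoted benchmark
can be put side by side in one file. This is only a minimal sketch: the names
FEAT_A, FEAT_B, struct feature_bytes and the helper functions are hypothetical,
and the bit positions 17 and 21 are simply the ones used in the y.c mask above,
not the real PPC_FEATURE_* values. It shows that the two forms answer the same
question; it says nothing about which one is faster on a given machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits: positions 17 and 21 are taken from the y.c
   mask quoted above, not from the real PPC_FEATURE_* definitions. */
#define FEAT_A (UINT32_C(1) << 17)
#define FEAT_B (UINT32_C(1) << 21)

/* Packed form: one word, one mask, one compare. */
static bool has_both_packed(uint32_t hwcap)
{
  const uint32_t mask = FEAT_A | FEAT_B;
  return (hwcap & mask) == mask;
}

/* Unpacked form: one byte per feature, each holding 0 or 1. */
struct feature_bytes { uint8_t a; uint8_t b; };

static struct feature_bytes unpack_features(uint32_t hwcap)
{
  struct feature_bytes f;
  f.a = (hwcap & FEAT_A) != 0;
  f.b = (hwcap & FEAT_B) != 0;
  return f;
}

static bool has_both_bytes(const struct feature_bytes *f)
{
  return (f->a & f->b) != 0;
}

/* Both representations must agree for any hwcap word. */
static void check_agreement(uint32_t hwcap)
{
  struct feature_bytes f = unpack_features(hwcap);
  assert(has_both_packed(hwcap) == has_both_bytes(&f));
}
```

The packed check needs one load plus a mask and compare, while the byte form
needs one load per feature; which of these wins is exactly the
platform-dependent question argued above and has to be measured.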

