This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: libc-alpha at sourceware dot org
- Date: Thu, 09 Jul 2015 16:31:17 -0300
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <559617FF dot 8010100 at redhat dot com> <20150703085542 dot GE32307 at domone> <55968AF8 dot 8060104 at redhat dot com> <20150703171121 dot GA23898 at domone> <1436283324 dot 12188 dot 25 dot camel at oc7878010663> <20150709190252 dot GD18030 at domone>
On 09-07-2015 16:02, Ondřej Bílka wrote:
> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>>> But these could be done without much of our help. We need to keep these
>>> writable to support this hack. I don't know exact assembly for powerpc,
>>> it should be similar to how do it on x64:
>>>
>>> int x;
>>>
>>> int foo()
>>> {
>>> #ifdef SHARED
>>> asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
>>> #else
>>> asm ("lea x(%rip), %rax; movb $32, (%rax)");
>>> #endif
>>> return &x;
>>> }
>>>
>>
>> Not so simple on PowerISA as we don't have PC-relative addressing.
>>
>> 1) The global entry requires 2 instruction to establish the TOC/GOT
>> 2) Medium model requires two instructions (fused) to load a pointer from
>> the GOT.
>> 3) Finally we can load the cached hwcap.
>>
>> None of this is required for the TP+offset.
>>
> And why didn't you write that when it was first suggested? When you don't
> answer, it looks like you don't want to answer because that suggestion is
> better.
>
> Here the problem isn't the lack of relative addressing but that you don't
> start with the GOT in a register.
>
> You certainly could do a similar hack to the one you do with the TCB and
> place the hwcap bits just after that, so you would need just one load.
>
> That you require so many instructions on powerpc is a gcc bug rather than
> the rule. You don't need that many instructions when you place frequently
> used symbols in the -32768..32767 range. For example, here you could save
> one addition.
>
> int x, y;
> int foo()
> {
> return x + y;
> }
>
> original
>
> 00000000000007d0 <foo>:
> 7d0: 02 00 4c 3c addis r2,r12,2
> 7d4: 30 78 42 38 addi r2,r2,30768
> 7d8: 00 00 00 60 nop
> 7dc: 30 80 42 e9 ld r10,-32720(r2)
> 7e0: 00 00 00 60 nop
> 7e4: 38 80 22 e9 ld r9,-32712(r2)
> 7e8: 00 00 6a 80 lwz r3,0(r10)
> 7ec: 00 00 29 81 lwz r9,0(r9)
> 7f0: 14 4a 63 7c add r3,r3,r9
> 7f4: b4 07 63 7c extsw r3,r3
> 7f8: 20 00 80 4e blr
>
> new
>
> addis r2,r12,2
> ld r10,-1952(r2)
> ld r9,-1944(r2)
> lwz r3,0(r10)
> lwz r9,0(r9)
> add r3,r3,r9
> extsw r3,r3
> blr
No, you can't. You need to take into consideration that the powerpc64le ELFv2
ABI has two entry points for every function, global and local: the former is
used when the TOC has to be materialized, while the latter is used when the
caller's TOC can be reused. The compiler has no information about which one a
given call will use; that has to be decided by the linker.
For the example you posted, the assembly is:
foo:
0: addis 2,12,.TOC.-0b@ha
addi 2,2,.TOC.-0b@l
.localentry foo,.-foo
addis 10,2,.LC0@toc@ha # gpr load fusion, type long
ld 10,.LC0@toc@l(10)
addis 9,2,.LC1@toc@ha # gpr load fusion, type long
ld 9,.LC1@toc@l(9)
lwz 3,0(10)
lwz 9,0(9)
add 3,3,9
extsw 3,3
blr
Even if you place the symbol in the -32768..32767 range, you still need
to take into consideration that the function can be entered either at '0:'
or at the '.localentry', and in both cases you need the proper TOC. And
on POWER8 the addis+ld pair should be fused, resulting in latency similar
to a single load instruction.
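
For anyone who wants to reproduce the listing above, a minimal sketch of a
self-contained source file follows. It is only an illustration under some
assumptions: the helper names call_foo_locally and check_foo are hypothetical
and not part of the patch, and whether the same-module call really ends up
bound to the local entry point (or gets inlined) depends on the compiler
options and on the linker.

```c
#include <assert.h>

/* Globals that the compiler reaches through the TOC on powerpc64le.  */
int x, y;

/* Same function as in the quoted example.  On powerpc64le ELFv2, gcc is
   expected to emit both a global entry (TOC setup from r12) and a
   .localentry marker for it.  */
int
foo (void)
{
  return x + y;
}

/* Hypothetical caller in the same module.  A call like this may be
   resolved to foo's local entry point, because caller and callee can
   share one TOC; a call from another module has to use the global
   entry.  The compiler may also simply inline foo here.  */
int
call_foo_locally (int a, int b)
{
  x = a;
  y = b;
  return foo ();
}

/* Tiny self-check so the file can be dropped into any test driver.  */
void
check_foo (void)
{
  assert (call_foo_locally (2, 3) == 5);
}
```

Compiling this with "gcc -O2 -S" on a powerpc64le system and reading the
generated assembly should show the two entry points discussed above.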
>
>
>> Telling me how x86 does things is not much help.
>
> That's why we need to know how that would work on powerpc.
>
>>>
>>>> Without a concrete implementation I can't comment on one or the other.
>>>> It is in my opinion overly harsh to force IBM to go implement this new
>>>> feature. They have space in the TCB per the ABI and may use it for their
>>>> needs. I think the community should investigate symbol address munging
>>>> as a method for storing data in addresses and make a generic API from it,
>>>> likewise I think the community should investigate standardizing tp+offset
>>>> data access behind a set of accessor macros and normalizing the usage
>>>> across the 5 or 6 architectures that use it.
>>>>
>>> I would like this, since with access to that I could improve the
>>> performance of several inlines.
>>>
>>>
>>>>> Also, I now have an additional comment on the API: if you want faster
>>>>> checks, wouldn't it be faster to save each bit of hwcap into a byte
>>>>> field so you could avoid using a mask at each check?
>>>>
>>>> That is an *excellent* suggestion, and exactly the type of technical
>>>> feedback that we should be giving IBM, and Carlos can confirm if they've
>>>> tried such "unpacking" of the bits into byte fields. Such unpacking is
>>>> common in other machine implementations.
>>>>
>> This does not help on Power. Any aligned load (byte, halfword, word,
>> doubleword, quadword) has the same performance. Splitting our bits into
>> bytes just slows things down. Consider:
>>
>> if (__builtin_cpu_supports(ARCH_2_07) &&
>> __builtin_cpu_supports(VEC_CRYPTO))
>>
>> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
>> byte Boolean.
>>
>> Again, value judgements about what is fast or slow can vary by platform.
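
To make the two layouts under discussion concrete, here is a minimal sketch in
C. Everything in it is an assumption for illustration: FEAT_A and FEAT_B are
hypothetical masks whose bit positions mirror the y.c benchmark quoted below
(they are not the real hwcap bit values), and struct feat_bytes is a made-up
name for the "one byte per feature" form. It says nothing about which form is
faster; it only shows that both answer the same question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature masks; the bit positions mirror the y.c benchmark
   and are not the real hwcap bit values.  */
#define FEAT_A (UINT32_C(1) << 17)
#define FEAT_B (UINT32_C(1) << 21)

/* Packed form: a single word load covers any number of feature bits.  */
static inline bool
has_both_packed (uint32_t hwcap)
{
  return (hwcap & (FEAT_A | FEAT_B)) == (FEAT_A | FEAT_B);
}

/* Unpacked form: one byte per feature, each holding 0 or 1.  */
struct feat_bytes
{
  uint8_t a;
  uint8_t b;
};

static inline struct feat_bytes
unpack_features (uint32_t hwcap)
{
  struct feat_bytes f = { (hwcap & FEAT_A) != 0, (hwcap & FEAT_B) != 0 };
  return f;
}

/* Needs one byte load per feature tested, but no mask constant.  */
static inline bool
has_both_bytes (const struct feat_bytes *f)
{
  return f->a & f->b;
}

/* The two forms must agree for every combination of the two bits.  */
static inline void
check_forms_agree (void)
{
  const uint32_t samples[] = { 0, FEAT_A, FEAT_B, FEAT_A | FEAT_B };
  for (unsigned i = 0; i < 4; i++)
    {
      struct feat_bytes f = unpack_features (samples[i]);
      assert (has_both_packed (samples[i]) == has_both_bytes (&f));
    }
}
```

The packed check scales to more features by widening the mask, while the byte
form adds one load per extra feature, which is the trade-off being argued
about here.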
>
> Instruction count means nothing if you don't have good intuition about
> the powerpc platform. If you take that into account, your three
> instructions are a lot slower than byte Booleans.
>
> Use the following benchmark. You need separate compilation to simulate
> many calls of a function that uses hwcap without gcc optimizing them
> away. I put a computation before the hwcap selection because without it
> there wouldn't be much difference: with OoO execution it would mostly
> measure the latency of the loads. It would still be slower, but it's
> 1.90s vs 1.92s.
>
> Adding a third check makes no difference, and the case of one check is
> obviously faster.
>
> Also, how are you sure that checking several flags happens often enough
> to justify any potential savings from more checks, if there were any
> savings?
>
> The benchmark is as follows:
>
> [neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:;
> cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3
> c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y
>
> c.c:
> volatile int v, w;
> volatile int u;
> int main()
> {
> u= -1;
> v = 1; w = 1;
> long i;
> unsigned long sum = 0;
> for (i=0;i<500000000;i++)
> sum += foo(sum, 42);
> return sum;
>
> }
> x.c:
> extern int v,w;
> int __attribute__((noinline))foo(int x, int y){
> x= 3 * x - 32 + y;
> y = 4 * x + 5;
> if (v & w)
> return 3 * x;
> return 5 * y;
> }
>
> y.c:
> extern int u;
> int __attribute__((noinline))foo(int x, int y){
> x= 3 * x - 32 + y;
> y = 4 * x + 5;
> if (((u&((1<<17)|(1<<21)))==((1<<17)|(1<<21))))
> return 3 * x;
> return 5 * y;
> }
>
>
> real 0m2.390s
> user 0m2.389s
> sys 0m0.001s
>
> real 0m2.531s
> user 0m2.529s
> sys 0m0.001s
>
> real 0m2.390s
> user 0m2.389s
> sys 0m0.001s
>
> real 0m2.532s
> user 0m2.530s
> sys 0m0.001s
>
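
For readers who want to try the comparison without juggling three files, below
is a rough single-file sketch of the same idea. It is not the benchmark that
produced the numbers above: the names foo_flags, foo_packed, run and variant_fn
are hypothetical, unsigned arithmetic replaces int to avoid signed overflow,
the GCC noinline attribute stands in for separate compilation, and the call
through a function pointer adds overhead the original does not have. Timings
from it should be treated as indicative only.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Volatile so every call really reloads the flags, as in the original.  */
volatile uint32_t flag_v = 1, flag_w = 1;   /* one variable per feature */
volatile uint32_t hwcap_u = UINT32_MAX;     /* packed word, all bits set */

#define MASK ((UINT32_C(1) << 17) | (UINT32_C(1) << 21))

/* Variant of x.c: test two separate flag variables.  */
__attribute__ ((noinline)) uint32_t
foo_flags (uint32_t x, uint32_t y)
{
  x = 3 * x - 32 + y;
  y = 4 * x + 5;
  if (flag_v & flag_w)
    return 3 * x;
  return 5 * y;
}

/* Variant of y.c: test two bits of one packed word.  */
__attribute__ ((noinline)) uint32_t
foo_packed (uint32_t x, uint32_t y)
{
  x = 3 * x - 32 + y;
  y = 4 * x + 5;
  if ((hwcap_u & MASK) == MASK)
    return 3 * x;
  return 5 * y;
}

typedef uint32_t (*variant_fn) (uint32_t, uint32_t);

/* Call FN in a dependent chain ITERATIONS times, store the elapsed
   wall-clock time in *SECONDS, and return the accumulated sum.  */
uint64_t
run (variant_fn fn, long iterations, double *seconds)
{
  struct timespec t0, t1;
  uint64_t sum = 0;

  assert (iterations >= 0);
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < iterations; i++)
    sum += fn ((uint32_t) sum, 42);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  *seconds = (double) (t1.tv_sec - t0.tv_sec)
             + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
  return sum;
}
```

A driver would call run (foo_flags, 500000000, &s) and run (foo_packed,
500000000, &s) in turn and compare the two elapsed times; both variants take
the same branch here, so the returned sums should match.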