This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Fri, 10 Jul 2015 01:27:10 +0200
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <559617FF dot 8010100 at redhat dot com> <20150703085542 dot GE32307 at domone> <55968AF8 dot 8060104 at redhat dot com> <20150703171121 dot GA23898 at domone> <1436283324 dot 12188 dot 25 dot camel at oc7878010663> <20150709190252 dot GD18030 at domone> <559ECC05 dot 8040901 at linaro dot org> <20150709215130 dot GA2410 at domone> <559EF2DD dot 1050703 at linaro dot org>
On Thu, Jul 09, 2015 at 07:17:01PM -0300, Adhemerval Zanella wrote:
>
>
> On 09-07-2015 18:51, Ondřej Bílka wrote:
> > On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 09-07-2015 16:02, Ondřej Bílka wrote:
> >>> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> >>>> Not so simple on PowerISA as we don't have PC-relative addressing.
> >>>>
> >>>> 1) The global entry requires 2 instructions to establish the TOC/GOT
> >>>> 2) Medium model requires two instructions (fused) to load a pointer from
> >>>> the GOT.
> >>>> 3) Finally we can load the cached hwcap.
> >>>>
> >>>> None of this is required for the TP+offset.
> >>>>
> >>> And why didn't you write that when it was first suggested? When you don't answer,
> >>> it looks like you don't want to answer because that suggestion is better.
> >>>
> >>> Here the problem isn't the lack of relative addressing but that you don't start
> >>> with the GOT in a register.
> >>>
> >>> You certainly could do a hack similar to the one you do with the tcb and place the
> >>> hwcap bits just after it, so you would need just one load.
> >>>
> >>> That you require so many instructions on powerpc is a gcc bug rather than a
> >>> rule. You don't need that many instructions when you place frequently used
> >>> symbols in the -32768..32767 range. For example, here you could save one
> >>> addition.
> >>>
> >>> int x, y;
> >>> int foo()
> >>> {
> >>> return x + y;
> >>> }
> >>>
> >>> original
> >>>
> >>> 00000000000007d0 <foo>:
> >>> 7d0: 02 00 4c 3c addis r2,r12,2
> >>> 7d4: 30 78 42 38 addi r2,r2,30768
> >>> 7d8: 00 00 00 60 nop
> >>> 7dc: 30 80 42 e9 ld r10,-32720(r2)
> >>> 7e0: 00 00 00 60 nop
> >>> 7e4: 38 80 22 e9 ld r9,-32712(r2)
> >>> 7e8: 00 00 6a 80 lwz r3,0(r10)
> >>> 7ec: 00 00 29 81 lwz r9,0(r9)
> >>> 7f0: 14 4a 63 7c add r3,r3,r9
> >>> 7f4: b4 07 63 7c extsw r3,r3
> >>> 7f8: 20 00 80 4e blr
> >>>
> >>> new
> >>>
> >>> addis r2,r12,2
> >>> ld r10,-1952(r2)
> >>> ld r9,-1944(r2)
> >>> lwz r3,0(r10)
> >>> lwz r9,0(r9)
> >>> add r3,r3,r9
> >>> extsw r3,r3
> >>> blr
> >>
> >> No, you can't: you need to take into consideration that the powerpc64le ELFv2 ABI
> >> has two entry points for every function, global and local. The former is used when
> >> you need to materialize the TOC, while with the latter you can reuse the same TOC.
> >> And the compiler has no information about this; it has to be decided by the linker.
> >>
> > Of course I can; reusing the TOC is not mandatory. That would just decrease
> > performance a bit for local calls.
>
> Reusing the TOC is exactly the optimization the linker does to avoid calling the
> global entry point. And the problem is that 1. it still requires materializing
> the TOC on global entry points, where you will need to save/restore it
> in PLT stubs, and 2. you will need a hwcap copy per TOC/DSO. I think
> Steven's proposal is meant exactly to avoid these. In fact this was one option
> I advocated to him before he reminded me of the issues.
>
As for 1, that isn't a problem: when you use PLT stubs you already have
bigger hazards at entry, so you don't have to worry about getting hwcap.
As for inter-DSO stubs, where you could use the local entry, this happens only
when you repeatedly call a function from a different DSO. Moreover, you must use
only local variables there; otherwise you would need to materialize the TOC
anyway and it would be free for hwcap. It also doesn't look good, as you
should use an ifunc generated by gcc anyway to jump directly after the check
and save a few cycles.
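
The "check once, then jump directly" idea can be sketched roughly as below.
This is only an illustration under stated assumptions: it uses a plain
function pointer set in a constructor as a stand-in for a real ifunc, and
DEMO_FEATURE_BIT, sum_generic, sum_tuned and select_sum are hypothetical
names, not anything from glibc or the patch. Only getauxval and AT_HWCAP are
real interfaces.

```c
#include <sys/auxv.h>

/* Hypothetical feature mask, for illustration only; real bit names
   come from the platform headers. */
#define DEMO_FEATURE_BIT 0x1UL

typedef int (*sum_fn)(const int *, int);

/* Baseline variant. */
static int sum_generic(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a variant tuned for the feature; it returns the same result. */
static int sum_tuned(const int *v, int n)
{
    int s = 0;
    for (int i = n - 1; i >= 0; i--)
        s += v[i];
    return s;
}

/* Pick a variant from a hwcap word. */
static sum_fn select_sum(unsigned long hwcap)
{
    return (hwcap & DEMO_FEATURE_BIT) ? sum_tuned : sum_generic;
}

/* Safe default until the constructor has run. */
static sum_fn sum_impl = sum_generic;

/* The hwcap check happens once, at load time. */
__attribute__((constructor))
static void init_sum_impl(void)
{
    sum_impl = select_sum(getauxval(AT_HWCAP));
}

/* Call sites pay one indirect call and never look at hwcap again. */
int sum(const int *v, int n)
{
    return sum_impl(v, n);
}
```

With a real ifunc the selection would be done by the dynamic linker at
relocation time instead of in a constructor, but the effect on the hot path
is the same: no hwcap load per call.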
2. is one of my main criticisms. What argument did Steven use to convince
you?
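
For reference, a per-DSO hwcap copy could look roughly like the sketch
below. It is a minimal illustration, not the layout from any patch: the
names dso_hwcap, dso_hwcap_init and dso_has_feature are hypothetical, and
the only real interfaces used are getauxval and AT_HWCAP from <sys/auxv.h>.

```c
#include <sys/auxv.h>

/* One copy per DSO. Hidden visibility keeps the symbol local to this
   DSO, so it is not exported or shared with other objects. */
__attribute__((visibility("hidden"))) unsigned long dso_hwcap;

/* Fill the copy once when the DSO is loaded. */
__attribute__((constructor))
static void dso_hwcap_init(void)
{
    dso_hwcap = getauxval(AT_HWCAP);
}

/* Hypothetical helper: true when every bit in mask is set. */
static inline int dso_has_feature(unsigned long mask)
{
    return (dso_hwcap & mask) == mask;
}
```

Only DSOs that include such code carry the copy, which is the scaling
property discussed next.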
The problem is that while his proposal scales with the number of threads, which
is always at least 1, this one scales with the number of DSOs that use hwcap. On
average that could be 0.05 or similar, as most packages won't use it at all.
So I ask once again: where is your evidence that it will be used frequently?
In particular, enough to justify the cost paid by binaries where it is never
used, a cost that grows as they create many threads?
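
The scaling argument can be written down as simple arithmetic. The sketch
below assumes, purely for illustration, that each cached copy holds HWCAP
and HWCAP2 as two 64-bit words (16 bytes); the function names are
hypothetical and the numbers are not measurements.

```c
#include <stddef.h>

/* Assumption: one cached copy = HWCAP + HWCAP2, two 64-bit words. */
enum { HWCAP_COPY_BYTES = 16 };

/* TCB approach: every thread carries a copy, used or not. */
static size_t tcb_cost(size_t threads)
{
    return threads * HWCAP_COPY_BYTES;
}

/* Per-DSO approach: only DSOs that actually test hwcap carry a copy. */
static size_t dso_cost(size_t dsos_using_hwcap)
{
    return dsos_using_hwcap * HWCAP_COPY_BYTES;
}
```

Under these assumptions a process with 1000 threads and no hwcap users pays
tcb_cost(1000) = 16000 bytes in one scheme and dso_cost(0) = 0 bytes in the
other.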