This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Thu, Jul 09, 2015 at 07:17:01PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 09-07-2015 18:51, Ondřej Bílka wrote:
> > On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 09-07-2015 16:02, Ondřej Bílka wrote:
> >>> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
> >>>> Not so simple on PowerISA as we don't have PC-relative addressing.
> >>>>
> >>>> 1) The global entry requires 2 instruction to establish the TOC/GOT
> >>>> 2) Medium model requires two instructions (fused) to load a pointer from
> >>>> the GOT.
> >>>> 3) Finally we can load the cached hwcap.
> >>>>
> >>>> None of this is required for the TP+offset.
> >>>>
> >>> And why didn't you write that when it was first suggested? When you don't
> >>> answer, it looks like you don't want to answer because that suggestion is better.
> >>>
> >>> Here the problem isn't the lack of relative addressing but that you don't
> >>> start with the GOT in a register.
> >>>
> >>> You certainly could do a similar hack as you do with the tcb and place the
> >>> hwcap bits just after that, so you could do just one load.
> >>>
> >>> That you require so many instructions on powerpc is a gcc bug rather than
> >>> a rule. You don't need that many instructions when you place frequent
> >>> symbols in the -32768..32767 range. For example, here you could save one
> >>> addition.
> >>>
> >>> int x, y;
> >>> int foo()
> >>> {
> >>>   return x + y;
> >>> }
> >>>
> >>> original
> >>>
> >>> 00000000000007d0 <foo>:
> >>>  7d0:	02 00 4c 3c 	addis   r2,r12,2
> >>>  7d4:	30 78 42 38 	addi    r2,r2,30768
> >>>  7d8:	00 00 00 60 	nop
> >>>  7dc:	30 80 42 e9 	ld      r10,-32720(r2)
> >>>  7e0:	00 00 00 60 	nop
> >>>  7e4:	38 80 22 e9 	ld      r9,-32712(r2)
> >>>  7e8:	00 00 6a 80 	lwz     r3,0(r10)
> >>>  7ec:	00 00 29 81 	lwz     r9,0(r9)
> >>>  7f0:	14 4a 63 7c 	add     r3,r3,r9
> >>>  7f4:	b4 07 63 7c 	extsw   r3,r3
> >>>  7f8:	20 00 80 4e 	blr
> >>>
> >>> new
> >>>
> >>>  	addis   r2,r12,2
> >>> 	ld      r10,-1952(r2)
> >>> 	ld      r9,-1944(r2)
> >>> 	lwz     r3,0(r10)
> >>> 	lwz     r9,0(r9)
> >>> 	add     r3,r3,r9
> >>> 	extsw   r3,r3
> >>> 	blr
> >>
> >> No, you can't. You need to take into consideration that the powerpc64le ELFv2
> >> ABI has two entrypoints for every function, global and local, with the former
> >> being used when you need to materialize the TOC, while with the latter you can
> >> use the same TOC. And the compiler has no information regarding this; it has
> >> to be decided by the linker.
> >>
> > Of course I can; reusing the TOC is not mandatory. That would just decrease
> > performance a bit for local calls.
> 
> Reusing the TOC is exactly the optimization the linker will do to avoid
> calling the global entrypoint.  And the problem is that 1. it still requires
> materializing the TOC on global entrypoints, where you will need to
> save/restore it in PLT stubs, and 2. you will need a hwcap copy per TOC/DSO.
> I think Steven's proposal is exactly to avoid these. In fact this was one
> option I advocated to him before he reminded me of the issues.
>
As for 1., that isn't a problem: when you go through PLT stubs you already
have bigger hazards at entry, so you don't have to worry about the cost of
getting hwcap. As for inter-DSO stubs, you could use the local entry; this
happens only when you repeatedly call a function from a different DSO.
Moreover, you must use only local variables there; otherwise you would need
to materialize the TOC anyway, and then it would be free for hwcap. Also, it
doesn't look good, as you should use an ifunc generated by gcc anyway, to
jump directly after the check and save a few cycles.
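
To illustrate the "check once, then jump directly" idea, here is a minimal
sketch in portable C. It is not a real ifunc: the resolver is an ordinary
function that takes the hwcap word as a parameter, whereas with a real ifunc
the dynamic linker runs the resolver at relocation time. The names
DEMO_HWCAP_VEC, sum_generic, sum_feature and resolve_sum are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit, for illustration only; real bit values come
   from the platform headers, not from this sketch. */
#define DEMO_HWCAP_VEC (UINT64_C(1) << 3)

typedef long (*sum_fn)(const int *v, size_t n);

/* Baseline implementation, valid on every CPU. */
static long sum_generic(const int *v, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += v[i];
  return s;
}

/* Stand-in for a variant that would use the optional feature; here it is
   plain C unrolled by two so that the sketch runs anywhere. */
static long sum_feature(const int *v, size_t n)
{
  long s0 = 0, s1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s0 += v[i];
      s1 += v[i + 1];
    }
  if (i < n)
    s0 += v[i];
  return s0 + s1;
}

/* Resolver: tests the feature bit once and returns the implementation to
   call from then on, so the hot path never reads hwcap again. */
sum_fn resolve_sum(uint64_t hwcap)
{
  return (hwcap & DEMO_HWCAP_VEC) ? sum_feature : sum_generic;
}
```

A caller would store the result of resolve_sum() in a function pointer at
startup and call through it afterwards; the hwcap test is then off the fast
path entirely.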

2. is one of my main criticisms. What argument did Steven use to convince
you?

The problem is that while his proposal scales with the number of threads,
which is greater than 1, this one scales with the number of DSOs that use
hwcap, which on average could be 0.05 or similar, as most packages won't use
it at all. So I ask once again: where is your evidence that it will be
frequently used? In particular, why pay the cost in binaries where it is
never used, a cost that increases because they could create many threads?
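
The scaling argument can be put as rough arithmetic. This is only a sketch:
the 16-byte size of one cached copy (HWCAP plus HWCAP2 as two 64-bit words)
is an assumption, not a figure from the patch, and both function names are
hypothetical.

```c
#include <stddef.h>

/* Assumed size of one cached copy: HWCAP plus HWCAP2 as two 64-bit words.
   This figure is an assumption of the sketch, not taken from the patch. */
#define DEMO_COPY_BYTES 16u

/* TCB approach: every thread carries a copy, whether it is read or not. */
size_t tcb_cost_bytes(size_t threads)
{
  return threads * DEMO_COPY_BYTES;
}

/* Per-DSO approach: one copy for each DSO that actually reads hwcap. */
size_t dso_cost_bytes(size_t dsos_using_hwcap)
{
  return dsos_using_hwcap * DEMO_COPY_BYTES;
}
```

Under that assumed size, a process with 1000 threads and no hwcap user pays
16000 bytes with the first scheme and nothing with the second.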


