This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: powerpc __tls_get_addr call optimization


On Sat, Mar 21, 2015 at 01:37:02PM +1030, Alan Modra wrote:
> On Fri, Mar 20, 2015 at 11:27:12AM -0400, Rich Felker wrote:
> > On Fri, Mar 20, 2015 at 06:25:02PM +1030, Alan Modra wrote:
> > > On Thu, Mar 19, 2015 at 11:33:16PM -0400, Carlos O'Donell wrote:
> > > > On 03/18/2015 10:56 PM, Alan Modra wrote:
> > > > > On Wed, Mar 18, 2015 at 01:07:32PM -0400, Carlos O'Donell wrote:
> > > > >> On 03/18/2015 02:11 AM, Alan Modra wrote:
> > > > >>> Now that Alex's fixes for static TLS have gone in, I figure it's worth
> > > > >>> revisiting an old patch of mine.
> > > > >>> https://sourceware.org/ml/libc-alpha/2009-03/msg00053.html
> > > > >>
> > > > >> I'm not against this patch, but it certainly seems like you would be
> > > > >> better served by just implementing tls descriptors?
> > > > > 
> > > > > I think this is one better than tls descriptors, because powerpc
> > > > > avoids the indirect function call used by tls descriptors.
> > > > 
> > > > You mean to say it is "faster" than tls descriptors, but at the same
> > > 
> > > To be honest, there isn't much difference in the optimized case where
> > > static TLS is available.  It boils down to an indirect call to a
> > > function that loads one value vs. a direct call to a stub that loads
> > > two values and compares one against zero.  I think what I've
> > > implemented is slightly better for PowerPC, but whether that would
> > > carry over to other architectures is debatable.
> > 
> > If the performance difference isn't measurable in real-world
> > applications, I would think uniformity between targets would be a lot
> > more valuable.
> 
> Think of my design as "TLS descriptors version 2".  I take the best
> features of TLS descriptors and add one trick, the special linker
> stub, that allows you to omit many of the nasty details of the current
> TLS descriptor design.  A target that currently has TLS support but no
> TLS descriptor support and follows the powerpc design:
> 1) won't need to implement gcc changes for tls descriptors,
> 2) won't need to define new relocations,
> 3) won't need to implement linker support for tls descriptors, quite a
>    large effort, and
> 4) won't need to implement dl-tlsdesc.S and tlsdesc.c in glibc, also
>    not a simple task.
> Another benefit in terms of reliability (and repeatable user timing!)
> is that extended TLS descriptors are not needed, so the locking and
> mallocing in tlsdeschtab.h is avoided.

If the lazy allocation stuff is removed (which it should be; it breaks
AS-safety and other things), the last issue would go away.

> Admittedly, part of the reason a port is so much easier is due to
> omitting lazy TLS resolution.  Lazy TLS is complex.  What's more, the
> per-target support code is non-trivial.  All of tlsdesc.c and half of
> dl-tlsdesc.S is lazy TLS support.  I question whether the added
> complexity provides commensurate benefit in real-world applications,
> apart from the degenerate case of loading a shared library that is
> never used.  (And even then, you'd need a lot of __thread variables to
> make it worthwhile.)
> 
> In fact, I wouldn't be surprised to find lazy TLS has a net negative
> benefit in real-world applications!
> /me dons asbestos suit.  :)

I completely agree. I want to see it removed.

> > I also don't see how your approach is a "direct call". The function
> > being called is in a different DSO so it has to go through a pointer
> > in the GOT or similar, in which case it's just as "indirect" as the
> > TLSDESC call would be.
> 
> It is a direct call to the linker provided stub, which will return
> after a few instructions in the optimized case when static TLS is
> available.

That linker-provided stub address is loaded from a "GOT slot" of some
sort, just like the tlsdesc function would be. Either way you have a
PC/GP-relative load followed by a jump to the loaded address. There's
actually one additional level of indirection to load this pointer for
TLSDESC, but for static TLS, the callee returns instantly after
performing a single load.
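
To make the comparison concrete, here is a rough C model of the two static-TLS fast paths discussed above. It is only a sketch under stated assumptions: `fake_thread_block`, `thread_pointer`, `got_pair`, `slow_path`, and the other names are hypothetical, the thread pointer is faked with a static buffer, and the choice of the module word as the value compared against zero is an assumption, since the thread only says the stub "loads two values and compares one against zero". C cannot show how the call reaches the stub, so that part of the question is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the thread pointer; real code reads a register. */
static char fake_thread_block[64];
static char *thread_pointer(void) { return fake_thread_block; }

/* TLSDESC-style descriptor: a resolver function pointer plus one argument. */
struct tlsdesc {
    ptrdiff_t (*entry)(struct tlsdesc *);
    ptrdiff_t arg;
};

/* Static-TLS resolver: performs a single load and returns the offset. */
static ptrdiff_t tlsdesc_static(struct tlsdesc *d) { return d->arg; }

/* Caller side: load the function pointer from the descriptor, call it
   indirectly, then add the returned offset to the thread pointer. */
void *tlsdesc_access(struct tlsdesc *d)
{
    return thread_pointer() + d->entry(d);
}

/* Stub-style design: a module/offset pair (hypothetical layout). */
struct got_pair {
    uintptr_t module;   /* assumed to be 0 when static TLS was assigned */
    uintptr_t offset;   /* then holds the thread-pointer-relative offset */
};

static int slow_path_calls;

/* Hypothetical placeholder for the full dynamic lookup. */
static void *slow_path(struct got_pair *p)
{
    (void)p;
    slow_path_calls++;
    return NULL;
}

/* Stub: two loads and one compare against zero, then an early return. */
void *stub_access(struct got_pair *p)
{
    uintptr_t module = p->module;
    uintptr_t offset = p->offset;
    if (module == 0)
        return thread_pointer() + offset;
    return slow_path(p);
}

/* Both fast paths yield the same address for the same offset. */
void example_usage(void)
{
    struct tlsdesc d = { tlsdesc_static, 16 };
    struct got_pair p = { 0, 16 };
    assert(tlsdesc_access(&d) == (void *)(fake_thread_block + 16));
    assert(stub_access(&p) == (void *)(fake_thread_block + 16));
    assert(slow_path_calls == 0);
}
```

In this model the two fast paths differ only in where the work happens: the TLSDESC callee does one load behind an indirect call, while the stub does two loads and a compare before returning.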

With non-TLSDESC dynamic TLS on the other hand, there's an additional
PC/GP-relative address computation (for the module/offset structure's
address to pass) in the caller, which should roughly even out with the
cost of the extra indirection for TLSDESC. But then there's a fair bit of
additional work to be done in the callee.
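
As an illustration of that extra callee work, the following is a simplified C model of a general dynamic lookup. It is a sketch, not glibc's implementation: `model_tls_get_addr`, `dtv`, `module_blocks`, and `MAX_MODULES` are hypothetical names, the per-thread vector is a plain global array, and on-demand allocation is faked with static storage. A real implementation also has to deal with generation counters and resizing, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Module/offset structure whose address the caller computes and passes. */
struct tls_index {
    size_t module;
    size_t offset;
};

#define MAX_MODULES 4

/* Hypothetical per-thread vector: dtv[m] points at module m's TLS block,
   or is NULL if the block has not been set up for this thread yet. */
static char *dtv[MAX_MODULES];

/* Static storage standing in for dynamically allocated TLS blocks. */
static char module_blocks[MAX_MODULES][32];

void *model_tls_get_addr(const struct tls_index *ti)
{
    /* Callee work: index the vector, check for a missing block, and
       set one up on first use (a real implementation may allocate). */
    if (dtv[ti->module] == NULL)
        dtv[ti->module] = module_blocks[ti->module];

    /* Finally add the variable's offset within the module's block. */
    return dtv[ti->module] + ti->offset;
}
```

Compared with the static case, every access pays for the vector lookup and the missing-block check even after the block exists.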

> Control is passed to __tls_get_addr_opt only when no static TLS was
> available for the shared library at the time the library was
> dynamically relocated, ie. it was dlopen'ed and not enough spare
> static TLS was free.

Where is control passed if static TLS was used? Maybe I'm
misunderstanding your design? How would the dynamic linker resolve
some calls to __tls_get_addr to different places than other calls,
when there's only a single GOT entry for it?

Rich

