This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: TLS redux


On Wed, Jan 15, 2014 at 12:13:59AM -0500, Rich Felker wrote:
> On Tue, Jan 14, 2014 at 06:23:35PM -0800, Roland McGrath wrote:
> > I've finally caught up on the long threads about TLS issues.
> > (The good news is that this was a sizable fraction of all of my
> > libc-related backlog, so I'm much less behind than I was before!)
> > 
> > Other people have discussed many of the issues that I would have
> > raised if I'd participated all along, but not all of them.  I won't
> > summarize the whole discussion, but just mention the things I think
> > it's important not to overlook.  I don't really have anything to say
> > about most of the implementation details.  Only the last point or two
> > are issues about the changes being considered for 2.19.
> > 
> > * Lazy allocation is an explicit feature of the TLS ABI, not an
> >   incidental detail.  The wisdom of the feature can be debated, but
> >   the compatibility requirements are clear.
> 
> Yes, yes, and yes. :-)
> 
> >   It's a regression if this scenario stops working:
> >   1. Start a thousand threads
> >   2. dlopen a module containing __thread char buf[100 << 20];
> >   3. Start another thousand threads
> >   4. Call into the module on one thread so it uses its buf.
> >   5. Start a third thousand threads
> >   Now you should have 3000 threads but not 3000*100M memory use.
> >   (Here I mean address space reservation, regardless of consumption
> >   of machine resources, VM overcommit, etc.)
> >
There are similar scenarios that act as time bombs. If the relevant module is a profiler that touches TLS only when the user requests profiling, and then accesses it from all threads, the better case is a segfault. In the worse case the program's memory consumption occasionally spikes and an administrator has to kill it periodically without ever learning the reason.

A second scenario: a module uses a big TLS area as temporary storage in several long-running threads, and you get a big memory leak when another thread needs the same module for an unrelated long-running computation.

It is a pick-your-poison situation; nothing here can substitute for a knowledgeable user.
 
> >   At least in the case of an existing binary dlopen caller (which
> >   could actually be either in an executable or in a DSO) and an
> >   existing binary module loaded by that dlopen, such a regression is
> >   an ABI break and cannot be tolerated.
> 
> I don't see how dlopen failing, and reporting the failure, could be an
> ABI break. It may be a "feature" regression, but the ABI contract is
> not broken.
> 
If that is easy to support, we could do it. However, we should deprecate the old behaviour, as such uses are likely subject to bitrot.

> >  Either you preallocate the memory
> >   (eager use of address space, if not necessarily actual storage) or
> >   attempting to allocate it later might fail.  Hence it must be an
> >   explicit choice between the two.  That choice might be at the
> >   granularity of the whole implementation, as in musl, or all the way
> >   down to the granularity of an individual TLS-containing module or
> >   individual module-loading call.  Since glibc has a compatibility
> >   requirement to support lazy allocation, the only possibilities for
> >   the contrary choice are at smaller granularities.
> 
> My preference would be for the granularity to be the symbol version
> level, i.e. deprecate lazy allocation for newly-built applications.
> For 64-bit targets, I hardly even consider this a regression; if you
> have overcommit enabled in the kernel (which is still the default),
> allocations are not going to fail for lack of physical storage, and
> exhausting virtual address space before you exhaust thread id space
> (kernel tids are either 29- or 30-bit; I forget which) is virtually
> impossible.
>
For simplicity, per-symbol-version granularity would be desirable.

From a performance perspective, per-module or coarser granularity is desirable. It would allow a single allocation per module per thread, with symbols referenced by adding a static offset.

> However I understand that others may want a finer granularity.
> 
> > * Eager allocation could be a new option, and could even be a new
> >   default.  (What the default should be is a separate debate that does
> >   not need to begin now.)
> 
> If it's an option, I would prefer it to be default. The main
> motivation of this preference is to discourage developers from
> unintentionally making DSOs that fail without lazy allocation.
> 
> > ** e.g. A new DF_1_* flag and -z option for a DSO to request it.
> > *** Could be made default for newly-built DSOs.
> > ** New dlopen flag bits to request it.
> > *** Could be made default for newly-built dlopen callers (i.e. new
> >     symbol version of dlopen).
> 
These could come with adding a tls3 symbol version that would also improve the performance of dynamic TLS. There is still a lot of room for improvement.

The main improvement is eliminating the function-call overhead; with a bit of care this could be done even for lazy allocation. It could also be made ABI-compatible with older versions if desired.

For architectures that lack a register for the TCB, we could convince the kernel to implement a memory equivalent: a page mapped differently on each core, whose first 8 bytes the kernel saves and restores on context switch.

> > * In implementing eager allocation when multiple threads already
> >   exist, it is theoretically possible to do all or almost all of it
> >   asynchronously (i.e. all work done inside the dlopen call on the
> >   thread that called it).  It's trickiest, or perhaps impossible, to
> >   do the final step if the DTV needs to be expanded, from another
> >   thread.  But there is not really any good reason to do a lot from
> >   other threads.  Rich Felker described the most sensible
> >   implementation strategy: do all the allocation in dlopen, but only
> >   actually install those new pointers synchronously in each thread,
> >   inside __tls_get_addr.
> 
Are dlopen/pthread_create really functions where you see thread contention?

You could have a lock shared by pthread_create and dlopen that protects modification of the pointers. That would allow installing them asynchronously, since they will not be accessed before dlopen returns.

> One issue that might need some more consideration is freeing of the
> memory. My implementation in musl doesn't have to worry about that
> (although it does "leak" some memory if lots of threads were running
> at dlopen-time and then exit) because dlopen is permanent (dlclose is
> a no-op). But on glibc you want to ensure that both assigned and
> unassigned eager allocations get freed at some point if the DSO is
> closed or the thread exits; otherwise there are long-term
> dlopen/dlclose scenarios that will leak memory over time.
> 

One problem with large TLS allocations is that they 'leak' memory when they are used only during a small part of a long-running thread's lifetime.

As far as caching these is concerned, we already cache thread stacks, which are quite large.

Thread exit could be handled by the same logic as pthread_key_create destructors; dlclose destructors would do almost the same thing.

> I don't remember right off if this was discussed at all so far.
> 
> > * The main request for async-signal-safe TLS use is satisfied by "fail
> >   safe" semantics that preserve lazy allocation semantics: if the
> >   memory is really not available, then you crash gracefully in
> >   __tls_get_addr.  (That is, as much grace as abort, as opposed to the
> >   full range of "undefined behavior" or anything like deadlock.)
> 
> I don't see how this is affected. "Graceful" crash is trivial to
> provide.
> 
> > * How to find all memory containing direct application data is a de
> >   facto part of our ABI.  By "direct" I mean objects that the
> >   application touches itself.  That includes __thread variables just
> >   as it includes global, static, and auto variables.  It excludes
> >   library-maintained caches and the like, but includes any user data
> >   that the public API implies the library holds onto, such as pointers
> >   stored by <search.h> functions.
> > 
> >   This is a distinct issue from the general subject of "using an
> >   alternate allocation mechanism for memory" that Carlos mentioned.
> >   If libc changes how and where it stores its own internal data, that
> >   does not impinge on anything that is a de facto part of the ABI.  If
> >   libc changes how and where it stores application TLS data or other
> >   things in the aforementioned category, that is another thing entirely.
> > 
> >   I mentioned ASan as just one example of the kinds of things that
> >   might care about these aspects of the de facto ABI.  Things like
> >   ASan and conservative GC implementations are the obvious examples.
> >   But the fundamentals of conservatism dictate that we not make a
> >   priori assumptions about what our users are doing and what matters
> >   to them.  As with all somewhat fuzzy aspects of the ABI, there will
> >   be a pragmatic balancing test between "I was using that, you can't
> >   break it!" and, "You were broken to have been relying on that."  But
> >   we must consider it explicitly, discuss it pragmatically, and be
> >   circumspect about changes, especially the subtle ones.  The change
> >   at issue here is especially subtle in that it could be a silent time
> >   bomb that does not affect anybody in practice (or that nobody
> >   realizes explains strange new flakiness they experience) for
> >   multiple release cycles.  For example, if before the change a
> >   __thread variable (in a dynamic TLS module) sometimes was the only
> >   root holding a GC'able pointer and the GC noticed it there, but
> >   after the change the GC doesn't see that root.  If this bug is
> >   introduced tomorrow, it could be a long time before the confluence
> >   of when collections happen, whether other objects hold (or appear to
> >   hold) the same pointer, and the effects of reclamation, add up to
> >   make someone experience a failure they notice.
> > 
As I already said, it is better to fix the root cause than the symptoms, here by making malloc signal-safe in general. That is easy to add, with a performance impact comparable to the consistency checks already there.

Then we would need to add a requirement that alternative allocators be signal-safe, which could be done by providing a wrapper library.

> >   How to find threads' stacks and static TLS areas is already
> >   underspecified (improving that situation is a subject for another
> >   discussion).  But even for that, we would be quite circumspect about
> >   making a change that could break methods existing programs are using
> >   to acquire that information.
> 
> Could you elaborate on what methods existing programs are using?
> 
snip

> > I have no great quarrel with the thoroughness or conservatism of the
> > vetting of the implementation details or first-order ABI issues of
> > what's gone in.  (I am not entirely sanguine about all that, but close
> > enough that I've decided not to participate in the detailed review.)
> > But the mere fact that after a few months and >100 messages of
> > I'm the first to raise these subtleties (that I really thought would
> > have been fairly obvious to people here) gives me great pause about
> > the whole endeavor.
> 
> I don't think you're the first; most of the issues you raised are
> things I considered obvious, and I thought they were at least
> mentioned. The matter of GC roots and ASan might not have been
> covered, but in fairness, it's not reasonable to expect everyone to be
> an expert on every way some third-party software is abusing glibc
> internals. One thing that would be nice to come out of this would be
> if we could arrive at some sort of friendly procedure for third-party
> projects wanting/needing this kind of poking-at-internals access to
> contact the glibc team, explain the need, and work out whether there's
> any way a reasonable public interface can be provided.

Like this feature request:

https://sourceware.org/bugzilla/show_bug.cgi?id=16291


> Even if they
> couldn't use the public interface immediately (e.g. need to support
> old versions in the wild), they could at least have their software
> ready to use public interfaces for the future, so that their "poking
> at internals" would only need to be compatible with a finite set of
> past releases rather than an "infinite set" of future releases.
> 

> > As I said, I'm not specifying any conclusions.  I'm fairly confident
> > we can find a middle road that is appropriately conservative while
> > offering improvement for the pain point.  But we have yet to even
> > begin discussing what IMHO should be considered a major obstacle to
> > making this change while keeping with our conservative principles.
> 
We should know where the technical debt lies and what is valuable; we need to know what problems past decisions caused, so that we can fix them and avoid similar ones in the future.

