This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races


On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> Before sending an updated version of
> https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html
> i try to provide high level description of the issues (that can go
> into the concurrency notes of the patch) with the hope that agreeing
> on the design will allow less patch review iterations.

That's a good approach.  Thanks.

> dlopen (operations in program order):
> 
> 	Load a module and its dependencies, assign a modid to each
> 	while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv
> 	and slotinfo index of the module).
> 
> 	Add each module to GL(dl_tls_dtv_slotinfo_list) using
> 	_dl_add_to_slotinfo.  Assign the next generation count to
> 	these slotinfo entries.
> 
> 	Increment the GL(dl_tls_generation) counter.
> 
> 	Update the dtv of the current thread following the slotinfo
> 	changes by calling _dl_update_slotinfo (only resizes dtv and
> 	fills it with unallocated slots, actual tls allocation and
> 	initialization is deferred).
> 
> pthread_create:
> 
> 	Call _dl_allocate_tls_init to allocate dtv and tls for the
> 	new thread to be created, covering all currently loaded modules.
> 	("currently loaded" means up to a generation count, the dtv and
> 	tls for modules with larger generation count are allocated lazily
> 	on first access).
> 
> Other apis:
> 
> 	dlclose can write the relevant globals too, tls access and
> 	dl_iterate_phdr read them (i focus on dlopen and pthread_create
> 	first, similar concerns affect these apis).  dlmain can write
> 	these globals too, but i think that it is single threaded
> 	(early startup, no pthreads yet, can audit modules change that?).
> 
> In short: relevant globals (dl_tls_max_dtv_idx, dl_tls_generation,
> dl_tls_dtv_slotinfo_list entries) are written by dlopen and
> pthread_create reads them when initializing the dtv of a new thread.
> 
> Current concurrency design:

Generally, I'd suggest to start one level of abstraction higher: What
are the abstract operations that we need to perform?  Which of these
need to appear to be atomic wrt. which other operations?  Which ordering
relations do need to hold in every execution?
(In contrast, you explain things below closer to the implementation
level (eg, this lock is acquired for these operations, etc.))

There are a couple of reasons why I'm suggesting that.  First, it helps
people that are not familiar in detail with all of the requirements
(such as myself :).

Second, and perhaps that's more immediately importantly from your
perspective, it will likely guide us to what we need to do in the
implementation.  In my experience, the tighter you get in your
understanding of a concurrency problem on the abstract level, the better
you can understand what the options are and how the solution should look
like.
For example, assume some operation needs to be atomic.  A natural
reaction is to use a lock to guarantee mutual exclusion -- but do we
really need a lock?  One question to ask then is whether other
operations really depend on all the details of the operation, or whether
it can be reduced to something smaller and its effects "replayed" at
other threads?  If so, then one can often announce that the operation
has happened in some other mechanism that summarizes the current state
of the world.  Otherwise, is it actually important which thread executes
the operation?  If it is, then we may actually need a lock because the
abstract essence of the concurrency problem is that thread B has to wait
for thread A to perform the operation.

Third, this abstract description gives the code a vocabulary that can
then be used to cleanly describe why the actual implementation is
supposed to work.

In your v3 patch from January, you had something in that direction: a
list of readers and writers.  I'd suggest to take this one step further.
For example, there are certain operations like pthread_create and
sub-operations these have to perform.  Then there's code that uses TLS,
dlopen calls, etc.  The language specs and POSIX make requiremets on
these, which already restricts the valid executions that we need to
support.  That's the start of the requirements.
Then there's requirements or a wish-list of what we'd like to have, such
as AS-Safe.
Then we can break them down into the abstract steps we need to perform,
such as initialize TLS, or memory reclamation.  Combined with the
constraints we have on these (eg, uses of a piece of memory need to
happen before the memory is released), this gives us a high-level plan
of the concurrency problem.
We can then iterate and refine that until we have something that's
implementable.  Doing that iterating on an abstract level will allow us
to see possibilities to apply patterns such as a need for mutual
exclusion, a need for a simple ordering of events, a need to do atomic
snapshots, etc.

If you skip the abstract level, it becomes too complex quickly, and you
run into the risk of inventing a fully custom concurrent algorithm,
instead of being able to apply known techniques (eg, locking is one, but
also only just one).  See some of my questions below for examples.

Note that by "abstract", I don't mean vague or so high-level that it
lacks important detail -- but rather something that has been reduced to
the core of what we need to reason about when treating this as a
concurrency problem.  This is why I've talked about atomicity and
ordering requirements.

> dlopen holds GL(dl_load_lock) while writing the relevant globals (using
> volatile loads and stores).  The lock serializes dlopen calls, but there
> is no synchronization with pthread_create (other than mmap and mprotect
> calls which are likely synchronized but not part of the c11 memory model).
> 
> pthread_create accesses the globals without locks and uses volatile loads:
> dtv size and valid entries in the slotinfo list are determined based on
> GL(dl_tls_max_dtv_idx) with the following asserted assumptions:
> (a) slotinfo list have at least GL(dl_tls_max_dtv_idx) entries set up,
> (b) slotinfo entries have generation count <= GL(dl_tls_generation).
> These would be wrong even if all memory accesses were mo_seq_cst: slotinfo
> entries are added after modid is assigned and the generation counter is
> incremented even later.
> 
> Current design bugs in glibc:
> 
> (0) There are data races on the relevant globals. And there are asserts
> that are not guaranteed (a,b above). This is what i was trying to fix.
> 
> (1) Any access to link_map struct is invalid if neither GL(dl_load_lock)
> nor GL(dl_load_write_lock) are held, because a concurrent dlclose may
> free the map, this affects:
> 	pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
> 	__tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)

You are describing the current issue in the implementation, which is
perfectly fine.  What is the underlying synchronization problem we need
to solve?  Do we just need to reclaim memory after all of it's uses?
Locking may get this done, but other techniques such as hazzard pointers
might be better, in particular if we can reclaim memory lazily and don't
want to use blocking synchronization.

> (2) dlopen can reuse modids of dlclosed modules (and dlclose can decrement
> GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be
> inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) are
> held (e.g. concurrent dlclose and dlopen between reading the .gen and
> .map members of the slotinfo entry), this affects:
> 	pthread_create (_dl_allocate_tls_init, reads .gen and .map)
> 	__tls_get_addr (_dl_update_slotinfo, reads .gen and .map)

Can we version the modids?  Or do we generally need a mechanism for
atomic snapshots?  What are the abstract operations and the requirements
for them?

> (3) tls access must be as-safe, but currently tls is allocated at first
> access if the containing module was not loaded at the time the accessing
> thread was created.  This means __tls_get_addr may malloc or lock
> GL(dl_load_lock) or modify the thread local dtv. This can cause deadlocks,
> oom crashes or memory corruption if it is in a signal handler (malloc in
> signal handler is also problematic for malloc interposition).

Are signal handlers allowed to access TLS?  I don't know what POSIX
requires (or doesn't disallow), but C++, for example, is moving to not
allowing TLS (if I remember the draft wording correctly).

> (4) allocation failures can oom crash in pthread_create and dlopen
> (_dl_resize_dtv) and at tls access (various mallocs) or may leave
> inconsistent state behind in dlopen (_dl_add_to_slotinfo or any
> later call in dlopen that fails: tls changes are not rolled back).

Do we need to be able to publish changes atomically?  If so, there are
known techniques for that.

Do we perhaps can deal with most of all of this if we had simple
transactions?  If these aren't arbitrary operations but in some way
limited, and can be used for lots of the operations we need to
synchronize, this may perhaps be the easiest way to implement all of
this.  Once we know about the abstract operations, we can better assess
whether that would be a helpful approach.

> (5) GL(dl_tls_generation) is not checked for overflow consistently and
> slotinfo entries can get assigned an overflowed generation count, its type
> is size_t and both dlopen and dlclose increment it, so overflow may be a
> concern on 32bit targets.

Can the size be extended on 32b targets?  If so, we can build a
monotonically increasing, larger atomic counter on 32b targets too.

> (6) dlclose holds GL(dl_load_lock) while running ctors so if those ctors
> wait for anything that may take the same lock there is a deadlock.

Do you mean dtors?

> Proposed fix(es):
> 
> (0) can be fixed by adding atomics and removing invalid asserts:
> It can be guaranteed that whatever GL(dl_tls_generation) is read in
> pthread_create, all slotinfo entries with that generation will be read
> too and the read GL(dl_tls_max_dtv_idx) is no less than the modids of
> those entries.  Then walking the slotinfo list until the end and only
> considering ones up to the observed generation count would set up the
> dtv consistently without races. Assuming there is no concurrent dlclose.
> (I think my previous patch did this.)
> 
> OTOH (1) and (2) clearly point to the direction that pthread_create
> should take the lock, in which case thread creation performance may go
> down a bit and the influence of (6) increases (but that is broken anyway)
> and we didn't do anything about __tls_get_addr. (I don't see other reason
> to avoid taking the lock).

Before focusing on taking the lock, I'd suggest to look at the bigger
picture (ie, the abstract operations and atomicity and ordering
requirements).
It can be possible that a clean redesign is actually simpler than trying
to fix a system that is already complex and which we don't understand
100%.

> Fixing (3) or (6) would need significant redesign.
> 
> Fixing (4) and (5) may be possible independently.
> 
> I think these are the immediately available options:
> 
> - lock free pthread_create that is broken with concurrent dlclose,
> - pthread_create that locks and may deadlock when synchronized with a ctor,
> - or current brokenness (concurrent pthread_create and dlopen asserts).

Looking at the abstract operations will allow us to make such
conclusions.  Also, if we document these, it will be much clearer to
future developers why we did what we did (ie, the reasoning behind
options and decisions such as the ones you list above).

I hope this helps.



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]