This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS


On Wed, 2015-06-03 at 17:24 -0300, Alexandre Oliva wrote:
> On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > So you think a reload by the compiler would be bad.
> 
> Yeah.  It's a double-checked lock entry with sugar on top.  We only need
> to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
> it is resolved, we want different behavior depending on whether it is
> FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).
> 
> > This can only be bad if there is concurrent modification, potentially
> 
> Concurrent modification is made while holding the lock.

Unless *all* accesses happen while holding the lock, there are pairs of
accesses that are not ordered by happens-before; if there is a write in
any of those pairs, you must use atomics to prevent a data race.

(If happens-before is ensured through *other* synchronization (e.g.,
another lock), then this needs to be considered too.)

Applied to a double-checked locking pattern, this means that all data
that is accessed outside the critical section and is also checked or
modified inside the critical section must use atomic accesses.

> It shouldn't happen in the same thread, at least as long as TLS is
> regarded as AS-Unsafe, but other threads could concurrently attempt to
> resolve a module's l_tls_offset to a static offset or forced dynamic.

OK.  So we're dealing with inter-thread concurrency here.

> > Thus, we need the atomic access
> 
> I'm not convinced we do, but I don't mind, and I don't want this to be a
> point of contention.
> 
> > AFAIU, you seem to speak about memory reuse unrelated specifically
> > to this particular load, right?
> 
> Yeah, some other earlier use of the same location.
> 
> > But that sounds like an issue independently of whether the specific
> > load is an acquire or relaxed load.
> 
> Not really.  It is a preexisting issue, yes, but an acquire load would
> make sure the (re)initialization of the memory into a link map,
> performed while holding the lock (and with an atomic write, no less),
> would necessarily be observed by the atomic acquire load.  A relaxed
> load might still observe the un(re)initialized value.  Right?

I can't follow you here.

One thing to note is that acquire loads synchronize with release stores
(or stronger) on the *same* memory location.  An acquire load does not
synchronize with operations on a lock, unless the acquire load peeks
into the lock and does an acquire load on the lock's state or such.

Therefore, when you think about which effect an acquire load has,
consider which release store you are actually thinking about.  An
acquire operation does not have an effect on its own, only in
combination with other effects in the program.  This is also why we want
to document which release store an acquire load is supposed to
synchronize with.

Thus, which release store are you thinking about in this case?

> > We established before that you want to prevent reload because there are
> > concurrent stores.  Are these by other threads?
> 
> Answered above.
> 
> > If so, are there any cases of the following pattern:
> 
> Dude.  Of course not.  None of them use atomics.  So far this has only
> used locks to guard changes, and double-checked locks for access.

Think conceptually, and consider that atomics don't synchronize with
locks unless the atomics access the lock state themselves.  Thus, the
atomics in there are the atomics inside and outside of the critical
section(s).

> > storing thread;
> >   A;
> >   atomic_store_relaxed(&l_tls_offset, ...);
> 
> > observing thread:
> >   offset = atomic_load_relaxed(&l_tls_offset);
> >   B(offset);
> 
> > where something in B (which uses or has a dependency on offset) relies
> > on happening after A?
> 
> Let's rewrite this into something more like what we have now:
> 
>   storing thread:
>      acquire lock;
>      A;
>      set l_tls_offset;
>      release lock;
> 
>   observing thread:
>      if l_tls_offset is undecided:
>        acquire lock;
>        if l_tls_offset is undecided:
>          set l_tls_offset to forced_dynamic; // storing thread too!
>        release lock;
>      assert(l_tls_offset is decided);
>      if l_tls_offset is forced_dynamic:
>        dynamicB(l_tls_offset)
>      else
>        staticB(l_tls_offset)
> 
> The forced_dynamic case of B(l_tls_offset) will involve at least copying
> the TLS init block, which A will have mapped in and relocated.  We don't
> take the lock for that copy, so the release after A doesn't guarantee we
> see the intended values.  Now, since the area is mmapped and then
> relocated, it is very unlikely that any cache would have kept a copy of
> the unrelocated block, let alone of any prior mmapping in that range.
> So, it is very likely to work, but it's not guaranteed to work.

Try to ignore what you think a particular cache implementation may or
may not be likely to do.  Instead, please think about whether the
program enforces the happens-before relationships it relies on.

So, the effects in A are the mmap and the relocation of the TLS init
block (which involves stores to the memory locations copied by B).  B
needs both of those effects to happen-before it in order to work
correctly.  A can be executed by a different thread than the thread
executing B.  Correct?

In this case, A needs to happen-before B, and you need a release store
for l_tls_offset and an appropriate acquire load of l_tls_offset on the
observer side.  Those establish the happens-before between A and B.

Alternatively, A needs to (conceptually) include issuing a release fence
after the effects, so that a relaxed store is sufficient on the storing
side.
For mmap specifically, we may be able to assume that every mmap
implementation is a syscall, that the kernel performs the mmap (i.e.,
the effect), and that the return from the kernel always acts as a
release fence.  However, we then need to document this for mmap if we
want to rely on it.
And on the observer side, we'd still need an acquire load, unless we can
put an acquire fence somewhere else.

Note that whether the storing thread has acquired the lock or not
doesn't matter in the analysis of this case.

> As for the static TLS case of B(l_tls_offset), the potential for
> problems is similar, but not quite the same.  The key difference is that
> the initialization of the static TLS block takes place at the storing
> thread, rather than in the observing thread, and although
> B(l_tls_offset) won't access the thread's static TLS block, the caller
> of __tls_get_addr will.  (and so will any IE and LE TLS access)

OK. So let's consider callers to be part of B(l_tls_offset).

> Now, in order for any such access to take place, some relocation applied
> by A must be seen by the observing thread, and if there isn't some
> sequencing event that ensures the dlopen (or initial load) enclosing A
> happens-before the use of the relocation, the whole thing is undefined;
> otherwise, this sequencing event ought to be enough of a memory barrier
> to guarantee the whole thing works.  It's just that the sequencing event
> is not provided by the TLS machinery itself, but rather by the user, in
> sequencing events after the dlopen, by the init code, in sequencing the
> initial loading and relocation before any application code execution, or
> by the thread library, sequencing any thread started by module
> initializers after their relocation.

If that's the situation in the static case, it seems we do not need an
acquire load nor release store because there is either no concurrent
access, naturally, or the user has to ensure happens-before.  We can say
something like this (feel free to fix the TLS terminology):

In the static case, we can assume that the initialization/relocation of
a TLS slot [I mean A] always happens-before the use of this TLS slot:
(1) in case of dlopen, we require the user to ensure that dlopen
happens-before other threads execute the new TLS;
(2) in case of initial loading, this will happen-before any execution of
application code; and
(3) in case of [module initializers case, don't know how to phrase that
properly...]
Therefore, relaxed MO is fine for this load and the store in [the
storing side function].

This documents the assumptions the function is making, and thus what
other things it relies on.  Somebody reading the code of the function
sees the comment, knows why there's just a relaxed MO load, and can
connect it to the rest of the mental model of TLS synchronization.  We
do need to document this to be able to understand how the
synchronization is intended to work, and to be able to cross-check this
with the assumptions and documentation of other related functions.

> Which means to me that a relaxed load might turn out to be enough, after
> all.
> 
> > I'm trying to find out what you know about the intent behind the TLS
> > synchronization
> 
> FWIW, in this discussion we're touching just a tiny fraction of it, and
> one that's particularly trivial compared with other bits :-(
> 
> > Does dlopen just have to decide about this value
> 
> It does tons of stuff (loading modules and dependencies, applying
> relocations, running initializers), and it must have a say first.  E.g.,
> if any IE relocation references a module, we must make it static TLS.
> Otherwise, dlopen may leave it undecided, and then a later dlopen might
> attempt to make it static TLS again (and fail if that's no longer
> possible), or an intervening tls_get_addr may see it's undecided and
> make the module's TLS dynamic.
> 
> > I disagree.  You added an atomic load on the consumer side (rightly
> > so!), so you should not ignore the producer side either.  This is in the
> > same function, and you touch most of the lines around it, and it's
> > confusing if you make a partial change.
> 
> You're missing the other cases elsewhere that set this same field.

What do you mean?  How is it any better if you don't fix it properly in
the functions you have looked at and modified, just because there are
more problems elsewhere?

> > Let me point out that we do have consensus in the project that new code
> > must be free of data races.
> 
> Is a double-check lock regarded as a race?  I didn't think so.

*Correct* double-checked locking isn't.  But correct double-checked
locking doesn't have data races, and this code here does (unless there's
some constraint on execution that I'm not aware of and you haven't told
me about).

Look at your example code above.  "set l_tls_offset;" in the storing
thread can be concurrent with the first access of "l_tls_offset" (or the
ones in the branches after the critical section).  Concurrent here means
not ordered by happens-before (unlike in the static case, because there
we rely on other happens-before, as you point out above).  One of the
concurrent accesses is a store.  This is a data race.

> So, I'm
> proposing this patch, that reorganizes the function a bit to make this
> absolutely clear, so that we can get the fix in and I can move on,
> instead of extending any further the useless part of this conversation,
> so that we can focus on the important stuff.

Adding data races, or not fixing data races when you touch the code, is
not correct -- thus discussing whether you have a data race in your
patch is absolutely not useless.

We agreed on the no-data-races rule for a reason.  Please adhere to this
like you do for other project rules.  If you disagree with the rule,
bring this up again, as a separate topic.  Until that happens, I'd like
to not have to argue over this again and again.

> How's this?

I'll comment on the revised patch.

