This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS

From: Torvald Riegel <triegel at redhat dot com>
To: Alexandre Oliva <aoliva at redhat dot com>
Cc: Andreas Schwab <schwab at linux-m68k dot org>, libc-alpha at sourceware dot org
Date: Sun, 07 Jun 2015 22:39:07 +0200
Subject: Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
Authentication-results: sourceware.org; auth=none
References: <orvbf5ffyt dot fsf at livre dot home> <1433326788 dot 21461 dot 81 dot camel at triegel dot csb> <ora8whexn9 dot fsf at livre dot home> <1433344426 dot 21461 dot 202 dot camel at triegel dot csb> <orwpzkee1h dot fsf at livre dot home> <1433419889 dot 21461 dot 294 dot camel at triegel dot csb> <ork2vidb0s dot fsf at livre dot home> <1433500798 dot 21461 dot 332 dot camel at triegel dot csb> <orvbf2aw5r dot fsf at livre dot home>

On Fri, 2015-06-05 at 14:43 -0300, Alexandre Oliva wrote:
> On Jun  5, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> >> Is the l_tls_offset field the data you're talking about?
> 
> > Yes, in this case.
> 
> > (There may be other data that is initialized once inside the critical
> > section and then accessed outside.  Accesses to those data can use
> > nonatomic accesses if data like l_tls_offset makes them data
> > data-race-free.)
> 
> Ok, this means we can take them out of the analysis and focus on
> l_tls_offset alone.  Let's do that, shall we?

You can't just ignore them because they can put requirements on the
flag.  *Iff* they don't exist, we can relaxed the synchronization on the
flag.

> 
> Consider this:
> 
> enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };
> 
> void *init () {
>   static enum spin electron;
>   take lock;
>   electron = UNKNOWN;
>   release lock;
>   return &electron;
> }
> 
> bool make_positive (void *x) {
>   enum spin *p = x;
>   take lock;
>   if (*p == UNKNOWN)
>     *p = POSITIVE;
>   release lock;
>   return *p == POSITIVE;
> }
> 
> bool positive_p (void *x) {
>   enum spin *p = x;
>   if (*p != NEGATIVE) {
>     take lock;
>     if (*p == UNKNOWN)
>       *p = NEGATIVE;
>     release lock;
>   }
> 
>   return *p == POSITIVE;
> }
> 
> The program has to call init to get the pointer, so that happens before
> any uses of the pointer.

OK.  init sets the value unconditionally, which seems either
counterproductive, or init is called exactly once and that call happens
before any use by another thread.  If the latter (and that seems to be
the case as you indicate below):
(1) the critical section is not required
(2) we should use an atomic access for clarity of code (because we have
no atomic types), but it can be considered initialization too and then
our exception applies that we don't currently enforce atomic accesses
for initialization (a brief code comment is useful in this case
nonetheless).

> From then on, any threads may call
> make_positive or positive_p, however many times they wish, passing them
> the pointer returned by init.  init should not be called again.  There
> is no other data associated with the electron or the global lock (not
> declared here) that guards it.

OK.  Then this isn't really double-checked locking, but you're
implementing a CAS with locks.  (More generally, you are implementing
single-memory-location consensus.  See the single-memory-location state
machine comment you make below.)  One could argue that this is a
degenerated double-checked locking pattern, but double-checked locking
is really meant for things like pthread_once where there is other data
that has to be protected with the critical section.  If all one wants is
single-memory-word consensus, then that's what the very basic HW atomic
instructions already provide.
(This is why I asked whether you are just implementing a CAS in one of
my past emails.)

The fact that there is no other data associated with your state machine
must be pointed out in the comments.  It's often not the case that
something is indeed "freestanding", so if it is we want to briefly note
why (using a comment in the code).  We got similar cases wrong in glibc
quite often (e.g., the previously incorrect pthread_once
implementation), so such a code comment tells everyone we checked this
case, and why it's correct.

> The enum type is opaque, not visible to
> callers, and its alignment and size ensure it will always be loaded and
> stored atomically.

Careful here.  The alignment and size of the type may *allow* a compiler
(and our atomic operations) to actually work and make access atomic.
But the compiler is *not required* to do so if plain nonatomic accesses
are used; it could load/store byte-wise, it could reload, or it could
store speculative values.  Unless you use atomic accesses, there's no
guarantee that what looks like a single load or store in source code
actually ends up as exactly one atomic, full-size load or store in the
generated code.

Thus, if you'd add a comment on the type, you should say that it is
compatible (or sufficient for) atomic accesses.

> This is supposed to implement a state machine in which the spin starts
> as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
> won't ever change again.  make_positive will turn UNKNOWN spins into
> positive, whereas positive_p will turn them into NEGATIVE.

OK.  So, this establishes consensus for whether POSITIVE or NEGATIVE is
the final outcome.  (And I guess UNKNOWN is never exposed to other
code.)

> This is analogous to the code we have in place right now.
> 
> The change I'm proposing is an optimization to positive_p, that changes
> the following line:
> 
>   if (*p != NEGATIVE) {
> 
> to:
> 
>   if (*p == UNKNOWN) {
> 
> because we don't have to take the lock when *p has already advanced to
> POSITIVE.

OK.  Note that this was the clearest bit of your patch all way along.
It's everything around that (and the actual synchronization) that was
problematic / unclear.

> I have two questions for you about this proposed change:
> 
> Is it really that hard to see that the change is correct, safe, and a
> strict performance improvement?

The intent is good.

The synchronization has data races (and also had them before, but that's
no reason to keep doing the incorrect thing).

It is a performance improvement.  Using a CAS instead of a critical
section would make it even faster for the first accesses.
For example, make_positive could do that:

  /* Establish POSITIVE if no consensus yet.  See [...] for why
     relaxed MO is sufficient.  */ 
  spin s = atomic_load_relaxed (p);
  while (s == UNKNOWN)
    atomic_compare_exchange_weak_relaxed(p, &s, POSITIVE);

If you want to keep the locks, I suggest mentioning the equivalent CAS
anyway because it's conveys the intent more clearly in this case.

> What is the *minimum* set of atomic loads and stores that we have to
> introduce, replacing unadorned loads and stores, to make the whole thing
> compliant with our standards, should this be GNU libc code?

All accesses to *p must use relaxed atomic accesses (the initialization
may be an exception if the change complicates things there (e.g., if
it's a memset that also covers other memory locations), but if it's easy
it's better to use a relaxed-MO atomic access there as well for
clarity).

> >> > One thing to note is that acquire loads synchronize with release stores
> >> > (or stronger) on the *same* memory location.  An acquire load does not
> >> > synchronize with operations on a lock, unless the acquire load peeks
> >> > into the lock and does an acquire load on the lock's state or such.
> >> 
> >> > Therefore, when you think about which effect an acquire load has,
> >> > consider which release store you are actually thinking about.  An
> >> > acquire operation does not have an effect on it's own, only in
> >> > combination with other effects in the program.  This is also why we want
> >> > to document which release store an acquire load is supposed to
> >> > synchronize with.
> >> 
> >> > Thus, which release store are you thinking about in this case?
> >> 
> >> Nothing but l_tls_offset.
> 
> > I haven't see a release-MO atomic store to l_tls_offset yet.
> 
> Of course, there isn't any ATM.  What I'm saying is that l_tls_offset is
> the only memory location that's relevant in this situation.
> 
> The part of the dynamic loader that loads the module that defines the
> TLS variable is init, the part of the dynamic loader that resolves a TLS
> relocation in a way that requires the variable to be in static TLS is
> make_positive, and the part of tls_get_addr that I'm modifying in the
> patch is is_positive.

Just to clarify on previous comments you make: And init() always
happens-before make_positive or is_positive?

> Any static TLS offset maps to POSITIVE,
> FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN.
> 
> > Nor a release fence.
> 
> The release fences are the unlock operations after each and every one of
> the stores.

Hmm.  This still seems somewhat inconsistent.  You are arguing that you
do not need an acquire load on the consensus implementation above,
because l_tls_offset would stand on its own and program logic doesn't
need ordering dependencies between this consensus and anything else.
Yet you seem to think the release fence is necessary.  Do you think the
release fence is necessary for something, and if so, what is it and
where is the matching acquire?

Also note that an unlock operation on a lock is *not* guaranteed to have
the same effects as an atomic_thread_fence_release.  A compiler can
reorder atomics into a critical section (under certain conditions), but
it can't typically reorder to before an atomic_thread_fence_release
(there are corner cases though where it can merge with other accesses
before the fence).  The hardware may or may not treat unlocks (ie,
release stores on a *particular* memory location) and fences the same.

Follow-Ups:
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva

References:
- [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Torvald Riegel
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Torvald Riegel
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Torvald Riegel
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Torvald Riegel
- Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
  - From: Alexandre Oliva

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]