This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix


On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote: 
> On 22/04/15 18:14, Szabolcs Nagy wrote:
> > On 22/04/15 17:08, Torvald Riegel wrote:
> >> Given that you have looked at the code, could you give a rough estimate
> >> of how much churn it would be to make the TLS code data-race-free?
> > 
> > elf/tlsdeschtab.h:
> > _dl_tlsdesc_resolve_early_return_p
> > sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> > _dl_tlsdesc_resolve_rela_fixup
> > _dl_tlsdesc_resolve_hold_fixup
> > 
> 
> i think these all need to use atomics for accessing
> td->entry for c11 correctness, but it's hard to tell
> what's right since c11 only talks about synchronizing
> with c code, not asm.

That's true, but if we can require the asm code to be conceptually
equivalent to some C11 code, then we can specify and document the
concurrent algorithm or synchronization scheme based on C11 as
foundation, and either use C11 atomics to implement it or use code the
compiler would generate for C11 atomics if implementing in asm.

That doesn't cover cases where in the asm, we would want to use
synchronization that can't be represented through C11 atomics, or that
is incompatible with C11 atomics.  But so far it doesn't look like this,
or have you observed such cases?
(consume MO might be a case, but I guess for that we could still wave
our hands and pretend compilers would simply generate the intuitively
expected code.)

> the td->entry can be in 3 states during lazy resolution:
> 
> * init: retries the call of entry until caller==entry
>   (using a double-checked locking mechanism, then it
>   - grabs GL(dl_load_lock)
>   - changes the entry to the hold state
>   - does the resolver work
>   - changes arg and entry to the final value
>   - releases the lock to wake the waiters
>   - calls the new entry)
> * hold: waits on the GL(dl_load_lock)
>   (then retries calling the entry when it's woken up)
> * final state: (static, dynamic or undefweak callback)
>   calculates the tls offset based on arg
>   (currently without any locks which is bad on weak
>   memory models)

Could you point me to where (or confirm that) a new load of td->entry
happens if caller != td->entry?  In the asm bits you posted so far, it
seems that if the call to *td->entry returns, actual TLS data is loaded.
For example, you posted this:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg


I'm asking because in tlsdeschtab.h we have:

static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
  if (caller != td->entry)
    return 1;

  __rtld_lock_lock_recursive (GL(dl_load_lock));
  if (caller != td->entry)
    {
      __rtld_lock_unlock_recursive (GL(dl_load_lock));
      return 1;
....

If _dl_tlsdesc_resolve_early_return_p returns 1,
_dl_tlsdesc_resolve_rela_fixup just returns.  If the next thing we do is
load td->arg, we need another acquire-MO load for td->entry in
_dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
non-atomic access, ignoring DRF) would only be sufficient if all that
caller != atomic_load_relaxed (&td->entry) leads to is further busy
waiting (e.g., calling *td->entry again).  But if we really expect
caller != td->entry to have meaning and tell us something about other
state, we need to have another barrier there to not have a typical
double-checked-locking bug.

(This shows another benefit of using atomics for all concurrent
accesses: It forces us to make a conscious decision about the memory
orders, and document why.  If a relaxed-MO access is sufficient, it
should better have a clear explanation...)

> 
> the code for tls access is generated by the compiler so
> there is no atomics there: the loaded entry can be in
> init, hold or final state, the guarantee about the state
> of arg must come from the target arch memory model.

I understood that there are no C11 atomics used in the asm
implementation bits.  I'm thinking about abstracting the synchronization
scheme we want to use in the implementation, expressing it through C11
atomics; the asm would just do something equivalent (see my comments
about the atomics implementation being effectively part of the ABI).

> and to answer my earlier question about the hold state:
> it is needed to avoid the spinning in the double-checked
> locking in init.
> 
> >> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> >> changes and ignoring some additional comments, very similar to the
> >> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
> >> using the C11 model) that can be used on several archs?  This would
> >> certainly decrease maintenance overhead.
> >>
> > 
> > should be possible i think: these functions are called
> > from asm to do the lazy resolution
> > 
> > but i have to check the arch specific details.
> 
> looking at this, but it seems to be significant work:
> 
> * i386 and arm determine the caller differently
> * i386 uses special calling convention from asm
> * arm handles local symbols specially

Okay.  But do they still use the same abstract synchronization scheme,
and just differ in a few aspects?  Or are the synchronization schemes
significantly different?

> i think the way to move forward is

I think step 1 should be to document the synchronization scheme on an
abstract, algorithmic level, using C11 atomics in the pseudo-code.  We
want to document the abstract level because it's typically much easier
to reason about the abstraction than the concrete implementation in case
of concurrent code, and because confirming that an implementation
matches the abstraction is typically also much easier than inferring the
abstraction from a concrete implementation.

So, for example, if we use a kind of double-checked locking here,
document that, show the pseudo code with C11 atomics, and then refer
back to this when documenting why certain memory orders on atomic
operations (or asm instructions) are sufficient and equivalent to the
abstract algorithm.  In our case, this would mean saying that, for
example, we need load Y to use acquire MO so that it synchronizes with
release-MO store X, making sure that a load B that is sequenced after Y
reads from the store A that is sequenced before X (which is a core part
of double-checked locking, so could also be done at the abstract
algorithm docs and then just referred back to from the concrete
implementation code).

> * fix the correctness bug now on aarch64

OK.  When doing that, check and document equivalence to the abstract
synchronization scheme.  When you do that, please use atomic_* for all
synchronization that you add or change.
It might be easiest to also change all atomic accesses that you checked
and documented to using atomic_*; this isn't much work compared to the
checking, and it becomes obvious what has been checked.

> * decide if lazy static tls resolver is worth it
> * do the refactoring so tlsdesc is common code
> * use c11 atomics

I'd prefer C11 atomics to be used right away when fixing things;
however, if you mean transforming and checking all the other TLS code
unrelated to the bug you're looking at, then I agree that this can be
done incrementally, after you have fixed this bug.

> the fix for arm is a lot harder (because atomics are
> different on different versions of arm, an ifunc based
> dispatch could in principle solve the dmb vs no-dmb
> issue for the lazy resolvers).

Is the synchronization used in the asm bits on the different versions of
arm different from what C11 atomics would look like on those versions of
arm?  Or are we just dealing with different ways to implement acquire
loads, for example?

