This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix


On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
> 
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory ordering guarantees:
> 
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
> 
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
> 
> The current glibc code (i386, x86_64, arm, aarch64) is
> 
>   td->arg = ...;
>   td->entry = ...;
> 
> the ARM memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: the x86 memory model does not help at the C code level, as
> the compiler can reorder non-aliasing stores).

Yes, this is not guaranteed to work.  This should thus be fixed for all
the architectures.
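
To illustrate the direction of such a fix (a rough sketch only, not the
actual glibc code; the struct below merely mirrors the td->arg / td->entry
pair quoted above), a release store on entry is what provides the required
ordering on every architecture, including against compiler reordering:

  #include <stdatomic.h>

  /* Hypothetical layout, only mirroring the td->arg / td->entry pair in
     the quoted code; glibc's real struct tlsdesc differs in detail.  */
  struct tlsdesc
  {
    _Atomic (void *(*) (struct tlsdesc *)) entry;
    void *arg;
  };

  /* Hypothetical final resolver function.  */
  void *final_resolver (struct tlsdesc *td);

  /* Write side: publish the final values.  The release store on entry
     ensures that a thread observing the new entry also observes the
     preceding store to arg; two plain stores, as in the current code,
     may be reordered by the compiler or by a weakly ordered CPU.  */
  void
  publish_descriptor (struct tlsdesc *td, void *final_arg)
  {
    td->arg = final_arg;
    atomic_store_explicit (&td->entry, final_resolver,
                           memory_order_release);
  }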

> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument (td->arg) is loaded inside that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.

Not strictly in a sequentially consistent way, I believe.  What you are
probably looking for (I'm not familiar with TLS in detail) is an
invariant like: if you read update X, you will also read update Y or a
more recent update to Y.
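
In C11 terms, that invariant is what a release store of td->entry on the
write side paired with an acquire load of td->entry on the read side
provides.  Purely for illustration (the real read side is the
compiler-generated sequence quoted below, not C; same hypothetical layout
as in the sketch above):

  #include <stdatomic.h>

  struct tlsdesc
  {
    _Atomic (void *(*) (struct tlsdesc *)) entry;
    void *arg;
  };

  /* Read side: if the acquire load observes the entry value written by
     the release store, the load of arg inside the called function is
     guaranteed to observe the arg written before that release store (or
     a more recent value).  */
  void *
  call_descriptor (struct tlsdesc *td)
  {
    void *(*fn) (struct tlsdesc *)
      = atomic_load_explicit (&td->entry, memory_order_acquire);
    return fn (td);  /* fn loads td->arg internally.  */
  }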

> The code in asm:
> 
>   ldr x1, [x0]    // load td->entry
>   blr x1          // call it
> 
> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
> 
> The branch (blr) does not provide a load-load ordering barrier, so the
> second load may observe an old arg even though the first load observed
> the new entry.  This happens on existing AArch64 hardware and causes
> invalid memory accesses due to the incorrect TLS offset.
> 
> Various alternatives were considered to force the load ordering in the
> descriptor function:
> 
> (1) barrier
> 
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
> 
> (2) address dependency (if the address of the second load depends on the
> result of the first one, the ordering is guaranteed):
> 
> entryfunc:
>   ldr x1,[x0]
>   and x1,x1,#8
>   orr x1,x1,#8
>   ldr x0,[x0,x1]

The address dependencies could be useful in terms of performance, but
the disadvantage that I see is that there's currently no useful way to
use them in C code (C11's memory_order_consume is broken).
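
To spell that out: the C11 way to express such a dependency would be
memory_order_consume (sketch only, same hypothetical layout as above,
and assuming an LP64 layout with arg at offset 8), but current compilers
simply promote consume to acquire and may break artificial dependency
chains, so this does not buy the cheaper dependency ordering in practice:

  #include <stdatomic.h>
  #include <stdint.h>

  struct tlsdesc
  {
    _Atomic (void *(*) (struct tlsdesc *)) entry;
    void *arg;
  };

  /* The address of the second load formally carries a dependency on the
     consume load, mirroring the and/orr trick quoted above; in practice
     compilers implement consume as acquire.  */
  void *
  load_arg_via_dependency (struct tlsdesc *td)
  {
    uintptr_t entry = (uintptr_t)
      atomic_load_explicit (&td->entry, memory_order_consume);
    uintptr_t off = (entry & 8) | 8;   /* always 8, but data-dependent */
    return *(void **) ((uintptr_t) td + off);   /* loads td->arg */
  }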

> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
> 
> entryfunc:
>   ldar xzr,[x0]
>   ldr x0,[x0,#8]
> 
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access).  Options (2) and (3) do one extra load from [x0] (same-address
> loads are ordered, so it happens-after the load at the call site).
> Option (2) clobbers x1, so I think (3) is the best solution: a load
> into the zero register has the same effect as a load into a normal
> register, but the result is discarded so nothing is clobbered.  (Note
> that the TLSDESC ABI allows the descriptor function to clobber x1, but
> current gcc does not implement this correctly; gcc will be fixed
> independently.  The dynamic and undefweak descriptors currently
> save/restore x1, so only static TLS would be affected by this
> clobbering issue.)
> 
> On the write side I used a full barrier (__sync_synchronize), although
> 
>   dmb ishst
> 
> would be enough; the write side is not performance critical.

Please get familiar with https://sourceware.org/glibc/wiki/Concurrency.
Fixes to existing code should use the C11 memory model for concurrent
code, especially if the fix is about a concurrency issue.

I agree with Adhemerval that a release MO store seems to be sufficient
in this case.
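
To make that concrete (an illustration, not the patch itself): on
AArch64, compilers typically implement a release MO store on a
pointer-sized object as a single stlr instruction, so the write side
needs neither the full __sync_synchronize () barrier nor a separate
dmb ishst:

  #include <stdatomic.h>

  /* Typically compiled to a single stlr on AArch64; no separate dmb is
     needed on the write side.  */
  void
  publish_entry (_Atomic (void *) *slot, void *final_entry)
  {
    atomic_store_explicit (slot, final_entry, memory_order_release);
  }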

I haven't scanned through the TLS code to assess how much work it would
be to make it data-race-free (as defined by C11), but it would certainly
be helpful in showing similar issues (e.g., when documenting it) and
preventing compiler optimizations that are legal for non-concurrent
accesses but not what we need for TLS (e.g., speculative loads
introduced by the compiler on x86 in the case you mention above).

Given that you have looked at the code, could you give a rough estimate
of how much churn it would be to make the TLS code data-race-free?

It also looks as if the x86_64 variant of tlsdesc.c is, before your
changes and ignoring some additional comments, very similar to the
aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
using the C11 model) that can be used on several archs?  This would
certainly decrease maintenance overhead.

I'm aware this might look like I'm requesting extra work that is not at
the core of what you are trying to fix.  However, moving to portable
concurrent code that is shared across several archs is something we've
been doing elsewhere too, and in those cases ARM was on the benefiting
side (e.g., pthread_once, ongoing work in nscd, ...).  I think overall,
some extra work and clean-up will be a good thing for ARM archs too.

Thanks.

