This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix



On 23/04/15 22:46, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>>>> loads and stores)
>>>>
>>>> entryfunc:
>>>>   ldar xzr,[x0]
>>>>   ldr x0,[x0,#8]
>>>>
>>>> Option (1) is the simplest but slowest (note: this runs at every TLS
>>>> access), options (2) and (3) do one extra load from [x0] (same address
>>>> loads are ordered so it happens-after the load on the call site),
>>>> option (2) clobbers x1, so I think (3) is the best solution (a load
>>>> into the zero register has the same effect as with a normal register,
>>>> but the result is discarded so nothing is clobbered. Note that the
>>>> TLSDESC abi allows the descriptor function to clobber x1, but current
>>>> gcc does not implement this correctly, gcc will be fixed independently,
>>>> the dynamic and undefweak descriptors currently save/restore x1 so only
>>>> static TLS would be affected by this clobbering issue).
>>>>
>>>> On the write side I used a full barrier (__sync_synchronize) although
>>>>
>>>>   dmb ishst
>>>>
>>>> would be enough, but write side is not performance critical.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>>
>>
>> yes, the write side could use c11 atomics and i agree that
>> store-release is enough, but i'm not sure how much one can
>> rely on c11 memory model when interacting with asm code.
>> (so i thought a full barrier is the easiest to reason about)
> 
> IIUC, the asm that you need to interact with is generated by the
> compiler.  The C11 atomics implementation (that is, the asm / HW

to clarify:

// this bit is generated by the compiler for every tls access
// (the exact sequence is different and depends on tiny/small/large
// memory model, it is part of the abi so linkers can do relaxations)
// assume x0 is a pointer to the tls descriptor (td) in the GOT.
// (2 consecutive GOT entries are used: td->entry and td->arg)
// the entry function returns an offset (so tls data is at tp+off)
// note: no barriers are used here

  mrs x2, tpidr_el0 // load the thread pointer
  ldr x1, [x0]      // load td->entry
  blr x1            // call it, returns off in x0
  ldr x0, [x2, x0]  // access tls at tp+off

// this is a function implemented by the libc where td->entry points
// currently the options are
//   _dl_tlsdesc_return,
//   _dl_tlsdesc_undefweak,
//   _dl_tlsdesc_dynamic
// originally td->entry points to a trampoline that calls into
// _dl_tlsdesc_resolve_rela which does the lazy relocation and sets
// td->entry to _dl_tlsdesc_hold temporarily.
entryfunc:
  ldr x0, [x0,#8] // load td->arg
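
// for the write side of this the requirement is that td->arg is
// written before td->entry is published, e.g. with a release store
// that an acquire on the read side (the ldar in option (3) above)
// can pair with. a c-level sketch of that requirement (made-up
// names and simplified types, not the actual glibc code):

#include <stdatomic.h>

struct tlsdesc_sketch
{
  _Atomic (void *) entry;  /* would point at _dl_tlsdesc_return etc.  */
  void *arg;               /* argument for the entry function  */
};

/* writer side (lazy relocation)  */
static void
publish_sketch (struct tlsdesc_sketch *td, void *entry, void *arg)
{
  td->arg = arg;   /* plain store, must happen before the release  */
  atomic_store_explicit (&td->entry, entry, memory_order_release);
}

/* reader side, as the compiler-generated sequence plus the entry
   function would look in c terms  */
static void *
read_sketch (struct tlsdesc_sketch *td)
{
  void *e = atomic_load_explicit (&td->entry, memory_order_acquire);
  (void) e;         /* the real code indirect-calls through this  */
  return td->arg;   /* ordered after the acquire load  */
}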

>> i can update the code to use glibc atomics, but i will
>> have to look into how those work (are there type issues
>> or are they generic?)
> 
> They are basically generic, and internally use either the __atomic
> builtins for newer compilers and archs that provide them, or (fall back
> to) the previous implementation based on __sync-builtins or custom asm.
> One constraint is that we don't support sizes of atomic access that
> aren't actually provided on the particular arch (either through HW
> instructions, or with kernel helpers).  So, for example, atomic access
> to a naturally aligned pointer-sized type can be expected to work.

ok
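
(so for a naturally aligned pointer like td->entry i'd expect it to
boil down to the __atomic builtins, roughly as in the sketch below;
made-up names and plain types, not the glibc macros themselves)

/* sketch with gcc __atomic builtins; glibc's generic atomics can
   map to these on newer compilers  */
struct td_sketch
{
  void *entry;
  void *arg;
};

static void
store_entry_release (struct td_sketch *td, void *entry)
{
  __atomic_store_n (&td->entry, entry, __ATOMIC_RELEASE);
}

static void *
load_entry_acquire (struct td_sketch *td)
{
  return __atomic_load_n (&td->entry, __ATOMIC_ACQUIRE);
}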

>> i think td->entry should only be accessed with atomics once
>> the loader finished loading the dso with it.
>> (i assume during dso loading concurrent access is not possible)
> 
> If that's some kind of initialization you have in mind, using nonatomic
> accesses may be fine.  We don't have strict rules around that currently.
> But if it's just a few lines of code, then simply using relaxed-MO
> atomic accesses could be fine as well.
> 

at load time the td may be set up and then not
written again, but with lazy binding it will be
written later.

care should be taken because the compiler-generated
tls access code does not use atomics and can run
in parallel with the lazy initialization.

so i guess any td->entry access in the lazy init code
path should use atomics for c11 correctness
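
e.g. something along these lines (sketch only, made-up names; the
real functions are _dl_tlsdesc_resolve_rela and the hold function
mentioned above): every td->entry write on the lazy path becomes an
atomic store, including the temporary swap to the hold function and
the final release store that publishes td->arg.

#include <stdatomic.h>

struct td_lazy_sketch
{
  _Atomic (void *) entry;
  void *arg;
};

static void
lazy_resolve_sketch (struct td_lazy_sketch *td, void *hold_fn,
                     void *final_entry, void *final_arg)
{
  /* park concurrent callers on the hold function while resolving;
     relaxed may be enough here since no data is published with it
     (sketch assumption)  */
  atomic_store_explicit (&td->entry, hold_fn, memory_order_relaxed);

  /* ... perform the relocation under the loader lock ...  */

  td->arg = final_arg;   /* must be visible before the final entry  */
  atomic_store_explicit (&td->entry, final_entry, memory_order_release);
}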

>> (the particular case i'm trying to fix is hard because
>> the access to td->entry is generated by the compiler, so
>> it has to be worked around inside the entry function.
>> i think x86 does not have this issue)
> 
> Do you mean that you want to affect the compiler-generated code after it
> has been generated?
> 

i'm fixing the tlsdesc entry functions in libc

