This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS


[I'm dropping Andreas from Cc, since this part of the thread has nothing
to do with the reason I copied him at first]

On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote:
>> That's not all we want.  We also want to give the dynamic loader
>> priority in assigning a module to static TLS, while it's loading a set
>> of modules.  HW CAS atomic insns don't give us that AFAIK.

> OK.  That would then be a good point to document.  I'm not yet sure I
> understand how that is consistent with dlopen being required to happen
> before accesses to TLS by other threads.

I think the confusion may arise because of the two roles played by rtld
(s/dlopen/rtld/) in the scene.

One is as the loader of the module that defines the TLS variable.  It's
the one that initializes the TLS initialization block holding the
variable.

Another is as the relocator, which processes relocations referencing the
variable in modules that may have been loaded along with the defining
module (it could be the same module too), or at subsequent dlopen
requests.

When the defining and referencing modules are loaded at once, both
initialization and relocation are performed without releasing the lock.

When the referencing module is dlopened later, the lock is taken again,
and then dlopen might find a relocation that requires a TLS variable to
be in static TLS.  If the variable is already in dynamic TLS, because
some earlier access already made the decision, it returns an error.
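
To make the decision concrete, here's a minimal sketch of the state
machine involved, using made-up constants rather than glibc's actual
l_tls_offset values, and leaving out all synchronization for now:

  /* Sketch only: simplified stand-ins for the per-module TLS offset
     states; glibc's actual constants and types are not used here.  */
  #include <stddef.h>

  #define TLS_UNDECIDED       0    /* no access and no IE reloc seen yet */
  #define TLS_FORCED_DYNAMIC -1    /* a dynamic access got there first */
  /* Any positive value: offset into the static TLS block.  */

  /* What dlopen conceptually does when it finds an IE relocation against
     a variable of an already-loaded module, under the rtld lock.  */
  static int
  assign_static_tls (ptrdiff_t *l_tls_offset_p, ptrdiff_t chosen_offset)
  {
    if (*l_tls_offset_p == TLS_FORCED_DYNAMIC)
      return -1;                         /* already in dynamic TLS: dlopen fails */
    if (*l_tls_offset_p == TLS_UNDECIDED)
      *l_tls_offset_p = chosen_offset;   /* claim a static TLS slot */
    return 0;                            /* already static, or just made static */
  }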

We could use CAS to this end both in dlopen and in tls_get_addr, but
then, given concurrent accesses, we wouldn't improve the odds of a
successful dlopen.  With the lock, we do, to some extent.  It's not a
determining factor: it is always possible that a concurrent access beats
the subsequent dlopen and places the variable in dynamic TLS before
dlopen grabs the lock (or decides with a CAS).  But the lock shifts the
balance slightly, so that dlopen, however long it takes, is given way by
concurrent tls_get_addr calls.

Now, considering that we're moving towards reducing IE in dlopenable
libs, in favor of TLS Descriptors, maybe it's time to revisit this
balance and go for HW CAS instead to decide between static and dynamic.
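
For concreteness, here's a rough sketch (mine, not glibc code) of what
deciding with a CAS on both sides could look like, using C11 atomics and
the same made-up constants as above.  Note that whoever reaches the CAS
first simply wins, which is exactly why this gives dlopen no priority:

  #include <stdatomic.h>

  #define TLS_UNDECIDED       0L
  #define TLS_FORCED_DYNAMIC -1L

  static _Atomic long module_tls_offset = TLS_UNDECIDED;

  /* tls_get_addr path: commit the variable to dynamic TLS unless an
     offset was already decided.  */
  static long
  force_dynamic_tls (void)
  {
    long expected = TLS_UNDECIDED;
    atomic_compare_exchange_strong (&module_tls_offset, &expected,
                                    TLS_FORCED_DYNAMIC);
    return atomic_load (&module_tls_offset);
  }

  /* dlopen path: try to claim a static TLS offset; fails if a concurrent
     dynamic access already forced dynamic TLS.  */
  static int
  claim_static_tls (long chosen_offset)
  {
    long expected = TLS_UNDECIDED;
    if (atomic_compare_exchange_strong (&module_tls_offset, &expected,
                                        chosen_offset))
      return 0;                          /* we won the race */
    return expected > 0 ? 0 : -1;        /* already static: ok; dynamic: fail */
  }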


Even then, we'd want a lock, it occurs to me now.  The reason is that,
once dlopen determines a variable that must go in static TLS can go in
static TLS, it computes the static TLS offset and proceeds to copy the
TLS initialization block to the corresponding area of the static TLS
block of each thread, all while holding the rtld lock.

The lock is what currently stops anyone from stepping in between
dlopen's determination that the variable must and can go in static TLS,
and the ultimate setting of the offset to indicate so.  Should we go
CAS, we'd need some intermediate hold value, or dlopen would have to
retract its decision, or some such.
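
Something like the following (still just a sketch, with an invented
sentinel glibc doesn't have) is what I mean by an intermediate hold
value: dlopen parks the offset in an in-progress state, initializes
every thread's static TLS area, and only then publishes the real offset,
so a concurrent tls_get_addr that sees the hold value knows it has to
wait or take the slow path:

  #include <stdatomic.h>

  #define TLS_UNDECIDED       0L
  #define TLS_FORCED_DYNAMIC -1L
  #define TLS_IN_PROGRESS    -2L   /* hypothetical hold value */

  static _Atomic long module_tls_offset = TLS_UNDECIDED;

  /* Placeholder for the cross-thread work dlopen performs: copying the
     TLS initialization block into each thread's static TLS block.  */
  extern void initialize_static_tls_of_all_threads (long offset);

  static int
  claim_static_tls_with_hold (long chosen_offset)
  {
    long expected = TLS_UNDECIDED;
    if (!atomic_compare_exchange_strong (&module_tls_offset, &expected,
                                         TLS_IN_PROGRESS))
      return expected > 0 ? 0 : -1;  /* already decided one way or the other */

    initialize_static_tls_of_all_threads (chosen_offset);

    /* Publish only after the areas are initialized; the release store is
       what makes the initialization visible to an acquire load of the
       offset in tls_get_addr.  */
    atomic_store_explicit (&module_tls_offset, chosen_offset,
                           memory_order_release);
    return 0;
  }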

Now, for IE access models, this is not a problem: if they execute an IE
access model, it means the relocation and initialization happens-before
the execution, or that the execution is unordered and therefore
undefined.

As for dynamic access models involving tls_get_addr, this is not
clear-cut: there could be GD relocations processed earlier, but that are
only exercised concurrently with a dlopen that places the variable in
Static TLS.  If we want dlopen to succeed, concurrent dynamic access
models must wait until the static TLS area is initialized by dlopen.
Taking the lock is the way this is accomplished, with the added benefit
of acquiring the static TLS block memory that dlopen initialized and
then released as it released the lock.
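
In code, the reader side of the current arrangement amounts to roughly
this (a simplification with placeholder names, not the literal
__tls_get_addr code): by taking the same lock dlopen holds while it
initializes the static TLS areas, the reader both waits out that
initialization and acquires the memory dlopen wrote:

  #include <pthread.h>

  /* Placeholders standing in for GL(dl_load_lock) and l_tls_offset.  */
  static pthread_mutex_t rtld_lock = PTHREAD_MUTEX_INITIALIZER;
  static long module_tls_offset;

  static long
  read_offset_locked (void)
  {
    long off;
    pthread_mutex_lock (&rtld_lock);   /* blocks while dlopen initializes,
                                          and acquires what it released */
    off = module_tls_offset;
    pthread_mutex_unlock (&rtld_lock);
    return off;
  }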

Now, in order for this to work without a lock, we'd need dlopen to set
l_tls_offset *after* completing the initialization of the static TLS
area, and while releasing it, never before.  Heck, we need this even
with my proposed patch.  However, I don't see that dlopen behaves this
way: it appears to set l_tls_offset first, and then use it to initialize
each thread's static TLS area.  Therefore, my current patch is broken,
and the code in master is correct as far as dynamic access to TLS
variables in static TLS goes: the lock is unfortunately necessary under
the current design.
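
To spell out the ordering a lock-free reader would depend on (same
made-up names as in the sketches above): tls_get_addr could only trust
an acquire load of the offset if the writer stored the offset *after*
the initialization, with release semantics, which is precisely what I
don't see the current dlopen doing:

  #include <stdatomic.h>
  #include <stddef.h>

  static _Atomic long module_tls_offset;   /* 0 = undecided, -1 = dynamic */

  /* Hypothetical lock-free fast path.  Correct only if the writer does a
     release store of the offset *after* initializing the static TLS area
     (as in the hold-value sketch above); sign and layout details of the
     offset are architecture-specific and elided here.  */
  static void *
  static_tls_addr_or_null (char *thread_pointer)
  {
    long off = atomic_load_explicit (&module_tls_offset,
                                     memory_order_acquire);
    if (off <= 0)
      return NULL;                     /* undecided or dynamic: slow path */
    return thread_pointer + off;
  }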

I withdraw the patch.


> You said there are no other dependencies, and we just want to reach
> consensus on the final value of l_tls_offset.  So, from the
> perspective of just this synchronization state machine you gave,
> there's no other data that's relevant.  Now you say there is other
> stuff.  What's true?

The latter.  When I wrote that, I was convinced we'd covered all the
cases of initialization performed by dlopen, because I had failed to
take into account the cross-thread static TLS area initialization
performed by a subsequent dlopen that assigns a previously-loaded
TLS variable to static TLS.


It looks like we could still avoid the lock in tls_get_addr by
reorganizing dlopen to choose the offset, initialize and release the
static blocks, and only then CAS the offset, failing if l_tls_offset was
changed along the way.
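
Roughly like this (again a sketch with the same invented names, not a
patch): initialize first, and let a failing CAS mean that a concurrent
dynamic access won the race in the meantime:

  #include <stdatomic.h>

  static _Atomic long module_tls_offset;   /* 0 = undecided, -1 = dynamic */

  /* Placeholder for the cross-thread static TLS initialization.  */
  extern void initialize_static_tls_of_all_threads (long offset);

  static int
  dlopen_try_static_tls (long chosen_offset)
  {
    /* Do the cross-thread initialization before publishing anything; the
       offset has already been chosen, so the work is not wasted if the
       CAS below succeeds.  */
    initialize_static_tls_of_all_threads (chosen_offset);

    long expected = 0;
    if (atomic_compare_exchange_strong_explicit (&module_tls_offset,
                                                 &expected, chosen_offset,
                                                 memory_order_release,
                                                 memory_order_relaxed))
      return 0;
    return expected > 0 ? 0 : -1;   /* lost to a concurrent dynamic access */
  }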

> Maybe you misunderstood, so let me rephrase it.  When, such as in this
> case, something deviates from what's typical, it's worth pointing this
> out.

I agree.  But does it?  GNU libc has lots of occurrences of this
pattern, so this is hardly a deviation from what's typical.

> You can put the code comments somewhere in the code, for example at one
> of the functions taking part, or at the decl of the variable, or
> somewhere else.  And then reference it.

*nod*

> I thought we had discussed this sufficiently before

We had, but we never reached agreement, and I doubt we ever will on this
point.

> Remember the loads on tile that got mentioned in a previous discussion
> we had?

'fraid I don't.  Reference?

>> - reloads would not be a problem for the pattern used in the second
>> version of the patch

> Yeah, one could speculate about what a compiler may or may not do in
> this *specific* piece of code.  But that's not the level of abstraction
> we want to use as base for our reasoning.

What is clear to me is that our reasonings and thought frameworks are so
different that your personal preferences don't apply to my way of
thinking, and vice versa.  Which is why I'm ok with answering questions
you may have about synchronization patterns I'm familiar with in GNU
libc, but not at all ok with writing the documentation.  From our
discussions so far, I am sure any such documentation I write will be
regarded as useless to you.  Because you and I think in different
terms, different primitives, different abstraction layers.  I'd build
the reasoning from my perspective, and it wouldn't match yours.  And
vice-versa.

> For example, assume the compiler is aware of the code for
> "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> access l_tls_offset in some way (which it doesn't).   You access
> l_tls_offset inside and out of a critical section, so by
> data-race-freedom, there must not be concurrent write accesses.  So it
> does not actually have to reload l_tls_offset inside of the critical
> section, but use the value of your initial load.  This correct and
> reasonable-for-C11 compiler optimization breaks your synchronization.

Correct and reasonable-for-C11 except for the bold and bald assumption
that a lock operation is not a global acquire.  The compiler is
forbidden from removing the second load because of the existence of the
lock.  Now, that requirement comes from POSIX, not from the C standard.
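
To make the disputed transformation concrete (my own reduction, not the
actual dl-tls.c code): the pattern we're arguing about looks roughly
like the first function below.  In your model, switching to (even
relaxed) atomic loads, as in the second, is what takes away the
compiler's license to reuse the earlier value; in mine, the lock already
does:

  #include <pthread.h>
  #include <stdatomic.h>

  /* Placeholders standing in for GL(dl_load_lock) and l_tls_offset.  */
  static pthread_mutex_t rtld_lock = PTHREAD_MUTEX_INITIALIZER;
  static long offset_plain;                /* non-atomic */
  static _Atomic long offset_atomic;       /* C11 atomic */

  /* Plain loads.  If the compiler can see that the lock implementation
     never writes offset_plain, a data-race-freedom argument lets it
     reuse the value of the first load for the reload inside the
     critical section.  */
  static long
  plain_version (void)
  {
    long off = offset_plain;               /* load outside the lock */
    if (off == 0)
      {
        pthread_mutex_lock (&rtld_lock);
        off = offset_plain;                /* reload the compiler may elide */
        pthread_mutex_unlock (&rtld_lock);
      }
    return off;
  }

  /* Atomic loads.  The reload may not be folded into the earlier load,
     because the lock acquisition can synchronize with a writer that
     stored offset_atomic before unlocking.  */
  static long
  atomic_version (void)
  {
    long off = atomic_load_explicit (&offset_atomic, memory_order_relaxed);
    if (off == 0)
      {
        pthread_mutex_lock (&rtld_lock);
        off = atomic_load_explicit (&offset_atomic, memory_order_relaxed);
        pthread_mutex_unlock (&rtld_lock);
      }
    return off;
  }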

>> - speculative stores that could affect this pattern are not permitted by
>> the relevant standards AFAICT

> The standard permits to speculatively store a value, if the target is
> not volatile and the speculative store does not create a data race.

*if* the abstract machine would have stored something in the variable to
begin with.  In the case at hand, it wouldn't, so, no speculative
writes.
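
The textbook illustration of what the standard does forbid is the
compiler inventing a write the abstract machine never performs, e.g.:

  /* The abstract machine stores to *p only when cond is nonzero.
     Transforming this into
         tmp = *p; *p = 1; if (!cond) *p = tmp;
     would introduce a write on the !cond path, which can create a data
     race with another thread and is therefore not a permitted
     optimization for a non-volatile, potentially shared object.  */
  void
  maybe_store (int *p, int cond)
  {
    if (cond)
      *p = 1;
  }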

> No, this is simply not true in general.  You can argue about likelihood
> in this *particular* case, but then you're doing just that.

Indeed, that's just what I'm doing.

> If you think it's unlikely compilers will try to optimize, have a look
> at this:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

>> Since Carlos and Siddhesh took this over, I'll leave it for them too.

> So you will stop working on this?

Once Carlos wrote he and Siddhesh would take care of the issue, I did.

It doesn't mean I'm not willing to share the knowledge I got from
studying the intricacies of TLS synchronization any more.

Now that the patch is withdrawn because it is broken, we can even focus
on more productive questions and answers about it.  But this only makes
sense (as opposed to being a waste of time) if there's a commitment to
turn the conversation into proper documentation.  Do we have that from
anyone?

> If you intend to work on patches affecting synchronization in the
> future, please remember the feedback you got in this patch.

I'd rather try to avoid it.  It's just too hard for you and me to
communicate in this field, and I assume it's as frustrating for you as
it is for me :-(

>> We'd already determined the release fence was needed and taken care of
>> by other means.

> Huh?

The rtld that loads the module defining the variable and then releases
the rtld lock (and thus the memory) happens-before any well-defined use
of the variable (without some happens-before, how could the user safely
get hold of the relocations used in the TLS access model?), even if the
user doesn't take the rtld lock, as covered in other messages upthread.
(That case is not to be confused with the subsequent dlopen that
processes relocations referencing the variable and assigns it to static
TLS, described in this message.)

>> > Also note that an unlock operation on a lock is *not* guaranteed to have
>> > the same effects as an atomic_thread_fence_release.
>> 
>> *nod*
>> 
>> > The hardware may or may not treat unlocks (ie,
>> > release stores on a *particular* memory location)
>> 
>> It is my understanding that, per POSIX, unlocks ought to release all
>> prior stores made by the thread, and locks must acquire all
>> previously-released stores by any thread.  I.e., they don't apply to
>> single memory locations.  Locks don't even know what particular memory
>> locations they guard, if any.  So I can't make sense of what distinction
>> you intend to point out above.

> We implement POSIX for users of glibc, but we do not implement on top of
> POSIX inside of glibc

Is this documented anywhere?  If we're breaking well-settled expectations
WRT locks, we'd better document that.  Though I wouldn't recommend
deviating so much from well-established practice.  It would be a huge
burden for anyone who might consider joining our community.

> (And you know that we don't even
> implement precisely the POSIX semantics with our POSIX locks.)

Yeah, and I do find that unfortunate.

> In C11, there's a distinction between a release-MO fence and a mutex
> unlock operation (i.e., a release-MO store).
> Does that answer your question?

I don't think I asked a question, but no, it doesn't help me understand
what you meant by "unlocks (ie release stores on a particular memory
location)".  In my understanding, unlocks are not about particular
memory locations, so it comes across as a contradiction and your attempt
to answer some unstated underlying question does nothing to clarify it.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

