This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PR19826] fix non-LE TLS in static programs


On Sep 24, 2016, Torvald Riegel <triegel@redhat.com> wrote:

> On Sat, 2016-09-24 at 00:44 -0300, Alexandre Oliva wrote:
>> In this patch, I add a long comment explaining why we can't modify
>> another threads' DTV while initializing its static TLS block during
>> dlopen.

> I appreciate that you added a comment.  Given that you seemed to have
> paged in all of this currently, it would be even better if you could try
> to further improve the documentation of all of this.

As your feedback to the patch shows, it's pointless for me to even try.
My way of reasoning about this stuff doesn't seem to make sense to you.
So why bother?

> (I suppose missing the issue you fix here would have been less likely
> if proper documentation existed before.)

*nod*.  That's why I added the new, longer comments: the comments I'd
added before were not enough to stop even myself from reintroducing the
error, because there was a shallow, more obvious (apparent?  harmless?)
race, and a deeper, less obvious one that's very harmful when it hits.

> Are you aware of any synchronization in the TLS stuff that is not based
> on locks?

Yes, plenty.  Off the top of my head, there's static TLS block
initialization at dlopen vs use by other threads, there was DTV
initialization vs use and resize (fixed by my 1.5yo-patch, broken by my
patch from last week, and fixed again once the patch that reverts the
breakage goes in, because we only mess with a thread's DTV at thread its
start-up and then by the thread itself) but IIRC there are
self-concurrency issues if DTV resizing is interrupted by a signal that
uses GD and starts further resizing) and IIRC there is plenty of
suspicious stuff in slotinfo manipulation vs use (the data structure is
only modified while holding the rtld lock, but it's read while updating
a DTV without any explicit synchronization; I never looked into it
enough to tell what the assumptions were that make it safe at some
point, to tell whether there's any chance they still are); I have a
hunch that there are probably issues with TLS descriptors as well,
because although they're created while holding the rtld lock, their use
is not explicitly guarded either, nor are than the GOT entries relocated
so as to point to them.

IMHO these are all internal implementations, so the C11 memory model for
applications doesn't quite apply to them: it's ok if they make stronger
or weaker assumptions than what the language is supposed to expose to
users of the language.  Just like the *implementation* of mutexes,
locks, condition variables and other such concepts won't necessarily
abide by the rules imposed on users of the language, as long as they
deliver the intended semantics, icluding synchronization and memory
model properties.

> If so, it would be great if you could transform them to using the
> new-style C11 atomics.

I guess it would, wouldn't it?  Too bad I'm probably not.  I just can't
stand even the thought of the never-ending conversations that would
ensue.  So, I'm out of here; the only reason you're even from me is that
I'm responsible enough to at least try to fix the breakage I introduced,
once it was brought to my attention (at a time in which I actually
noticed it; too bad I missed it the first time).

> I've reviewed nscd concurrency recently and converted it to use C11
> atomics, and this process revealed many synchronization bugs such as
> missing barriers.

That shouldn't be surprising.  A lot of that code was written before C11
was even in the foreseeable future.

>> +     Another race, not to be mistaken for the above, has to do with
>> +     the TLS block initialized below.  Although we perform this
>> +     initialization while holding the rtld lock, once dlopen
>> +     completes, other threads might run code that accesses the
>> +     variables in these TLS blocks, without taking the rtld lock or
>> +     any other explicit memory acquire.

> What is a "memory acquire"?

That's a term from long before C standardized atomics, from the 20 years
or so in which people like myself have been doing concurrent programming
on multi-CPU computers without explicit language support, and even on
single-CPU multi-tasked systems.  In dusted books from that age, you may
be able to find tons of information about such things as memory barriers
or fences, some with acquire and release semantics, as well as the
implied acquire and release of all memory that occurs at mutex acquire
and release time, respectively.

There are still some remnants from those days in e.g. Linux docs:
https://www.kernel.org/doc/Documentation/memory-barriers.txt

but a lot of what can be found on the Internet now seems to be more
recent than C and C++ atomics.  Still, the notion of memory acquire and
release from the ancient scrolls is still there, even if a bit in
disguise.  There's no reason for us to limit our language and throw away
all the old books just because the term was reused in a new, narrower
context, is there?

>> +     Our implementation appears to assume that, in order for the other
>> +     threads to access those variables, there will be some form of
>> +     synchronization between the end of dlopen() in one thread and the
>> +     use of the just-loaded modules in others, otherwise the very use
>> +     of the loaded code could be undefined.  */

> I'd reverse that: I find it misleading that you start that there is a
> race condition, when in fact this seems not to be the case because of
> the assumption.  Therefore, I'd mention the assumption first, and then
> derive from that there is no data-race and that uses of the TLS will
> find it initialized.

Unless the assumption fails to meet the definition of data race in some
possibly irrelevant fashion.  The point is that the implementation
doesn't guard itself from certain races, if someone's determined to
create one, but I pose they don't matter.  Let me give you some
examples:

- there's the obvious case of the program with a race: the dlopen caller
sets a global non-atomic variable when dlopen completes, and another
thread busy-waits for the variable to change, and then uses its value to
call a function that accesses the newly-loaded TLS variable.  The race
on the global variable enables the race on the TLS variable, so
undefined behavior has already been invoked by the time the TLS
implementation incurs its own race.

- here's a variant that doesn't involve any other race: the dlopen
caller writes to a pipe a pointer to a function that accesses the TLS
variable, and another thread reads the pointer from the pipe and calls
it -> race?  there's clearly a happens-before order implied by the
write() and read(), but that doesn't imply memory synchronization in the
memory model AFAICT.

- another variant in which there's a clear happens-before order, but no
memory synchronization, so there's possibly a race: the dlopen caller
installs a signal handler that accesses the TLS variable, and then gets
the signal sent to the process itself; another thread is chosen to
handle the signal and accesses the TLS variable -> race?

- generalizing, use any out-of-(memory-)band information-carrying side
effect that one thread could use to communicate to another thread how to
call a function that accesses the TLS variable, both function and
variable brought in by dlopen, and you'll have, technically, a data
race, but IMO a harmless data race.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]