This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]


On 10/17/18 4:12 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 10/15/18 8:57 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> (3) Fence-to-fence sync.
>>>>
>>>> For fence-to-fence synchronization to work we need an acquire and release
>>>> fence, and we have that.
>>>>
>>>> We are missing the atomic read and write of the guard. Please review below.
>>>> Florian mentioned this in his review. He is correct.
>>>>
>>>> And all the problems are back again because you can't do atomic loads of
>>>> the large guards because they are actually the function descriptor structures.
>>>> However, this is just laziness, we used the addr because it was convenient.
>>>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>>>> that as the guard to synchronize the threads against for initialization of
>>>> the results. This should solve the reloc_result problem (ignorning the issues
>>>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
>>>
>>> I think due to various external factors, we should go with the
>>> fence-based solution for now, and change it later to something which
>>> uses an acquire/release on the code address later, using proper atomics.
>>
>> Let me clarify.
>>
>> The fence fix as proposed in v3 is wrong for all architectures.
>>
>> We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
>> *requires* an atomic load / store of the guard, you can't use a non-atomic
>> access. The point of the atomic load/store is to ensure you don't have a
>> data race.
> 
> Carlos, I'm sorry, but I think your position is logically inconsistent.

Yes, it *is* logically inconsistent. I agree with you.

However, to *be* logically consistent I'd have to fix all data races in
this code in one go, and I can't, it's too much work.

All I want is for any *changes* to follow C11 semantics, and I think we
can do that without major surgery.

Consider it an ideological flaw that I want everyone to practice following
a consistent memory model and think about these problems in terms of that
memory model, and evaluate patches using that model.

> Formally, you cannot follow the memory model here without a substantial
> rewrite of the code, breaking up the struct fdesc abstraction.  The
> reason is that without blocking synchronization, you still end up with
> two non-atomic writes to the same object, which is a data race, and
> undefined, even if both threads write the same value.

There are two distinct problems here, and each can be handled distinctly.

The first is the problem at hand, that there is a data-dependency issue
with the update of the struct reloc_result structure. We have multiple
threads writing to the reloc_result structure, in general those threads
write the same value (locks are taken _dl_lookup_symbol_x), and while
this is a data race, I don't care about it and we aren't going to fix 
it. The only thing we should do is that a thread that determines the 
reloc_result is initialized should see all the correct value in the 
structure and not a sheared result. That is all that we are fixing
here, call that the "change."

We can follow the memory model far enough to avoid a sheared result
being read out of the struct reloc_result.

We have not fixed the data races that occur when two threads read a
zero addr value and both use non-atomic writes to update reloc_result,
and I don't intend the patch to fix that. I don't require that.

> As far as I can see, POWER is !USE_ATOMIC_COMPILER_BUILTINS, so our
> relaxed MO store is just a regular store, without a compiler barrier.
> That means after all that rewriting, we basically end up with the same
> code and the same formal data race that we would have when we just used
> fences.

That's fine.

The use of the guard+fence-to-fence sync is, from a C11 perspective,
correct. However, I recommend adding a reloc_result->reloc_init and
using that with release/acquire loads.

> This is different for USE_ATOMIC_COMPILER_BUILTINS architectures, where
> we do use actual atomic stores.  But for !USE_ATOMIC_COMPILER_BUILTINS,
> the fence-based approach is as good as we can get, with or without
> breaking the abstractions.

We can do better.

> So as I said, given the constraints we are working under, we should go
> with the solution based on fences, and have that tested on Aarch64 as
> well.

I whole heartedly appreciate a pragmatic approach to these problems, but
I still challenge that we can do better without much more work.

>>> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
>>> fix needs some reshuffling of the macros here, or maybe use an unused
>>> bit in the flags field as an indicator for initialization.
>>
>> The fix for this is straight forward.
>>
>> Add a new initializer field to the reloc_result, it's an internal data
>> structure. It can be as big as we want and we can optimize it later.
>>
>> You don't need to do any big cleanups, but we *do* have to get the
>> synchronization correct.
> 
> See above; I don't think we can get the synchronization formally
> correct, even with any level of cleanups.  In the data race case, we
> would have
> 
>   atomic acquire MO load of initializer field
>   non-atomic writes to various struct fields
>   atomic release MO store to initializer field
> 
> in each thread.  That's still undefined behavior due to the blocking
> stores in the middle.

That's fine. The changes made are correct, even if the whole algorithm
itself is not.

> Let me reiterate: Just because you say our atomics are C11, it doesn't
> make them so.  They are syntactically different, and they are not
> presented to the compiler as atomics for !USE_ATOMIC_COMPILER_BUILTINS.
> I know that you and Torvald didn't consider this a problem in the past,
> but maybe you can reconsider your position?

My position is unchanged.

If I could summarize our positions, I would write:

(1) The "Pragmatic approach"

- Since we don't have C11 atomics, we should just use the fences because
  they fix the data dependency issue, and stop there. We need not go any
  further until we are ready to fix the underlying algorithm of the result
  updates and *then* we can follow C11.

(2) The "Incremental C11 approach"

- Assume we are C11, and take an incremental approach where we fix the
  data dependency issue using correct synchronization primitives, even
  if it doesn't solve all of the data races.

Did I summarize your position accurately?

I prefer (2).

Tulio can have a preference also.

I spoke to Tulio on IRC and he says he has a working v4 in build-many-glibcs.

I figure we'll commit something probably tomorrow for this.

Just for the record I'm attaching my WIP v4 so you can see what I mean about
the solution. Yes, I did make the structure 4 bytes larger, and that will have
a real impact, but it removes the dependency on the size of the function pointer,
and I like that for future maintenance, and we'll likely rewrite the whole struct
when we fix 23790 to get a more optimal layout.

-- 
Cheers,
Carlos.
diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..b115f738a4 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -158,6 +158,7 @@ _dl_profile_fixup (
 		   struct link_map *l, ElfW(Word) reloc_arg,
 		   ElfW(Addr) retaddr, void *regs, long int *framesizep)
 {
+  unsigned int reloc_init;
   void (*mcount_fct) (ElfW(Addr), ElfW(Addr)) = _dl_mcount;
 
   if (l->l_reloc_result == NULL)
@@ -183,10 +184,36 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
+  DL_FIXUP_VALUE_TYPE value;
+  /* CONCURRENCY NOTES:
+
+     Multiple threads may be calling the same PLT sequence and with
+     LD_AUDIT enabled they will be calling into _dl_profile_fixup to
+     update the reloc_result with the result of the lazy resolution.
+     The reloc_result guard variable is reloc_init, and we use
+     acquire/release loads and store to it to ensure that the results of
+     the structure are consistent with the loaded value of the guard.
+     This does not fix all of the data races that occur when two or more
+     threads read reloc_result->reloc_init with a value of zero and read
+     and write to that reloc_result concurrently.  The expectation is
+     generally that while this is a data race it works because the
+     threads write the same values.  Until the data races are fixed
+     there is a potential for problems to arise from these data races.
+     The reloc result updates should happen in parallel but there should
+     be an atomic RMW which does the final update to the real result
+     entry (see bug 23790).
+
+     The following code uses reloc_init set to 0 indicate if it is the
+     first time this object is being relocated, otherwise 1 which
+     indicates the object has already been relocated.
+
+     Reading/Writing from/to reloc_result->reloc_init must not happen
+     before previous writes to reloc_result complete as they could
+     end-up with an incomplete struct.  */
+  reloc_init = atomic_load_acquire (&reloc_result->reloc_init);
+
+  if (reloc_init == 0)
     {
       /* This is the first time we have to relocate this object.  */
       const ElfW(Sym) *const symtab
@@ -346,8 +373,16 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  /* Guarantee all previous writes complete before
+	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
+	     NOTES earlier  */
+	  reloc_result->addr = value;
+	  atomic_store_release (&reloc_result->reloc_init, 1);
+	}
     }
+  else
+    value = reloc_result->addr;
 
   /* By default we do not call the pltexit function.  */
   long int framesize = -1;
@@ -355,7 +390,7 @@ _dl_profile_fixup (
 #ifdef SHARED
   /* Auditing checkpoint: report the PLT entering and allow the
      auditors to change the value.  */
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
+  if (reloc_init != 0 && GLRO(dl_naudit) > 0
       /* Don't do anything if no auditor wants to intercept this call.  */
       && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
     {
diff --git a/include/link.h b/include/link.h
index 5924594548..a4895ecb2f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -211,10 +211,20 @@ struct link_map
     /* Collected results of relocation while profiling.  */
     struct reloc_result
     {
+      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
+	 of the relocation result across multiple threads. See the more
+	 detailed notes in elf/dl-runtime.c.  */
+      unsigned int reloc_init;
       DL_FIXUP_VALUE_TYPE addr;
       struct link_map *bound;
       unsigned int boundndx;
+      /* We use this to store the la_symbind flag results.  */
       uint32_t enterexit;
+      /* The enterexit field holds the plt exit/enter results from the symbol
+	 binding, which leaves the two low bits of the following flag unused.
+	 Low 2-bits of each char holds plt exit/enter tracing information for
+	 each dynamic loader namespace (and because of that we don't want to
+	 use this as the guard variable for initialization).  */
       unsigned int flags;
     } *l_reloc_result;
 
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..ba2ce125c3 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,11 @@ endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+LDFLAGS-tst-audit-threads = -Wl,-z,lazy
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..45f89017d3
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,43 @@
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+/* We must use a dummy LD_AUDIT module to force the dynamic loader to
+   *not* update the real PLT, and instead use a cached value for the
+   lazy resolution result. It is the update of that cached value that
+   we are testing for correctness by doing this.  */
+
+volatile int count = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..f9817dd3dc
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions in a library.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..d303648c96
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,97 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  We use one thread per CPU to test that
+   concurrent lazy resolution does not have any defects which would cause
+   the process to fail.  We use an LD_AUDIT library to force the testing of
+   the relocation resolution caching code in the dynamic loader i.e.
+   _dl_runtime_profile and _dl_profile_fixup.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+/* Declare the functions we are going to call.  */
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+/* Call each function one at a time from all threads.  */
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create(NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join(threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..491d0dcbf0
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,89 @@
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* We use this helper to create a large number of functions, all of
+   which will be resolved lazily and thus have their PLT updated.
+   This is done to provide enough functions that we can statistically
+   observe a thread vs. PLT resolution failure if one exists.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]