This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: libc-alpha at sourceware dot org
- Date: Tue, 19 May 2015 15:05:10 -0300
- Subject: Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
- Authentication-results: sourceware.org; auth=none
- References: <1424456307 dot 20941 dot 122 dot camel at triegel dot csb> <1431713889 dot 25070 dot 17 dot camel at triegel dot csb>
Hi,
I am still reading all the code and the synchronization required, but it is
comments are very instructive so far. On nit to make it build with recent
changes:
diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index 2106bf6..8a98805 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -35,7 +35,7 @@
# undef INLINE_VSYSCALL
# define INLINE_VSYSCALL INLINE_SYSCALL
#else
-# include <bits/libc-vdso.h>
+# include <libc-vdso.h>
#endif
struct _condvar_cleanup_buffer
With this change I could check on aarch64 and arm without any regressions.
Also it was about time to remove the x86_64/i386 pthread cond assembly
implementation and make them use the default C implementation (in fact I am
planing to add the same cleanup for the bz12683 fix I am working).
On 15-05-2015 15:18, Torvald Riegel wrote:
> Ping, 3 months later.
>
> I believe Siddhesh has been testing this in rawhide, and at least I
> haven't heard of any breakage. Nonetheless, the new implementation is
> complex, so getting an actual review before this gets checked in would
> be good. At the very least, it would be good if someone could confirm
> that the comments are sufficient and easy to follow; alternatively,
> please complain if you think more detail is needed.
>
> On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
>> TLDR: This is a new implementation for condition variables, required
>> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165. In
>> essence, we need to be stricter in which waiters a signal or broadcast
>> is required to wake up; this couldn't be solved using the old algorithm.
>> ISO C++ made a similar clarification, so this also fixes a bug in
>> current libstdc++, for example.
>> This doesn't contain the existing requeue optimization (see below), and
>> PI support is not worse (except no requeue).
>>
>>
>> Arch maintainers: This adapts each arch's definition of pthread_cond_t.
>> Only x86, x86_64, and hppa have significant arch-specific changes. I'd
>> appreciate review considering we want to stay compatible with old
>> initializers, and we don't want existing padding to alias with bytes
>> that are now used for real fields (see text on pthread_cond_t below for
>> details).
>>
>>
>> And here's the detailed version :)
>>
>> We can't use the old algorithm, which tries to avoid spurious wake-ups,
>> anymore because there's no way (AFAICT) to wake in FIFO order from a
>> futex (the current Linux implementation may do today, but it's not
>> guaranteed). Thus, when we wake, we can't simply let someone grab a
>> signal, but we need to ensure that one of the waiters happening before
>> the signal is woken up -- not just any waiter. This is something the
>> previous algorithm violated (see bug 13165).
>> The problem with having to wake in-order and trying to prevent spurious
>> wake-ups is that one would have to encode the order, which one needs
>> space for (e.g., for separate futexes). But pthread_cond_t is limited
>> in space, and we can't use external space for process-shared condvars.
>>
>> The primary reason for spurious wake-ups with this new algorithm is the
>> following: If we have waiters W1 and W2, and W2 registers itself as a
>> waiter later than W1 does, and if W2 blocks earlier than W1 using a
>> futex, then a signal will wake both W1 and W2. IOW, this is when one
>> waiter races ahead of an earlier one when doing futex_wait. Once they
>> both wait, or if they keep their order, there's no spurious wake-ups.
>>
>> We could avoid more of these spurious wake-ups by maintaining at least
>> two groups of waiters that approximate the waiter order (e.g., have one
>> group that is eligible for wake-up, and drained with priority, and a
>> second that is catch-all and will become the new first group when that
>> is drained). But this would add substantial complexity to the
>> algorithm, and it may be a tight fit into the size of pthread_cond_t we
>> have today.
>>
>>
>> There's another issue specific to condvars: ABA issues on the underlying
>> futexes. Unlike mutexes that have just three states, or semaphores that
>> have no tokens or a limited number of them, the state of a condvar is
>> the *order* of the waiters. With a semaphore, we can grab whenever
>> there is a token, and block in the futex_wait when there is none. With
>> condvars, a waiter must not block if there had been more
>> signals/broadcasts than waiters before it (ie, ordering matters).
>>
>> Futex words in userspace (ie, those memory locations that control
>> whether futex_wait blocks or not) are 32b. Therefore, we can have ABA
>> issues on it, which could lead to lost wake-ups because a waiter simply
>> can't distinguish between no new signals being sent and lots of signals
>> being sent (2^31 in this implementation).
>>
>> It might be unlikely that this occurs, and needs a specific scenario,
>> but I'm not comfortable with just ignoring it -- knowingly. Therefore,
>> this patch avoids the ABA issues by quiescing the condvar before an
>> overflow on the internal counters for the number of waiters /
>> signals-sent happens, and then resets the condvar. This just increases
>> the number of spurious wake-ups while we do the reset on non-PI condvars
>> (but it is a problem for PI; see below). The quiescence handling does
>> add complexity, but it seems not excessive relative to the rest of the
>> algorithm.
>>
>>
>> This algorithm satisfies the equivalent of the strong mutex destruction
>> guarantee. However, unlike mutexes, because of spurious wake-ups being
>> allowed a correct program has to effectively ensure that destruction
>> happens after signals/broadcast calls return. Thus, the destruction
>> requirement in POSIX is not as restrictive as with semaphores, but we
>> still need to take care.
>>
>>
>> If you want to dive into the code, it's best to start with the comments
>> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.
>>
>> One notable difference to the previous implementation is that the new
>> code doesn't use an internal lock anymore. This simplifies the PI
>> implementation (see below), and should speed up things like concurrent
>> signals/broadcasts, and the general hand-off between wait and
>> signal/broadcast.
>>
>> I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
>> they both share most of the code. __pthread_cond_wait_common is
>> currently an always_inline, but we might also make that a noinline if
>> you're concerned over statically linked programs that use either the
>> timed or the non-timed cond_wait variant.
>>
>> pthread_cond_t is the same on all archs (except on hppa, see below, and
>> m68k which enforces 4-byte alignment of the first int). The new
>> algorithm needs less space (int instead of long long int in the case of
>> three fields), so the old initializers should remain working. The x86
>> version looks fine for me, but I'd appreciate (an)other set(s) of eyes
>> on this aspect. We don't need stronger alignment for the new algorithm.
>>
>> Carlos: the hppa changes are completely untested. They change the
>> pthread-once-like code to C11 atomics, which fixes one missing compiler
>> barrier (acquire load was a plain load). Let me know if you object to
>> these.
>>
>> x86 had custom assembly implementations. Given that this patch fixes a
>> correctness problem, I've just removed them. Additionally, there's no
>> real fast-path in cond_wait unless perhaps if we want to consider just
>> spin for a signal to become available as a fast path; in all other
>> cases, we have to wait, so that's cache misses at least. signal and
>> broadcast could be considered fast paths; the new code doesn't use an
>> internal lock anymore, so they should have become faster (e.g.,
>> cond_signal is just a CAS loop now and a call to futex_wait (that we
>> might avoid too with some more complexity).
>>
>> There are three new tests, cond36-cond28, which are variations of
>> existing tests that frequently drive a condvar into the quiescence state
>> (and thus test quiescence).
>>
>>
>> This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
>> waking just one thread and requeueing all others on the futex of the
>> mutex supplied by the program). I don't think doing the requeue is
>> necessarily the right approach (but I haven't done real measurements
>> yet):
>> * If a program expects to wake many threads at the same time and make
>> that scalable, a condvar isn't great anyway because of how it requires
>> waiters to operate mutually exclusive (due to the mutex usage). Thus, a
>> thundering herd problem is a scalability problem with or without the
>> optimization. Using something like a semaphore might be more
>> appropriate in such a case.
>> * The scalability problem is actually at the mutex side; the condvar
>> could help (and it tries to with the requeue optimization), but it
>> should be the mutex who decides how that is done, and whether it is done
>> at all.
>> * Forcing all but one waiter into the kernel-side wait queue of the
>> mutex prevents/avoids the use of lock elision on the mutex. Thus, it
>> prevents the only cure against the underlying scalability problem
>> inherent to condvars.
>> * If condvars use short critical sections (ie, hold the mutex just to
>> check a binary flag or such), which they should do ideally, then forcing
>> all those waiter to proceed serially with kernel-based hand-off (ie,
>> futex ops in the mutex' contended state, via the futex wait queues) will
>> be less efficient than just letting a scalable mutex implementation take
>> care of it. Our current mutex impl doesn't employ spinning at all, but
>> if critical sections are short, spinning can be much better.
>> * Doing the requeue stuff requires all waiters to always drive the mutex
>> into the contended state. This leads to each waiter having to call
>> futex_wake after lock release, even if this wouldn't be necessary.
>>
>> Therefore, this condvar doesn't do requeue currently. I'd like to get
>> your opinion on this.
>> Once we agree on a way forward, I'd either (1) adapt the condvar to use
>> requeue or (2) adapt the _cond_ variants of the lowlevellock and
>> pthread_mutex_* to not always drive the mutex into the contended state.
>> Here's a sketch of how we could implement requeue (IOW, make sure we
>> don't requeue to the wrong mutex):
>> * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
>> whether the mutex associated with the condvar changed. Add proper
>> masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
>> * Let waiters do this:
>> if (mutex != cond->mutex) {
>> atomic_store_relaxed (&cond->mutex, newmutex);
>> atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
>> }
>> futex_wait(...)
>> * Let broadcast do:
>> s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
>> if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
>> m = atomic_load_relaxed (&cond->mutex);
>> futex_cmp_requeue (..., s + signals_to_add /* expected value */,
>> ..., m /* expected mutex */
>> This would ensure that if a futex_wait on a new mutex comes in,
>> broadcast will grab the new mutex or futex_cmp_requeue will fail (it
>> will see the signals_sent update because of futex op ordering).
>>
>>
>> PI support is "kind of" included. There is no internal lock anymore, so
>> the thing that Darren proposed the fix for is gone. So far so good;
>> however, we don't requeue, and Darren's paper states that requeue would
>> yield better latency in the PI scenario (is this still the case?).
>>
>> Darren, I'd propose that we figure out how to best adapt this new
>> implementation to do PI. I'm looking forward to your comments.
>>
>> One thing I don't think we'll be able to solve is ensuring PI during
>> quiescence. When doing quiescence, we need for all waiters to go out of
>> the condvar, so confirm their wake-up. I can't see a way of boosting
>> their priorities if they get suspended between releasing the mutex and
>> starting to wait; there's no app lock they still hold, and we can't give
>> wwaiters per-waiter locks linked from the condvar that we could use to
>> boost individual threads because this doesn't work in the process-shared
>> case. Maybe there's something I'm missing (but I though a while about
>> it ;), and maybe there's some PI-futex functionality that I wasn't aware
>> of that we could (ab)use.
>> Thus, the most practical approach seems to be to just not do any PI
>> during quiescence (ie, every 2^31 cond_wait calls). Any alternative
>> suggestions?
>>
>> This problem would go away if we had 64b futexes because then we
>> wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
>>
>>
>> Tested on x86_64-linux and x86-linux.
>>
>>
>> 2015-02-20 Torvald Riegel <triegel@redhat.com>
>>
>> [BZ #13165]
>> * nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
>> use new algorithm.
>> * nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
>> * nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
>> * nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
>> * nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
>> (__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
>> (__condvar_confirm_wakeup, __condvar_cancel_waiting,
>> __condvar_cleanup_waiting, __condvar_cleanup_quiescence,
>> __pthread_cond_wait_common): New.
>> (__condvar_cleanup): Remove.
>> * npt/pthread_condattr_getclock (pthread_condattr_getclock): Adapt.
>> * npt/pthread_condattr_setclock (pthread_condattr_setclock): Likewise.
>> * nptl/tst-cond1.c: Add comment.
>> * nptl/tst-cond18.c (tf): Add quiescence testing.
>> * nptl/tst-cond20.c (do_test): Likewise.
>> * nptl/tst-cond25.c (do_test_wait): Likewise.
>> * nptl/tst-cond20.c (do_test): Adapt.
>> * nptl/tst-cond26.c: New file.
>> * nptl/tst-cond27.c: Likewise.
>> * nptl/tst-cond28.c: Likewise.
>> * sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
>> structure.
>> * sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
>> Likewise.
>> * sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
>> Likewise.
>> * sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
>> * sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
>> (COND_CLOCK_BITS): Adapt.
>> * sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
>> * sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
>> cond_compat_check_and_clear): Adapt.
>> * sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
>> * sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
>> (__pthread_cond_timedwait): ... and move here.
>> * nptl/DESIGN-condvar.txt: Remove file.
>> * nptl/lowlevelcond.sym: Likewise.
>> * nptl/pthread_cond_timedwait.c: Likewise.
>> * sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
>> * sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
>> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
>> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
>> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
>> * sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
>>
>
>
>