This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: pthread condvars and priority inheritance

From: "Rodriguez, Sergio SF" <sergio dot sf dot rodriguez at intel dot com>
To: "triegel at redhat dot com" <triegel at redhat dot com>
Cc: "tglx at linutronix dot de" <tglx at linutronix dot de>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
Date: Sat, 9 Dec 2017 00:36:49 +0000
Subject: Re: pthread condvars and priority inheritance
Authentication-results: sourceware.org; auth=none
References: <1506384583.26641.5.camel@intel.com> <ecc27a37-3795-72e4-12ef-33b38cf17c22@redhat.com> <1506464757.4521.31.camel@intel.com> <1507118951.17451.36.camel@redhat.com> <1507249847.14674.1.camel@intel.com> <1507282963.17451.160.camel@redhat.com> <1507595215.4405.1.camel@intel.com> <1507642894.17451.330.camel@redhat.com> <1507738853.2856.4.camel@intel.com> <1508974760.1789.1.camel@intel.com>

> > Maybe this can be a 2-step approach, first remove the quiescence
> > with
> > version info outside of the 32b futex just for specific real-time
> > application to avoid priority inversion with this trade-off or
> > limitation on the signal count in 32b futex. 64b futex which
> > requires
> > more discussion can be the next step to be added to address the 32b
> > overflow.
> > 
> I did some experiment removing quiescence as I wanted to see the
> effect. I also created a test case that can generate priority
> inversion. I wanted to see if the priority inversion would happen
> when
> there is no quiescence. I see dis-ordered wake-up when there is no
> quiescence. If I understood correctly, the purpose of quiescence is
> 1)
> to make sure waiters in group 1 is empty before re-using the group
> memory (by group switch), and 2) to maintain the ordering of group
> wake-up. The stronger wake-up ordering implemented by current scheme
> is
> achieved not just by group switch but also along with the quiescence.
> We need some kind of information to tell there are still older
> waiters
> that have not consumed the signals and that we should not send a
> signal
> (as the signal would wake a waiter from more recent group before the
> older waiters are woken). I have not found a way to achieve this
> without quiescence.
> 
> Maybe we should explore using priority inheritance on top of current
> scheme, instead removing the source of priority inversion as I had
> attempted to do. I will look at that next.
> 

Hi Torvald,

Just to follow up the previous discussion. I am looking into an option
on adding PI so that we can boost waiter priority in order for the
waiter to notify signaler that they have been woken up so that the
issue of priority inversion caused by the group quiescence blocking can
be avoided. I’d like to get some feedback before I start putting effort
on implementing this proposal.

I intend to keep the group quiescence which is required by the scheme
for maintaining group wake-up which is the purpose of the current
scheme and also to avoid ABA issue. Also removing the quiescence would
cause disordered wake up as described in my previous experiment.

The proposal involves a need for a new futex operation and I plan to
also send a proposal of that to lkml if this looks promising.

For adding PI support, what we are looking is a way to boost the
priority of the waiter thread so that the waiter thread would not be
preempted and delayed to wake the signaler thread. The
pthread_cond_wait() does not hold any lock as the g_refs is just a
futex and not a lock. It seems what we are looking is a way to boost a
thread priority without the thread actually holding a lock and the
priority boosting needs to start from when we are going into the
quiescence blocking for until the last waiter is woken.

Consider a new futex operation which I named it FUTEX_WAIT_WAITERS_PI
for the discussion. Below is a rough high level description.

FUTEX_WAIT_WAITERS_PI

Arguments - uaddr, val

This futex operation would check if there is any pending waiters on the
futex uaddr based on the kernel maintained waiter tree (which would be
a priority-based rbtree). If there is no pending waiters, the operation
will return to user space. If there is still pending waiters, the
kernel will boost the priority of pending waiters to the priority of
the calling thread, if the calling thread has higher priority, before
putting the calling thread to sleep. When the last waiter is being
woken up, kernel would also set up wake queue to wake up the calling
thread of FUTEX_WAIT_WAITERS_PI.

Just to get some idea how it would be used in the pthread_cond_*:

diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
index ffbbde4106..21403a66df 100644
--- a/nptl/pthread_cond_common.c
+++ b/nptl/pthread_cond_common.c
@@ -409,7 +409,7 @@ __condvar_quiesce_and_switch_g1 (pthread_cond_t
*cond, uint64_t wseq,
          r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1);

          if ((r >> 1) > 0)
-           futex_wait_simple (cond->__data.__g_refs + g1, r, private);
+           {
+             unsigned int signals = atomic_load_acquire (cond-
>__data.__g_signals + g1);
+             futex_wait_waiters_pi (cond->__data.__g_signals + g1,
signals, private);
+           }

          /* Reload here so we eventually see the most recent value
even if we
             do not spin.   */
          r = atomic_load_relaxed (cond->__data.__g_refs + g1);
diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index 7812b94a3a..691ef5b6f9 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -148,17 +148,7 @@ __condvar_cancel_waiting (pthread_cond_t *cond,
uint64_t seq, unsigned int g,
 static void
 __condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int
private)
 {
-  /* Release MO to synchronize-with the acquire load in
-     __condvar_quiesce_and_switch_g1.  */
-  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
-    {
-      /* Clear the wake-up request flag before waking up.  We do not
need more
-        than relaxed MO and it doesn't matter if we apply this for an
aliased
-        group because we wake all futex waiters right after clearing
the
-        flag.  */
-      atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned
int) 1);
-      futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
-    }
+  atomic_fetch_add_release (cond->__data.__g_refs + g, -2);
 }

The FUTEX_WAIT_WAITERS_PI operation would do just what the g_refs futex
would have done, i.e. putting the signaler thread to sleep when there
is still pending G1 waiters not woken up but without having to use
another futex, g_refs. The priority boosting of the waiter threads
would be supported by the PI part of the futex operation (i.e. RT-mutex 
backend).

The other benefit would be that we may not need to track the pending
waiter counts as we are tracking in g_refs currently as that is
achieved by the proposed FUTEX operation. So as a result, we may be
able to remove the g_refs from the pthread_cond_t structure.
 
I am also looking into the other case of priority inversion caused by
the internal lock used in the pthread_cond_signal() and
pthread_cond_broadcast(). The internal lock is implemented using 2 bits
from the pthread_cond_t structure and does not support PI. One option I
am currently prototyping is to use the existing FUTEX_LOCK_PI and
FUTEX_UNLOCK_PI to address that. This will be similar to what Darren
had previously done for 2.20 glibc condvar to fix a PI gap caused by an
internal lock in that version of condvar.
 
The other case I also hope to address is the thundering herd issue in
current pthread_cond_broadcast(). I think we can change to use the
existing FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex
operations. I have this part of the implementation with some initial
tests done and am currently doing more stress test.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]