This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.


On Thu, Jul 02, 2015 at 03:25:02PM +0200, Torvald Riegel wrote:
> > > > We can't use the old algorithm, which tries to avoid spurious wake-ups,
> > > > anymore because there's no way (AFAICT) to wake in FIFO order from a
> > > > futex (the current Linux implementation may do today, but it's not
> > > > guaranteed). 
> > 
> > And what is the performance difference between the old algorithm and
> > the proposed one?
> 
> First of all, this is a bug fix, so any non-catastrophic performance
> concerns are secondary.
> 
> Regarding your question, which performance aspect are you interested in?
> Roughly, I'd say that the new algorithm should be faster and more
> scalable in the common use cases (e.g., compare the critical sections
> vs. just operating on wseq and ssent).
>
Ok, your original description hinted that the previous algorithm was
faster but incorrect.
 
> > > > * The scalability problem is actually at the mutex side; the condvar
> > > > could help (and it tries to with the requeue optimization), but it
> > > > should be the mutex who decides how that is done, and whether it is done
> > > > at all.
> > > > * Forcing all but one waiter into the kernel-side wait queue of the
> > > > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > > > prevents the only cure against the underlying scalability problem
> > > > inherent to condvars.
> > 
> > How exactly does that harm lock elision?
> 
> You serialize the woken-up waiters, and lock elision tries to run
> non-conflicting critical sections in parallel.  Thus, serializing
> up-front prevents lock elision from trying to run them in parallel.
>
Which doesn't help when the waiters cause the lock to be contended.
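
For clarity, this is the kind of "non-conflicting critical sections"
that the elision argument assumes (a rough C sketch, names made up):
two woken waiters that only touch their own slot under the same mutex
could, with elision, commit their sections in parallel, whereas handing
the lock over through the futex wait queue serializes them regardless.

  /* Made-up illustration: each woken waiter updates only its own
     slot, so the two critical sections don't conflict.  With lock
     elision they could commit concurrently; a kernel-side hand-off
     of the mutex serializes them anyway.  */
  #include <pthread.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static int slot[2];

  void
  post_wake_work (int self)
  {
    pthread_mutex_lock (&lock);   /* candidate for elision */
    slot[self]++;                 /* touches only this thread's data */
    pthread_mutex_unlock (&lock);
  }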
 
> > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > check a binary flag or such), which they should do ideally, then forcing
> > > > all those waiters to proceed serially with kernel-based hand-off (ie,
> > > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > > be less efficient than just letting a scalable mutex implementation take
> > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > if critical sections are short, spinning can be much better.
> > 
> > That looks like it would complicate the code too much; how do you want
> > to pass the information to differentiate between signal and broadcast?
> 
> I don't understand your question.  How does it relate to my paragraph
> you replied to?
>
A problem is that the mutex doesn't just protect a short critical
section; for example, in a C++ monitor another method may hold the lock
for a long time.
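
Roughly what I have in mind, as a C sketch with made-up names: the same
mutex guards both a short wait on a flag and a long operation, so a
woken waiter can run straight into the long critical section of another
method of the monitor.

  /* Made-up sketch: one mutex protects both a short flag check and a
     long operation, as in a C++ monitor whose other member function
     holds the lock across slow work.  */
  #include <pthread.h>
  #include <stdbool.h>
  #include <unistd.h>

  struct monitor
  {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready;
  };

  /* Short critical section: just check a flag.  */
  void
  monitor_wait_ready (struct monitor *m)
  {
    pthread_mutex_lock (&m->lock);
    while (!m->ready)
      pthread_cond_wait (&m->cond, &m->lock);
    pthread_mutex_unlock (&m->lock);
  }

  /* Long critical section: the same lock is held across slow work.  */
  void
  monitor_long_operation (struct monitor *m)
  {
    pthread_mutex_lock (&m->lock);
    sleep (1);                    /* stands in for the long work */
    m->ready = true;
    pthread_cond_broadcast (&m->cond);
    pthread_mutex_unlock (&m->lock);
  }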
 
> > > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > > into the contended state.  This leads to each waiter having to call
> > > > futex_wake after lock release, even if this wouldn't be necessary.
> > > >
> > 
> > That is the most important point. My hypothesis is that the mutex will
> > almost always be unlocked for the threads after the wake.
> 
> Well, that depends on how both the mutex and the condvar are used,
> actually.
> 
I would say that scheduling has more impact than the use case, as I
tried to describe below.

> > The question here is how the wake and the scheduler interact. A bad
> > case would be an efficient wake that schedules the threads onto free
> > cores simultaneously, so they would all collide. It would be better if
> > there were, say, a 1000-cycle delay between the different cores, since
> > by the time the next thread tried to take the lock the previous thread
> > would already be done.
> 
> I don't understand those sentences.
> 
I modeled the case where, after a broadcast, the woken threads don't
block again until they have released the lock.

Initially the threads occupy the k free cores and there is contention.
After k threads have acquired the lock the situation stabilizes, as each
core runs its thread until it blocks. Contention then becomes less
likely, because it requires two threads to reach the lock within an
interval shorter than the time it takes one thread to acquire the lock,
do its work, and release it. If in the initial phase the threads are not
started simultaneously but with some delay, contention could be even
lower, as an earlier thread may already have completed its job.
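
A toy way to measure the scenario above (made-up names and parameters,
not a rigorous benchmark): wake NTHREADS waiters with one broadcast and
count how many of them find the mutex already taken when they run their
short post-wake critical section.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NTHREADS 8

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  static bool go;
  static atomic_int contended;

  static void *
  waiter (void *arg)
  {
    (void) arg;
    pthread_mutex_lock (&lock);
    while (!go)
      pthread_cond_wait (&cond, &lock);
    pthread_mutex_unlock (&lock);

    /* Short critical section after the wake; record whether the lock
       was free (fast path) or already held by another waiter.  */
    if (pthread_mutex_trylock (&lock) != 0)
      {
        atomic_fetch_add (&contended, 1);
        pthread_mutex_lock (&lock);
      }
    /* ... do a small piece of work under the lock ... */
    pthread_mutex_unlock (&lock);
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&t[i], NULL, waiter, NULL);
    sleep (1);                          /* let everyone reach the wait */

    pthread_mutex_lock (&lock);
    go = true;
    pthread_cond_broadcast (&cond);
    pthread_mutex_unlock (&lock);

    for (int i = 0; i < NTHREADS; i++)
      pthread_join (t[i], NULL);
    printf ("%d of %d waiters found the lock contended\n",
            atomic_load (&contended), NTHREADS);
    return 0;
  }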

> > So how does that behave in practice?
> 
> I don't understand this question.
>
I just wanted to know how often the lock takes the fast path after a
broadcast.
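
By the fast path I mean the usual userspace-only path of a futex-based
lock. This is just the textbook 0/1/2 scheme as a sketch, not glibc's
actual lowlevellock code: the kernel is entered only in the contended
case, which is exactly the state the requeue approach forces on the
mutex.

  /* Textbook 0/1/2 futex lock (a sketch, not glibc's lowlevellock):
     0 = unlocked, 1 = locked, 2 = locked with possible waiters.  */
  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  void
  sketch_lock (atomic_int *l)
  {
    int zero = 0;
    if (atomic_compare_exchange_strong (l, &zero, 1))
      return;                            /* fast path: no syscall */
    while (atomic_exchange (l, 2) != 0)  /* mark contended and sleep */
      syscall (SYS_futex, l, FUTEX_WAIT, 2, NULL, NULL, 0);
  }

  void
  sketch_unlock (atomic_int *l)
  {
    if (atomic_exchange (l, 0) == 2)     /* wake only if contended */
      syscall (SYS_futex, l, FUTEX_WAKE, 1, NULL, NULL, 0);
  }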
 
> > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > yield better latency in the PI scenario (is this still the case?).
> > > >
> > The problem is that while the kernel keeps a FIFO order, requeueing
> > gives you fairness, whereas with waking everything an important thread
> > could be buried by lower-priority threads that after each broadcast do
> > something small and wait for the broadcast again.
> 
> If you wake several threads, then those with the highest priority will
> run.

While they will run, that doesn't mean they will win the race. For
example, you have a 4-core CPU with 3 low-priority threads and one
high-priority one. Each thread checks the condition; if it is true, it
consumes the data, making the condition false; if the condition is not
met, it waits again. Then the high-priority thread has only a 1/4 chance
of being the first to get the lock and consume the data. With requeueing
it could always do so.
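
The pattern I mean, as a minimal C sketch (made-up names): after a
broadcast every woken thread re-checks the condition, and whichever one
re-acquires the mutex first consumes the data; nothing in that race
prefers the high-priority thread.

  #include <pthread.h>
  #include <stdbool.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  static bool data_available;

  void
  consume_one (void)
  {
    pthread_mutex_lock (&lock);
    while (!data_available)
      pthread_cond_wait (&cond, &lock);
    data_available = false;     /* whoever re-acquires first wins */
    pthread_mutex_unlock (&lock);
    /* ... process the data outside the lock ... */
  }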

