This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] improving spinning in adaptive mutexes
- From: Jeff Law <law at redhat dot com>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>
- Date: Fri, 08 Mar 2013 13:08:11 -0700
- Subject: Re: [RFC] improving spinning in adaptive mutexes
- References: <1362067023.581.14702.camel@triegel.csb>
On 02/28/2013 08:57 AM, Torvald Riegel wrote:
Adaptive mutexes are a mix between spinlocks and normal mutexes: they
spin for a while trying to get the lock (in userspace), and only resort
to blocking by using futexes when they can't acquire the lock after the
limited amount of spinning. This is meant to improve scalability and
per-thread overheads especially for smaller critical sections that are
subject to at least some contention.
I thought I read at one time that Solaris didn't to "normal" mutexes --
they were always adaptive. I can't seem to find the reference though.
Regardless, the whole point of adaptive mutexes is to improve
scalability and from what it appears glibc's implementation has some
undesirable problems.
I've looked at the current implementation (triggered by a former AMD
employee mentioning that the maximum amount of spinning would be much
too low), and there are indeed a few issues. I'll summarize those in
what follows, and how we could improve it. In the long term, this could
also be used to improve the default mutexes, I hope.
As a point of reference, didn't that AMD employee claim that the current
spin count was less than the L2 miss penalty or something similar? If
so, doesn't that dramatically reduce the value of an adaptive mutex?
Overall, this leads to a latency curve that is more or less a step
function, and thus not really simple for automatic tuning. Especially
when we don't know about the length of CS and the level of contention
(ie, how many threads try to acquire the lock).
In general, I don't see that we'll ever have a good indication of the
length of a CS and level of contention for the general case.
Even for code where we had that information, we don't have any
reasonable way to provide it (that I'm aware of) to help with selecting
the right primitives from a performance standpoint.
--- Inefficient spinning
1) No spinning using loads but with just trylock()
- trylock() doesn't first check whether the lock is free using a load,
which can be beneficial in the case of uncontended locks (based on
related measurements I did a few years ago; I don't have current
numbers)
- If the atomic instruction in trylock() puts the lock's cache line into
exclusive state right away, we cause cache misses with the spinning.
I'll note that the model where we first see if the lock is free, then
actually trylock is amenable to probing to gather performance data on
lock contention. It's a general structure I was going to suggest for
glibc so that we could put in probes that would give us valuable data
about lock contention.
However, I think this adjusting the max spinning iterations based on the
wrong measurement:
- If we have short critical sections, the scheme reduces the spinning
count. But why? If they are short, we better spin and grab the
benefits. If the max spinning is too large, it won't matter most of the
time. However, if it's too short, we don't spin even though we should.
(Think of the acquisition latency as a bell curve; if we cut off at the
average, we're sort of missing half the benefit.)
Makes sense. If we converge on N, spinning for a slightly longer time
is still better than blocking. It's really a range we want to look at
and the range is probably defined by how expensive it is to block plus
either wakeup time or time to switch in another thread to do useful work.
=== Possible steps to improve this
--- Adaptive mutexes
This list is ordered by what might make most sense to try first:
(1) Improve the spin loop itself: loads instead of just trylock()
In the spin loop, use loads until the lock appears free, then use
trylock. A relatively simple change, but we should still measure the
impact on several systems and archs to better understand the benefits I
suppose, including in cases such as just two threads (where there is
less contention).
We then probably also need to at least adjust MAX_ADAPTIVE_COUNT.
Obviously given what we've heard about tile, this will have to be
controlled by the target somehow. However, it seems like a good thing
to do to me.
(1.5) Add more probe points (or similar) to adaptive mutexes?
Perhaps we can derive already quite a bit about mutex usage in real
programs based on the probes we already have in place and any in-kernel
stats, we don't have any spinning-specific probes. Would those be
necessary / helpful?
Definitely. And much easier to do if you use the spin on testing the
lock, then use trylock once the lock appears free.
(2A) Disable adaptation of max spinning. Change max spinning count?
This assumes that we don't have a good understanding of how to adapt
this properly (see 2B below) and won't have this in the near future
either. So, don't do any adaption. Check that the maximum spinning
count makes sense, and go with that.
Perhaps the max spinning should be derived from the futex wake-up
latency, because it affects the lock acquisition latency curve a lot.
But that seems to lose the value in adaptive mutexes; instead we should
look to fix the spinning adaptation.
It doesn't have to be perfect. It may be as simple as looking at the
sweet spot for spinning as a range and trimming the range based on a
feedback loop.
Obviously we need to fix the max spinning count as well.
(3) Add back-off
The back-off implementation itself is simple, but tuning it isn't I
think. You need to trade off the chance of reducing contention against
the chance of introducing additional latency. Yet some kind of
randomized back-off is widely believed to be an essential ingredient.
Perhaps we can reuse some of the recent Linux kernel work in this area
(http://lwn.net/Articles/531254/).
Randomized backoff is goodness. Hell, that's been known for decades.
Jeff