This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



[RFC] improving spinning in adaptive mutexes


Adaptive mutexes are a mix between spinlocks and normal mutexes: they
spin for a while trying to get the lock (in userspace), and only resort
to blocking by using futexes when they can't acquire the lock after the
limited amount of spinning.  This is meant to improve scalability and
per-thread overheads especially for smaller critical sections that are
subject to at least some contention.

I've looked at the current implementation (triggered by a former AMD
employee mentioning that the maximum amount of spinning would be much
too low), and there are indeed a few issues.  I'll summarize those in
what follows, and how we could improve it.  In the long term, this could
also be used to improve the default mutexes, I hope.

Please comment.  Even if you don't care about adaptive mutexes, any
historical background information would be useful to improve the glibc
docs (i.e., for a wiki page or similar about our mutex implementations,
like what we've been recently doing for lock elision).

(TLDR version: There are a few issues: The spinning implementation is
inefficient, the spinning duration adaptation at runtime is probably
flawed, and the fixed maximum spinning duration is probably too small.
It might be best to first make the spinning more efficient, and then see
whether the adaptation can be improved, or whether we don't understand
how to do this yet and thus are better off with no attempt at runtime
adaptation.)

=== Background: why we want to spin

When a thread tries to acquire a free lock, both adaptive and normal
(ie, those without any spinning) mutexes do the same.  However, if the
lock is already acquired, normal mutexes signal to the lock holder that
they are about to block (setting the futex var to 2), and then block
using futexes.  This works well when the duration for which the lock is
held is significantly larger than the time required to block and
unblock.

However, when the block/unblock latency is longer than the length of a
critical section (CS), the former can limit scalability of the lock/CS
more than necessary.  Likewise, if unblocking takes longer than
acquiring the lock in userspace with atomic ops, scalability is also
decreased.  (Note that not all parts of lock acquisition/release latency
will necessarily slow down all other threads.)

Second, when we block using futexes, we can suffer from contention on
the futex var and, additionally, on what's used by the kernel to
synchronize the futex.

If we spin on the mutex (ie, try to acquire it without blocking), we
potentially introduce additional contention on the futex var, but in
turn we get a chance at decreasing the mutex acquisition/release
latency.  We also avoid creating contention on the kernel side of the
futex (although that's not necessarily a correct comparison, because the
kernel-side lock/futex implementation probably has different performance
characteristics).

--- Minimum lock acquisition latency

The "minimum latency" is how long it takes for a thread to acquire the
lock, measured from the time at which the lock-holding thread is ready
to release the lock.  Simplified, we have three consecutive phases after
the initial attempt to acquire the mutex:

1) Spinning: 2 cache misses + back-off
- 1 miss at the lock holding thread (assuming lock is contended, so
other threads are spinning already).
- 1 miss at the thread that acquired the lock.  Plus the latency of any
back-off used there to decrease contention.

2) Becoming blocked: minimum futex wake-up latency +/- X
- This is kind of a transition phase.  We do the syscall, the kernel
compares the futex var with the expected value, and if they match we get
put into the set of waiting threads.
- We at least have the latency of the syscall until the point at which
the kernel compares the futex var with the expected value.  This might
be smaller than the minimum futex wake-up latency.
- If the futex var's value matches the expected value, X will be
positive because we need to be put into the set of waiters, which can't
be interrupted I guess.

3) Ready to be unblocked: minimum futex wake-up latency
- Once we're in the set of waiting threads, we just need to be woken up.

Note that both 2) and 3) also suffer at least 2 cache misses; after
blocking, a thread needs to acquire the mutex in userspace.

If the lock-holding thread and the acquiring thread compete for the same
logical CPU (perhaps indirectly via some other dependency), phase 1)
will see a larger latency (i.e., additionally the time it takes until
the lock holder will get scheduled).
One can argue that this is the risk of spinning, but I don't think this
is a sufficient reason not to spin (even in non-adaptive mutexes).  If
there's a need for context switches, this would be a performance issue
anyway.  Also, hardware is getting more parallel, so it should become
less likely that threads will have to share a logical CPU.

Overall, this leads to a latency curve that is more or less a step
function, and thus not really simple for automatic tuning.  Especially
when we don't know about the length of CS and the level of contention
(ie, how many threads try to acquire the lock).


=== Issues in the current implementation

--- Inefficient spinning

1) No spinning using loads, just trylock()
- trylock() doesn't first check with a load whether the lock is free;
such a check can be beneficial even in the case of uncontended locks
(based on related measurements I did a few years ago; I don't have
current numbers)
- If the atomic instruction in trylock() puts the lock's cache line into
exclusive state right away, we cause cache misses with the spinning.

2) No back-off
- We spin in a tight loop, so we will get unnecessary cache misses
whenever someone else modifies the lock's cache line (either because
the lock is colocated with application data, or because another thread
beats us to acquiring the lock, for example).

--- Current automatic tuning scheme is probably flawed

In the current tuning scheme, adaptive mutexes spin in a loop for a
number of iterations stored per mutex (__spins).  The number of
iterations is also capped by the MAX_ADAPTIVE_COUNT constant (after
which we resort to blocking).  The scheme tries to make __spins converge
to the number of iterations it took until we acquired the lock (or
MAX_ADAPTIVE_COUNT in case we didn't acquire it during the spinning
phase).  Thus, this more or less measures the (exponential average) time
required to acquire a lock, which depends on the length of CS and the
number of threads trying to acquire it.

However, I think this adjusts the maximum spinning iterations based on
the wrong measurement:
- If we have short critical sections, the scheme reduces the spinning
count.  But why?  If they are short, we better spin and grab the
benefits.  If the max spinning is too large, it won't matter most of the
time.  However, if it's too short, we don't spin even though we should.
(Think of the acquisition latency as a bell curve; if we cut off at the
average, we're sort of missing half the benefit.)
- If we have long critical sections, the maximum spinning either doesn't
matter in practice, or if it does, then we probably shouldn't spin at
all, so as to not spin without success and then have to block (which
might lead to longer delays than blocking right away if we're unlucky).
- We don't consider the futex wake-up latency.  Perhaps this influenced
the choice of the magic number for MAX_ADAPTIVE_COUNT (see below), but I
wouldn't bet on this.
- We don't distinguish between the length of CS and the acquisition
latency: If many threads try to acquire the lock, yet CS are short, we
might want to continue spinning.

--- Max spinning duration probably too small

MAX_ADAPTIVE_COUNT is fixed to 100.  There's no comment in the code
regarding where this number came from.  The code was added in 2004, when
CAS latency was higher than today, so back then, it might have resulted
in several thousand cycles of spinning at the high end.  But CAS latency
when the line is in the cache has become much faster (e.g., see Paul
McKenney's measurements), so this likely results in a shorter maximum
spinning time on current hardware.  (For example, think about the
two-thread case: the spinning thread will own the line most of the
time.)

I've measured a minimum futex latency of about 180ns when the futex
var's value *doesn't* match; thus, this is just the time it takes the
kernel to tell us that we actually should block.
(Test is a microbenchmark: average over a simple loop with futex_wait
calls running for about 10s, single thread, i7 @ 2.5 GHz).

When a futex_wait call actually blocks and needs to be woken by a
futex_wake in another thread, I see a wake-up latency of about 2900ns.
This is measured with one thread that constantly wakes up all other
waiting threads; the latter just wait in a loop, and this measures how
long a single iteration of this wait loop takes.

The time that spinning takes with MAX_ADAPTIVE_COUNT==100 iterations
(with CAS that never succeeds to acquire the dummy lock) is about
1200ns.  With loads instead of CAS, this takes about 400ns.
(Similar test, with code similar to the adaptive mutex's loop; perhaps
the real loop takes a bit longer due to less inlining).

Thus, the minimum futex wake-up latency seems to be significantly larger
than the time for which we try spinning.  This is not necessarily a bad
thing, depending on the distribution of the duration of actual CS.  But
it doesn't seem right, because we spin for a shorter time than the
latency hit we take when we should have spun a little bit longer.


=== Possible steps to improve this

--- Adaptive mutexes

This list is ordered by what might make most sense to try first:

(1) Improve the spin loop itself: loads instead of just trylock()
In the spin loop, use loads until the lock appears free, then use
trylock.  A relatively simple change, but we should still measure the
impact on several systems and archs to better understand the benefits I
suppose, including in cases such as just two threads (where there is
less contention).
We then probably also need to at least adjust MAX_ADAPTIVE_COUNT.

(1.5) Add more probe points (or similar) to adaptive mutexes?
Perhaps we can already derive quite a bit about mutex usage in real
programs based on the probes we have in place and any in-kernel stats,
but we don't have any spinning-specific probes.  Would those be
necessary / helpful?

(2A) Disable adaptation of max spinning.  Change max spinning count?
This assumes that we don't have a good understanding of how to adapt
this properly (see 2B below) and won't have this in the near future
either.  So, don't do any adaption.  Check that the maximum spinning
count makes sense, and go with that.
Perhaps the max spinning should be derived from the futex wake-up
latency, because it affects the lock acquisition latency curve a lot.

(2B) Improve adaptation of max spinning.  Change max spinning count?
In contrast to 2A, it would be good if we can adapt automatically.  We
probably could hide a lot of the overhead of this in the slow path that
spinning (or back-off) are in any case (ie, when we can't acquire the
lock we wanted to acquire).
However, it seems to be hard to find a scheme that doesn't rely on many
rules of thumb.  We'd need the CS duration distribution (plus #threads,
so perhaps the lock acquisition latency distribution would be better),
not just an average, and we need to have a good understanding of the
cache miss costs (which vary) and the futex wake-up latency (including
all other costs that are caused by actually blocking (and thus perhaps
switching contexts)).  We currently don't have the data to drive this in
a computational science kind of way (see below).

(3) Add back-off
The back-off implementation itself is simple, but tuning it isn't, I
think.  You need to trade off the chance of reducing contention against
the chance of introducing additional latency.  Yet some kind of
randomized back-off is widely believed to be an essential ingredient.
Perhaps we can reuse some of the recent Linux kernel work in this area
(http://lwn.net/Articles/531254/).

--- Testing including finding performance regressions in the future

So far my experiments have been with a microbenchmark, which runs loops
with
    lock(); waste_time(A); unlock(); waste_time(F)
on several threads.  waste_time is a simple busy waiting loop, and A and
F are customizable durations crudely calibrated to real time.  No
randomization or similar (eg, make A/F a distribution of some kind).
I've just taken samples for now.

While that can be a useful benchmark (if looking at several A/F/threads
combinations), it's obviously limited.  For example:
- We don't know which A/F/threads combinations are most common in
practice.  So which ones should we optimize for?
- We don't know how common it is for threads competing for a lock to
share a logical CPU.
- No consideration of other memory effects (eg, cache misses in the CS)
- No proper model for the kernel-side futex performance (eg, wake-up
latency)

Ideally, we'd be making tuning decisions on lots more data.  But where
do we get it from?
- How do we extend the coverage (in terms of mutex workloads) without
requiring more manual runs?
- Do we need to measure in real applications, or are benchmarks
sufficient?
- How do we get a continuous stream of measurements so that we're
actually able to see performance regressions?
- What do we classify as an actual performance regression, where's the
threshold?

Do we have any other mutex-related (or pthreads-sync-primitives-related)
benchmarks?

The performance assessment problem is a general issue that we have, I
believe (e.g., it's similar for the math functions, on an abstract
level).  So while it's not mutex- nor pthreads-code-specific, it affects
pthreads code a lot because performance is even harder to measure in a
concurrent setting compared to a sequential one.

--- Experiments so far

I've just sampled a few cases so far, on a single machine.  In a
nutshell, even the current kind of spinning is beneficial when a lock
gets contended.  For example, with A=F=1000 (in ns), 2 threads, adaptive
yields almost twice as much throughput in the microbenchmark as
non-adaptive.  If F >> A*threads, then the difference becomes less
because it's more likely that a lock can be acquired right away.  Also,
spinning with more than 100 iterations (ie, more than MAX_ADAPTIVE_COUNT
currently) performs even better in this A/F setting.

However, with A=F=0 (so lots of contention on the lock) and 2 threads,
the current spinning is *not* beneficial unless both threads are on the
same core (the test machine has hyperthreading); if the threads are on
different cores, the back-off introduced by actually blocking seems to
be more useful for overall throughput, even though lock acquisition
latency can have higher peaks in this case.

While both of these are just examples and I haven't done a larger set of
experiments, I believe they show that there is potential for
improvement.

=== Related stuff and long-term tasks

I've seen that there were past approaches that tried spinning on the
kernel side of futexes; does anyone know what happened to this work?

Assuming that we can get adaptive mutexes to consistently outperform
nonadaptive mutexes, I think it would make sense to add a conservative
version of spinning to normal mutexes too (at least the non-PI ones).
This has the risk that programs might be using normal mutexes with the
expectation that they will block immediately (e.g., to force a context
switch to another thread sharing the same CPU), and that custom spinning
that uses normal mutexes for the blocking path would get different
performance characteristics.  However, both risks seem insignificant to
me compared
to the likely many uses in which mutexes are just used as-is with the
assumption that they should just work fine (e.g., I'm not sure many
developers would experiment with spinlocks or adaptive mutexes).  Are
there any other reasons that might make adding spinning to normal
mutexes risky?

We don't consider NUMA effects in mutexes, cond vars, rwlocks, etc.  But
NUMA certainly has an effect.  I can think of two rough directions we
could go.
Either we rely on the kernel to do something sensible: Whenever we
suspect NUMA effects to matter (how do we detect this reliably?), block
eagerly so that the futex implementation in the kernel can sort this
out.
Alternatively, we could try to change the mutex implementation etc. to
be more NUMA-friendly.  This would be a big task.  I don't know how many
programs are out there for which the pthread sync primitives are
performance-critical (and which haven't implemented their own primitives
despite that); thus, it's hard to judge how much benefit this would
give.


Torvald


