This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 2/2] New pthread rwlock that is more scalable.
- From: Torvald Riegel <triegel at redhat dot com>
- To: GLIBC Devel <libc-alpha at sourceware dot org>
- Cc: "Carlos O'Donell" <codonell at redhat dot com>, "Kleen, Andi" <andi dot kleen at intel dot com>
- Date: Sat, 31 Dec 2016 18:18:58 +0100
- Subject: Re: [PATCH 2/2] New pthread rwlock that is more scalable.
- References: <1469655533.19224.27.camel@localhost.localdomain> <1469655868.19224.34.camel@localhost.localdomain>
On Wed, 2016-07-27 at 23:44 +0200, Torvald Riegel wrote:
> This replaces the pthread rwlock with a new implementation that uses a
> more scalable algorithm (primarily through not using a critical section
> anymore to make state changes). The fast path for rdlock acquisition
> and release is now basically a single atomic read-modify-write or CAS
> and a few branches. See nptl/pthread_rwlock_common.c for details.
To illustrate this, here are a few throughput measurements for rdlock
scalability, which is arguably the primary workload for a rwlock. Each
thread rdlocks and unlocks the same shared rwlock in a tight loop; the
more iterations per second, the better. This is basically the worst
case for rdlock because it creates a lot of contention on the internal
synchronization metadata: for the old rwlock, this is the internal
exclusive lock; for the new rwlock, it is the reference counter.
I ran these on my laptop, an i7-5600U.
RHEL7 glibc (old algorithm, no lock elision):
threads=1 iter/s= 30323012 timing_t=12970178479
threads=2 iter/s= 7741364 timing_t=12970177016
threads=3 iter/s= 6409745 timing_t=12970193823
threads=4 iter/s= 7699429 timing_t=12970190703
new rwlock, no lock elision:
threads=1 iter/s= 45691950 timing_t=12969859177
threads=2 iter/s= 12746855 timing_t=12969861887
threads=3 iter/s= 7611965 timing_t=12969892316
threads=4 iter/s= 9540106 timing_t=12994081288
glibc master (i.e., the old rwlock), built with --enable-lock-elision:
threads=1 iter/s= 42149798 timing_t=12970179413
threads=2 iter/s= 81666730 timing_t=12970174752
threads=3 iter/s= 103935069 timing_t=12970181048
threads=4 iter/s= 85825402 timing_t=12970180466
The new rwlock still suffers from contention with empty reader
critical sections, but significantly less so than the old algorithm.
Lock elision obviously helps here because the transactions are as small
as possible (i.e., no user code). However, lock elision is not without
cost: single-thread throughput is a little lower than with the new
algorithm (I haven't tested lock elision on top of the new algorithm,
but I guess the overhead would not be much different).
When adding actual work, the difference in scalability between the new
and the old algorithm becomes more obvious. Each thread now performs R
random reads to 1MB of integers in thread-private memory between rdlock
sections and, within the rdlock sections, R random reads to 1KB of
memory shared by all threads (where the rwlock instance resides too).
The RNG is Marsaglia's xorshift.
The new rwlock's scalability becomes flat when setting R to about 20 to
25 (because this decreases contention on the rwlock-internal
synchronization metadata).
New rwlock with R=25:
threads=1 iter/s= 8942220 timing_t=12969986360
threads=2 iter/s= 10000033 timing_t=12969991729
threads=3 iter/s= 10544801 timing_t=12969994877
threads=4 iter/s= 10122191 timing_t=12993794742
Old rwlock with R=25:
threads=1 iter/s= 8681897 timing_t=12969978511
threads=2 iter/s= 3803653 timing_t=12969987348
threads=3 iter/s= 4266642 timing_t=12969992764
threads=4 iter/s= 4293744 timing_t=12970267963
When setting R to 60, the old rwlock starts to scale beyond 2 threads:
Old rwlock with R=60:
threads=1 iter/s= 3742926 timing_t=12969952203
threads=2 iter/s= 5654611 timing_t=12969956795
threads=3 iter/s= 4203466 timing_t=12969956780
threads=4 iter/s= 4553678 timing_t=12969957191
New rwlock with R=60:
threads=1 iter/s= 3778624 timing_t=12969951167
threads=2 iter/s= 5704772 timing_t=12969953805
threads=3 iter/s= 6765092 timing_t=12969955128
threads=4 iter/s= 7437057 timing_t=12969985198
Old rwlock with lock elision and R=60:
threads=1 iter/s= 3157614 timing_t=12969969763
threads=2 iter/s= 6032500 timing_t=12969969857
threads=3 iter/s= 7497510 timing_t=12969975510
threads=4 iter/s= 8421999 timing_t=12969977476
As you can see, the new rwlock scales much better beyond 2 threads. The
numbers depend quite a bit on things we currently don't do effectively
anywhere in glibc: spinning (though that is not relevant for the new
rwlock in this benchmark) and back-off. Improving both our mutexes (for
the internal mutex in the old rwlock) and the back-off in some of the
CAS loops in the new rwlock would probably improve scalability somewhat
for both the old and the new rwlock. However, even with a highly
scalable mutex, a short critical section will not be faster than a
single atomic operation such as a simple CAS or an atomic increment of
a reference counter.
Lock elision still performs best beyond 1 thread, but is even worse at
a single thread.
Finally, lock elision isn't generally a win. If we just add 3 writes
(W) to random integers in the 1KB of shared data in the rdlock sections,
then glibc master's lock elision performs worse than just the old rwlock:
Old rwlock with lock elision and with R=60 and shared W=3:
threads=1 iter/s= 2942579 timing_t=12969979390
threads=2 iter/s= 3156532 timing_t=12969978902
threads=3 iter/s= 3372049 timing_t=12969977253
threads=4 iter/s= 3389924 timing_t=12969983496
Old rwlock with R=60 and shared W=3:
threads=1 iter/s= 3401302 timing_t=12969985299
threads=2 iter/s= 3857156 timing_t=12969982983
threads=3 iter/s= 4036774 timing_t=12969979487
threads=4 iter/s= 4044128 timing_t=12969979431
The lock elision adaptation code should keep lock elision from
performing significantly worse than the base algorithm, but apparently
this doesn't work in practice (at least not for this workload -- though
it was the first configuration with writes that I tried, so I didn't
exactly have to hunt for a counterexample). One could argue that rdlock
sections shouldn't write at all, but nothing prevents a program from
issuing writes in a rdlock section, and false sharing can happen easily
in practice.
For comparison, the new rwlock with R=60 and shared W=3:
threads=1 iter/s= 3597365 timing_t=12970014905
threads=2 iter/s= 4924836 timing_t=12969993554
threads=3 iter/s= 5504998 timing_t=12969987053
threads=4 iter/s= 6311467 timing_t=12969983615
Therefore, I think the right thing to do is to not enable lock elision
for the new rwlock, at least in the sense of not simply taking over the
current adaptation mechanism and tuning parameters. We need more
performance data and benchmarks / regression testing before we can
enable elision and be confident that we won't degrade performance.
Cases like the one above (i.e., the old rwlock with vs. without lock
elision) shouldn't happen.
Happy New Year!