This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 2/2] New pthread rwlock that is more scalable.
- From: Torvald Riegel <triegel at redhat dot com>
- To: GLIBC Devel <libc-alpha at sourceware dot org>
- Cc: "Carlos O'Donell" <codonell at redhat dot com>, "Kleen, Andi" <andi dot kleen at intel dot com>
- Date: Sat, 31 Dec 2016 18:18:58 +0100
- Subject: Re: [PATCH 2/2] New pthread rwlock that is more scalable.
- References: <1469655533.19224.27.camel@localhost.localdomain> <1469655868.19224.34.camel@localhost.localdomain>
On Wed, 2016-07-27 at 23:44 +0200, Torvald Riegel wrote:
> This replaces the pthread rwlock with a new implementation that uses a
> more scalable algorithm (primarily through not using a critical section
> anymore to make state changes). The fast path for rdlock acquisition
> and release is now basically a single atomic read-modify-write or CAS
> and a few branches. See nptl/pthread_rwlock_common.c for details.
To illustrate this, here are a few throughput measurements for rdlock
scalability, which is arguably the primary workload for a rwlock. Each
thread rdlocks and unlocks the same shared rwlock in a tight loop; the
more iterations per second, the better. This is basically the worst
case for rdlock because it creates a lot of contention on the internal
synchronization metadata: for the old rwlock, this is the internal
exclusive lock; for the new rwlock, it is the reference counter.
I ran these on my laptop, an i7-5600U.
RHEL7 glibc (old algorithm, no lock elision):
threads=1 iter/s= 30323012 timing_t=12970178479
threads=2 iter/s= 7741364 timing_t=12970177016
threads=3 iter/s= 6409745 timing_t=12970193823
threads=4 iter/s= 7699429 timing_t=12970190703
new rwlock, no lock elision:
threads=1 iter/s= 45691950 timing_t=12969859177
threads=2 iter/s= 12746855 timing_t=12969861887
threads=3 iter/s= 7611965 timing_t=12969892316
threads=4 iter/s= 9540106 timing_t=12994081288
glibc master (i.e., the old rwlock), built with --enable-lock-elision:
threads=1 iter/s= 42149798 timing_t=12970179413
threads=2 iter/s= 81666730 timing_t=12970174752
threads=3 iter/s= 103935069 timing_t=12970181048
threads=4 iter/s= 85825402 timing_t=12970180466
The new rwlock still suffers from contention with empty reader
critical sections, but significantly less so than the old algorithm.
Lock elision obviously helps here because the transactions are as small
as possible (i.e., no user code). However, lock elision is not without
cost: single-thread throughput is a little lower than with the new
algorithm (I haven't tested lock elision on top of the new algorithm,
but I guess the overhead would not be much different).
When adding actual work, the difference in scalability between the new
and the old algorithm becomes more obvious. Each thread now performs R
random reads to 1MB of integers in thread-private memory between rdlock
sections and, within the rdlock sections, R random reads to 1KB of
memory shared by all threads (where the rwlock instance resides too).
The RNG is Marsaglia's xorshift.
The new rwlock's scalability becomes flat when setting R to about 20 to
25 (because this decreases contention on the rwlock-internal
synchronization metadata).
New rwlock with R=25:
threads=1 iter/s= 8942220 timing_t=12969986360
threads=2 iter/s= 10000033 timing_t=12969991729
threads=3 iter/s= 10544801 timing_t=12969994877
threads=4 iter/s= 10122191 timing_t=12993794742
Old rwlock with R=25:
threads=1 iter/s= 8681897 timing_t=12969978511
threads=2 iter/s= 3803653 timing_t=12969987348
threads=3 iter/s= 4266642 timing_t=12969992764
threads=4 iter/s= 4293744 timing_t=12970267963
When setting R to 60, the old rwlock starts to scale beyond 2 threads:
Old rwlock with R=60:
threads=1 iter/s= 3742926 timing_t=12969952203
threads=2 iter/s= 5654611 timing_t=12969956795
threads=3 iter/s= 4203466 timing_t=12969956780
threads=4 iter/s= 4553678 timing_t=12969957191
New rwlock with R=60:
threads=1 iter/s= 3778624 timing_t=12969951167
threads=2 iter/s= 5704772 timing_t=12969953805
threads=3 iter/s= 6765092 timing_t=12969955128
threads=4 iter/s= 7437057 timing_t=12969985198
Old rwlock with lock elision and R=60:
threads=1 iter/s= 3157614 timing_t=12969969763
threads=2 iter/s= 6032500 timing_t=12969969857
threads=3 iter/s= 7497510 timing_t=12969975510
threads=4 iter/s= 8421999 timing_t=12969977476
As you can see, the new rwlock scales much better beyond 2 threads. The
numbers depend quite a bit on things we currently don't do effectively
anywhere in glibc: spinning (though that is not relevant for the new
rwlock in this benchmark) and back-off. Improving both our mutexes (for
the internal mutex in the old rwlock) and the back-off in some of the
CAS loops in the new rwlock would probably improve scalability somewhat
for both the old and the new rwlock. However, even with a highly
scalable mutex, a short critical section will not be faster than a
single atomic operation such as a simple CAS or an atomic increment of
a reference counter.
Lock elision still performs best beyond 1 thread, but is even worse at
a single thread.
Finally, lock elision isn't generally a win. If we just add 3 writes
(W) to random integers in the 1KB of shared data in the rdlock sections,
then glibc master's lock elision performs worse than just the old rwlock:
Old rwlock with lock elision and with R=60 and shared W=3:
threads=1 iter/s= 2942579 timing_t=12969979390
threads=2 iter/s= 3156532 timing_t=12969978902
threads=3 iter/s= 3372049 timing_t=12969977253
threads=4 iter/s= 3389924 timing_t=12969983496
Old rwlock with R=60 and shared W=3:
threads=1 iter/s= 3401302 timing_t=12969985299
threads=2 iter/s= 3857156 timing_t=12969982983
threads=3 iter/s= 4036774 timing_t=12969979487
threads=4 iter/s= 4044128 timing_t=12969979431
The lock elision adaptation code should keep lock elision from
performing significantly worse than the base algorithm, but apparently
this doesn't work in practice (at least not for this workload -- though
it was the first configuration with writes that I tried, so I didn't
exactly have to hunt for a counterexample). One could argue that rdlock
sections shouldn't write at all, but nothing prevents a program from
issuing writes in a rdlock section, and false sharing can happen easily
in practice.
For comparison, the new rwlock with R=60 and shared W=3:
threads=1 iter/s= 3597365 timing_t=12970014905
threads=2 iter/s= 4924836 timing_t=12969993554
threads=3 iter/s= 5504998 timing_t=12969987053
threads=4 iter/s= 6311467 timing_t=12969983615
Therefore, I think the right thing to do is to not enable lock elision
for the new rwlock, at least in the sense of not simply taking over the
current adaptation mechanism and tuning parameters. We need more
performance data and benchmarks / regression testing before we can
enable elision and be confident that we won't degrade performance.
Cases like the one above (i.e., the old rwlock with vs. without lock
elision) shouldn't happen.
Happy New Year!