Re: arch maintainers: RFC on spinlock refactoring


On 2/18/2017 1:23 PM, Torvald Riegel wrote:
> We want to refactor the spinlock code (ie, pthread_spin* functions); the
> goal is to improve the generic implementation and use it for as many
> architectures as possible.  Stefan Liebler has started this work on the
> s390 side.
>
> I'd like to have input on some arch-specific questions from some of you.
>
> Roughly, I think we should
> * Finish Stefan's first patch that brings the generic spinlock code into
> a good shape.
> * Add some properties of atomic operations to each arch's
> atomic-machine.h; for now, we're considering 1) whether atomic
> read-modify-writes are implemented through CAS and 2) whether a CAS
> always brings in the cache line in an exclusive state (ie, even when the
> CAS would fail).
> * Move as many architectures as we can without detailed review to the
> generic spinlock.
> * Integrate the spinlock code of the remaining archs; this may require
> either extending the generic spinlock code (eg, if the arch comes with
> more advanced spinning) or improving the atomics of those archs.
> * Improve the back-off strategy; use microbenchmarks to justify changes.
>
> [...]
>
> tile:
> * Contains custom back-off implementation.  We should discuss this
> separately.
> * Does CAS always bring in the cache line in exclusive state?  I guess
> it doesn't?
> * I suppose atomic_exchange can't be replaced with atomic_store?  Should
> this be a flag so that this can be included in the generic code?
> * What about trylock?  Keep it or use generic?
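
For concreteness, the two properties above could be expressed as
per-arch feature macros in each atomic-machine.h, which the generic
spinlock then tests.  A minimal sketch, with illustrative macro names
(not necessarily what the final patch will use):

/* Hypothetical flags in sysdeps/<arch>/atomic-machine.h.  */

/* 1 if atomic_exchange and the other read-modify-write operations
   are implemented with a CAS loop rather than a native instruction.  */
#define ATOMIC_EXCHANGE_USES_CAS 0

/* 1 if a CAS brings the cache line in in exclusive state even when
   the comparison fails.  */
#define ATOMIC_CAS_ALWAYS_EXCLUSIVE 0

/* One way the generic pthread_spin_lock could use the second flag,
   written with glibc's internal atomics from <atomic.h>.  */
static void
generic_spin_lock (int *lock)
{
#if ATOMIC_CAS_ALWAYS_EXCLUSIVE
  /* A failing CAS would steal the line from the lock holder, so spin
     on a relaxed load and attempt the CAS only when the lock looks
     free (test-and-test-and-set).  */
  do
    {
      while (atomic_load_relaxed (lock) != 0)
        atomic_spin_nop ();
    }
  while (atomic_compare_and_exchange_bool_acq (lock, 1, 0));
#else
  /* A failing CAS does not disturb the owner's cache line here, so
     retrying the CAS directly while spinning is acceptable.  */
  while (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
    atomic_spin_nop ();
#endif
}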

I'm happy to help think about this stuff for tile.  Sorry for the
delay since I'm having trouble catching up from February break
vacation with the kids :)

The back-off implementation was driven by our benchmarking with 64+
core systems, where we found that bounded exponential backoff was the
best way to ensure rapid overall system progress in the face of
relatively heavy locking load.  Tile's mesh interconnect carries
atomic requests from cores to the L3 cache, which is distributed
per-core.  That makes exponential backoff even more important, since
the mesh gives preferential service to nearer cores over farther
ones, and without backoff the farther cores can be starved.
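
In outline, the acquisition loop looks something like the following;
this is only a sketch using glibc's internal atomics, and the delay
constants here are made up (the real ones were tuned by benchmarking):

/* Bounded exponential backoff around an exchange-based lock
   acquisition.  */
static void
spin_lock_with_backoff (int *lock)
{
  unsigned int delay = 16;              /* initial backoff; illustrative */
  const unsigned int max_delay = 2048;  /* bound keeps waiters responsive */

  while (atomic_exchange_acq (lock, 1) != 0)
    {
      /* Wait before retrying so repeated atomic requests from many
         cores don't flood the mesh and starve the distant cores.  */
      for (unsigned int i = 0; i < delay; i++)
        atomic_spin_nop ();
      if (delay < max_delay)
        delay *= 2;
    }
}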

The CAS operation does not bring the cache line in at all; instead,
the operation is performed at the remote core's slice of the L3 cache
and the reply message carries just the old memory value.  This is
true for all the atomics on TILE-Gx (CAS, exchange, fetch-and-add,
fetch-and-and, fetch-and-or, etc.).  The architecture does this so
that when lots of atomics are being issued, we don't spend time
invalidating local L2 caches.

I think switching to a generic trylock implementation would be fine;
as you can see, we haven't tried to optimize the tile version.  We
just use an atomic exchange (not pulling in any cache line, as
mentioned above) to try to set the lock state to "1".
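
Concretely, the whole trylock boils down to something like this
sketch (EBUSY from <errno.h> is what pthread_spin_trylock returns on
contention):

#include <errno.h>

/* Exchange-based trylock: on TILE-Gx the exchange executes at the
   remote L3 home and does not pull the line into the local cache.  */
static int
spin_trylock_sketch (int *lock)
{
  return atomic_exchange_acq (lock, 1) == 0 ? 0 : EBUSY;
}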

One thing that's important to note is that the above is all about
TILE-Gx, our more recent 64-bit processor.  TILEPro, the older 32-bit
processor, had only a single atomic operation, a "test and set one",
on top of which everything else is built, using kernel fast syscalls
to provide CAS and various other simple atomic patterns.  I believe
this is similar to the existing sparc32 atomic support.  As a result,
on TILEPro you have to use "atomic exchange" (via the kernel fast
syscall) to implement atomic store, since a plain store can race with
the kernel implementation of atomics.  This isn't true on TILE-Gx,
where an atomic store really can be just a plain store operation.
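
As a sketch of the distinction (assuming __tilepro__, the compiler's
predefined macro for that target):

/* On TILEPro the kernel emulates CAS as load/compare/store, so a
   concurrent plain store can be lost; routing stores through the
   same kernel-assisted exchange serializes them with the emulated
   atomics.  On TILE-Gx a plain store is already safe.  */
static void
atomic_store_sketch (int *mem, int val)
{
#ifdef __tilepro__
  (void) atomic_exchange_rel (mem, val);  /* kernel fast syscall path */
#else
  atomic_store_relaxed (mem, val);        /* plain store underneath */
#endif
}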

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

