This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On Wed, 2017-02-15 at 10:35 +0100, Stefan Liebler wrote:
> On 02/13/2017 09:29 PM, Torvald Riegel wrote:
> > Thanks for working on this.  Detailed comments below.
> >
> > Generally, I think we need to keep working on optimizing this (this can
> > happen in follow-up patches).  For example, we should have
> > microbenchmarks for the exchange vs. CAS choice.
> >
> > Also, some of the tuning knobs such as
> > SPIN_TRYLOCK_USE_CMPXCHG_INSTEAD_OF_XCHG apply to atomics in general and
> > not just the spinlock.  I'd prefer if these were in a state in which we
> > could add them as a property that's part of the atomics interface.
> >
> > On Wed, 2017-02-08 at 15:49 +0100, Stefan Liebler wrote:
> >> diff --git a/include/atomic.h b/include/atomic.h
> >> index 7f32640..770db4a 100644
> >> --- a/include/atomic.h
> >> +++ b/include/atomic.h
> >> @@ -54,7 +54,7 @@
> >>     and following args.  */
> >>  #define __atomic_val_bysize(pre, post, mem, ...)			      \
> >>    ({									      \
> >> -    __typeof (*mem) __atg1_result;					      \
> >> +    __typeof ((__typeof (*(mem))) *(mem)) __atg1_result;		      \
> >>      if (sizeof (*mem) == 1)						      \
> >>        __atg1_result = pre##_8_##post (mem, __VA_ARGS__);		      \
> >>      else if (sizeof (*mem) == 2)					      \
> >> @@ -162,9 +162,9 @@
> >>  /* Store NEWVALUE in *MEM and return the old value.  */
> >>  #ifndef atomic_exchange_acq
> >>  # define atomic_exchange_acq(mem, newvalue) \
> >> -  ({ __typeof (*(mem)) __atg5_oldval;					      \
> >> +  ({ __typeof ((__typeof (*(mem))) *(mem)) __atg5_oldval;		      \
> >>       __typeof (mem) __atg5_memp = (mem);				      \
> >> -     __typeof (*(mem)) __atg5_value = (newvalue);			      \
> >> +     __typeof ((__typeof (*(mem))) *(mem)) __atg5_value = (newvalue);	      \
> >>  									      \
> >>       do									      \
> >>         __atg5_oldval = *__atg5_memp;					      \
> >> @@ -668,7 +668,7 @@ void __atomic_link_error (void);
> >>
> >>  # ifndef atomic_load_relaxed
> >>  #  define atomic_load_relaxed(mem) \
> >> -   ({ __typeof (*(mem)) __atg100_val;					      \
> >> +   ({ __typeof ((__typeof (*(mem))) *(mem)) __atg100_val;		      \
> >>     __asm ("" : "=r" (__atg100_val) : "0" (*(mem)));			      \
> >>     __atg100_val; })
> >>  # endif
> >
> > You could keep these changes, but it's a bit odd because you only apply
> > them for the functions you needed them for.  I think it would be better
> > to just remove the volatile qualification in the caller (i.e., cast lock
> > to nonvolatile in pthread_spin_lock etc.).
> >
> Yes, I've only applied them for the functions needed in the spinlock code.
> Removing volatile from the type pthread_spinlock_t is not an option, and
> casting the volatile pointer to a non-volatile pointer is undefined.
> See comment from Florian:
> On 12/16/2016 09:12 PM, Florian Weimer wrote:
> > That's undefined:
> >
> > “If an attempt is made to refer to an object defined with a
> > volatile-qualified type through use of an lvalue with
> > non-volatile-qualified type, the behavior is undefined.”
> >
> > But we cannot drop the volatile qualifier from the definition of
> > pthread_spinlock_t because it would change the C++ mangling of
> > pthread_spinlock_t * and similar types.

Generally, I wouldn't agree with Florian's comment.  However, all we are
interested in is the storage behind the volatile-qualified type.  Users
aren't allowed to access the object through means other than the
pthread_spin* functions, so if we cast everywhere, we'll never have any
volatile-qualified accesses.  I believe none of the architectures we
support has weaker alignment requirements for volatile-qualified types
than for non-volatile-qualified ones, so I can't see any problem in
practice with the cast.
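
Just to illustrate what I mean with the cast, a minimal sketch using
trylock as the simplest case (this assumes glibc's internal <atomic.h>
macros; the local name lockp is just for illustration, and this is not
meant as the actual patch):

#include <errno.h>

int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
  /* pthread_spinlock_t is a volatile int; cast the pointer once and use
     only the non-volatile pointer afterwards, so that no
     volatile-qualified access is ever performed.  */
  int *lockp = (int *) lock;

  return atomic_exchange_acquire (lockp, 1) == 0 ? 0 : EBUSY;
}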


> >>
> >> +  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
> >>      return 0;
> >>
> >> +  int val;
> >>    do
> >>      {
> >>        /* The lock is contended and we need to wait.  Going straight back
> >> @@ -50,20 +53,39 @@ pthread_spin_lock (pthread_spinlock_t *lock)
> >>  	 On the other hand, we do want to update memory state on the local core
> >>  	 once in a while to avoid spinning indefinitely until some event that
> >>  	 will happen to update local memory as a side-effect.  */
> >> -      if (SPIN_LOCK_READS_BETWEEN_CMPXCHG >= 0)
> >> +
> >> +#if SPIN_LOCK_READS_BETWEEN_CMPXCHG >= 0
> >> +      /* Use at most SPIN_LOCK_READS_BETWEEN_CMPXCHG plain reads between the
> >> +	 atomic compare and exchanges.  */
> >>
> Also changed this comment to:
> /* Use at most SPIN_LOCK_READS_BETWEEN_CMPXCHG relaxed MO reads between
>     the atomic compare and exchanges.  */

Thanks.

> > A while ago we tried hard to remove all code that would fail silently
> > when a macro had a typo in the name, for example.  We still have a few
> > of them left (e.g., in the atomics), but I think this here doesn't
> > warrant an exception.
> >
> Okay. You're right.
> 
> In a comment on patch 2/2, you mentioned a header where an architecture
> shall define those parameters.
> Thus I've added the file nptl/pthread_spin_parameters.h.

I think the stub version should be in sysdeps/ but the wiki is missing
this information (default ENOTSUP location):
https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location

I think we should just get rid of SPIN_LOCK_READS_BETWEEN_CMPXCHG, and
make some choice for SPIN_TRYLOCK_LOAD_AND_TEST_BEFORE_XCHG.  There is
no technical reason for throwing in a CAS every now and then, and so far
we have no evidence that it can improve performance (if that were the
case, we'd need to think harder about it, because we have spin-waiting
loops elsewhere that don't want to modify the value after waiting for a
change of the value).

With that applied you don't have to solve the header problem in this
patch.
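
To make the suggestion concrete, here is a rough sketch of the generic
lock function without SPIN_LOCK_READS_BETWEEN_CMPXCHG (plain relaxed MO
reads while waiting, then a weak acquire CAS; this ignores the volatile
question discussed above and is not meant as the final code):

int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  int val;

  /* Assume the lock is mostly free; try to acquire it right away.  */
  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
    return 0;

  do
    {
      /* Spin using relaxed MO reads only until the lock appears to be
	 free; this avoids unnecessary cache traffic while waiting.  */
      do
	val = atomic_load_relaxed (lock);
      while (val != 0);

      /* Try to acquire the lock again; a weak CAS is sufficient because
	 we simply retry on failure.  */
    }
  while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

  return 0;
}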

I'll write a separate email with suggestions about how to refactor the
spinlock code project-wide.  Also, some more comments on the tuning
parameters below.

I'm still a bit undecided about how to deal with other architectures'
pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
and then include the generic pthread_spin_lock.c.  Maybe it's best if we
just remove them in this patch of yours.

> >>
> >> +#if SPIN_TRYLOCK_USE_CMPXCHG_INSTEAD_OF_XCHG == 1
> >> +  /* Load and test the spinlock and only try to lock the spinlock if it is
> >> +     free.  */
> >> +  int val = atomic_load_relaxed (lock);
> >> +  if (__glibc_likely (val == 0 && atomic_exchange_acquire (lock, 1) == 0))
> >
> > I think that's not quite the same as a choice between CAS and exchange.
> Yes. You are right.
> > Doing a load first could be helpful for *both* CAS and exchange.  OTOH,
> > if we assume that trylock will succeed most of the time, then the load
> > is unnecessary and might be more costly (eg, if it causes the state of
> > the cache line to change twice).
> >
> E.g. on s390, there is no instruction that atomically exchanges the
> value. Instead a CAS loop is used. The CAS instruction locks the
> cache line exclusively regardless of whether the value in memory equals
> the expected old value. Therefore the value is loaded and compared
> before using the CAS instruction.
> If no other CPU has accessed the lock value, the load will change
> the state of the cache line to exclusive.
> If the lock is not acquired, the subsequent CAS instruction does not
> need to change the state of the cache line.
> If another CPU has accessed the lock value, e.g. by acquiring the lock,
> the load will change the state of the cache line to either read-only
> or exclusive, depending on whether the other CPU has already stored
> a new value or not.
> 
> As this behaviour depends on the architecture, the architecture has to
> decide whether it should load and test the lock-value before the 
> atomic-macro.

I agree that there are architecture-specific properties of atomic
operations that have a significant effect on performance.  For s390, you
list the following points:
* atomic exchange (I suppose all atomic read-modify-write ops?) is
implemented through CAS.
* CAS always brings the cache line into an exclusive state.

(Is the second point correct?  I'm not quite sure I understood you
correctly: for example, you write that a load moves a cache line to an
exclusive state, which seems odd.)
 
These properties are specific to the architecture, but they are not
specific to the synchronization code (eg, to spinlocks).  Thus, I think
such settings should be in the atomics headers (i.e., in
atomic-machine.h).
This should probably be a separate patch.  I would propose the following
macros (both are boolean):

/* Nonzero iff all atomic read-modify-write operations (e.g., atomic
   exchange) are implemented using a CAS loop.  */
#define ATOMIC_RMW_USES_CAS 1
/* Nonzero iff CAS always assumes it will store, and thus has cache
   performance effects similar to a store (e.g., it always puts the
   respective cacheline into an exclusive state, even if the comparison
   against the expected value fails).  */
#define ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1

I'm not sure whether there are architectures for which the second macro
would not be true.  It would be good to investigate this; maybe we don't
need to add it at all.



For the spin lock specifically, we have the following possibilities:

1) If we assume that trylock will most likely succeed in practice:
* We just do an exchange.  The properties above don't matter.

2) If we want to bias towards cases where trylock succeeds, but don't
rule out contention:
* If exchange is not a CAS loop, and exchange is faster than CAS, do an
exchange.
* If exchange is a CAS loop, use a weak CAS and not an exchange so we
bail out after the first failed attempt to change the state.

3) If we expect contention to be likely:
* If CAS always brings the cache line into an exclusive state, then load
first.  Then do 2).

Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
buys us anything over 2)?
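
To make 2) and 3) concrete, here is a rough sketch of how trylock could
use the proposed macros (ATOMIC_RMW_USES_CAS and
ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE are just the names suggested above,
nothing we have yet; this is not meant as the actual patch):

#include <errno.h>

int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
#if ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE
  /* 3) A failing CAS would still bring the cache line into an exclusive
     state, so check with a relaxed MO load first to avoid that when the
     lock is already acquired.  */
  if (atomic_load_relaxed (lock) != 0)
    return EBUSY;
#endif
#if ATOMIC_RMW_USES_CAS
  /* 2) Exchange would be a CAS loop anyway; a single weak CAS lets us
     bail out after the first failed attempt to change the state.  */
  int val = 0;
  return atomic_compare_exchange_weak_acquire (lock, &val, 1) ? 0 : EBUSY;
#else
  /* 2) Exchange is a native instruction and presumably cheaper than a
     CAS loop.  */
  return atomic_exchange_acquire (lock, 1) == 0 ? 0 : EBUSY;
#endif
}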


> >> diff --git a/nptl/pthread_spin_unlock.c b/nptl/pthread_spin_unlock.c
> >> index 5fd73e5..f83b696 100644
> >> --- a/nptl/pthread_spin_unlock.c
> >> +++ b/nptl/pthread_spin_unlock.c
> >> @@ -23,7 +23,9 @@
> >>  int
> >>  pthread_spin_unlock (pthread_spinlock_t *lock)
> >>  {
> >> -  atomic_full_barrier ();
> >> -  *lock = 0;
> >> +  /* The atomic_store_release synchronizes-with the atomic_exchange_acquire
> >> +     or atomic_compare_exchange_weak_acquire in pthread_spin_lock /
> >> +     pthread_spin_trylock.  */
> >> +  atomic_store_release (lock, 0);
> >>    return 0;
> >>  }
> >
> > I agree with this change.  However, depending on how one interprets
> > POSIX' memory model, one may expect lock releases to be sequentially
> > consistent.  Nonetheless, IMO, glibc should use only release MO.
> >
> > But we need to announce this change.  Some of us have been considering
> > an additional section in the manual were we specify where we deviate
> > from POSIX; this change might be the first we're listing there (though
> > to be fair, this is a deviation only on some architectures because on
> > others such as powerpc I think, we've been using release MO forever).
> >
> >
> As not all architectures use the generic spinlock implementation, what 
> about a NEWS entry like:
> 
> * The new generic spinlock implementation uses C11-like atomics.
>    The pthread_spin_unlock implementation now uses release
>    memory order instead of sequentially consistent memory order.

That's a good start, but I think we need to be more specific.  We also
don't need to say that we use C11-like atomics because that's an
implementation detail.

I'd add something like this instead:

The synchronization that pthread_spin_unlock performs has been changed
to now be equivalent to a C11 atomic store with release memory order to
the spin lock's memory location.  This ensures correct synchronization
for the spin lock's operations and critical sections protected by a spin
lock.  Previously, several (but not all) architectures used stronger
synchronization (e.g., containing what is often called a full barrier).
This change can improve performance, but may affect odd fringe uses of
spin locks that depend on the previous behavior (e.g., using spin locks
as atomic variables to try to implement Dekker's mutual exclusion
algorithm).

This should highlight the potential change for the weird (ab)uses of
spin locks while also making it clear that it won't affect the
correctness of common code.

