This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On Tue, 2017-03-14 at 16:55 +0100, Stefan Liebler wrote:
> +/* This is equal to 1 if atomic_exchange is implemented by a CAS loop
> +   and 0 if atomic_exchange is implemented with an instruction faster than
> +   CAS.  */

I'd change this to:
"ATOMIC_EXCHANGE_USES_CAS is equal to 1 if atomic_exchange operations
are implemented based on a CAS loop; otherwise, this is 0 and we assume
that the atomic_exchange operations could provide better performance
than a CAS loop."

> +#if ! defined ATOMIC_EXCHANGE_USES_CAS || ! (ATOMIC_EXCHANGE_USES_CAS == 0 \
> +					     || ATOMIC_EXCHANGE_USES_CAS == 1)
> +# error ATOMIC_EXCHANGE_USES_CAS has to be defined to 0 or 1
> +#endif
> +
>  #endif	/* atomic.h */
> +

Extra new line.

> diff --git a/nptl/pthread_spin_lock.c b/nptl/pthread_spin_lock.c
> index 4d03b78..c2fb8e7 100644
> --- a/nptl/pthread_spin_lock.c
> +++ b/nptl/pthread_spin_lock.c
> @@ -19,27 +19,38 @@
>  #include <atomic.h>
>  #include "pthreadP.h"
>  
> -/* A machine-specific version can define SPIN_LOCK_READS_BETWEEN_CMPXCHG
> -  to the number of plain reads that it's optimal to spin on between uses
> -  of atomic_compare_and_exchange_val_acq.  If spinning forever is optimal
> -  then use -1.  If no plain reads here would ever be optimal, use 0.  */
> -#ifndef SPIN_LOCK_READS_BETWEEN_CMPXCHG
> -# warning machine-dependent file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG
> -# define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> -#endif
> -
>  int
>  pthread_spin_lock (pthread_spinlock_t *lock)
>  {
> +  int val = 0;
> +
>    /* atomic_exchange usually takes less instructions than
>       atomic_compare_and_exchange.  On the other hand,
>       atomic_compare_and_exchange potentially generates less bus traffic
>       when the lock is locked.

I'd remove these sentences, and rather come back to this ...

> -     We assume that the first try mostly will be successful, and we use
> -     atomic_exchange.  For the subsequent tries we use
> -     atomic_compare_and_exchange.  */
> -  if (atomic_exchange_acq (lock, 1) == 0)
> +     We assume that the first try mostly will be successful, thus we use
> +     atomic_exchange if it is not implemented by a CAS loop.  Otherwise

.. here.  Change to:

We assume that the first try mostly will be successful, thus we use
atomic_exchange if it is not implemented by a CAS loop (we also assume
that atomic_exchange can be faster if it succeeds, see
ATOMIC_EXCHANGE_USES_CAS).  Otherwise,

> +     we use a weak CAS and not an exchange so we bail out after the first
> +     failed attempt to change the state.
> +     For the subsequent tries we use atomic_compare_and_exchange after we
> +     observe a not acquired lock.

s/subsequent tries/subsequent attempts,/
s/a not acquired lock/that the lock is not acquired/

> +     See also comment in pthread_spin_trylock.

Also see the comment ...

> +     We use acquire MO to synchronize-with the release MO store in
> +     pthread_spin_unlock, and thus ensure that prior critical sections
> +     happen-before this critical section.  */
> +#if ATOMIC_EXCHANGE_USES_CAS == 0

We make this a binary macro, so #if !... or #if != 0 is fine.

> +  /* Try to acquire the lock with an exchange instruction as this architecture
> +     has such an instruction and we assume it is faster than a CAS.
> +     The acquisition succeeds if the lock had not been acquired before.  */

change to: ... if the lock is not in an acquired state.
(As written, the comment says that the lock had never been acquired before,
which is not what is meant.)

> +  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
> +    return 0;
> +#elif ATOMIC_EXCHANGE_USES_CAS == 1

just #else
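
I.e., with both changes the block would look roughly like this (just a
sketch of the structure; the comment rewording from above is omitted):

#if ! ATOMIC_EXCHANGE_USES_CAS
  /* ... use the exchange instruction ...  */
  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
    return 0;
#else
  /* ... use a single weak CAS ...  */
  if (__glibc_likely (atomic_compare_exchange_weak_acquire (lock, &val, 1)))
    return 0;
#endif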

> +  /* Try to acquire the lock with a CAS instruction as this architecture
> +     has no exchange instruction.  The acquisition succeeds if the lock is not
> +     acquired.  */
> +  if (__glibc_likely (atomic_compare_exchange_weak_acquire (lock, &val, 1)))
>      return 0;
> +#endif
>  
>    do
>      {
> @@ -47,23 +58,24 @@ pthread_spin_lock (pthread_spinlock_t *lock)
>  	 to cmpxchg is not a good idea on many targets as that will force
>  	 expensive memory synchronizations among processors and penalize other
>  	 running threads.
> -	 On the other hand, we do want to update memory state on the local core
> -	 once in a while to avoid spinning indefinitely until some event that
> -	 will happen to update local memory as a side-effect.  */
> -      if (SPIN_LOCK_READS_BETWEEN_CMPXCHG >= 0)
> +	 There is no technical reason for throwing in a CAS every now and then,
> +	 and so far we have no evidence that it can improve performance.
> +	 If that would be the case, we have to adjust other spin-waiting loops
> +	 elsewhere, too!
> +	 Thus we use relaxed MO reads until we observe the lock to not be
> +	 acquired anymore.  */
> +      do
>  	{
> -	  int wait = SPIN_LOCK_READS_BETWEEN_CMPXCHG;

Please add this comment here (we've been adding it elsewhere too, in
similar cases):
/* TODO Back-off.  */
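
(For reference, if back-off is added later, the loop below might end up
looking roughly like this; MAX_BACKOFF is a made-up tuning constant here,
not something that exists in glibc, and this is just a sketch, not part of
the requested change:)

      /* TODO Back-off.  Illustrative sketch only; MAX_BACKOFF is a
         hypothetical tuning constant.  */
      int backoff = 1;
      do
        {
          /* Pause for an increasing number of iterations before reloading.  */
          for (int i = 0; i < backoff; i++)
            atomic_spin_nop ();
          if (backoff < MAX_BACKOFF)
            backoff *= 2;
          val = atomic_load_relaxed (lock);
        }
      while (val != 0);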

> +	  atomic_spin_nop ();
>  
> -	  while (*lock != 0 && wait > 0)
> -	    --wait;
> -	}
> -      else
> -	{
> -	  while (*lock != 0)
> -	    ;
> +	  val = atomic_load_relaxed (lock);
>  	}
> +      while (val != 0);
> +
> +      /* We need acquire memory order here for the same reason as mentioned
> +	 for the first try to lock the spinlock.  */
>      }
> -  while (atomic_compare_and_exchange_val_acq (lock, 1, 0) != 0);
> +  while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));
>  
>    return 0;
>  }
> diff --git a/nptl/pthread_spin_trylock.c b/nptl/pthread_spin_trylock.c
> index 593bba3..a3e9e44 100644
> --- a/nptl/pthread_spin_trylock.c
> +++ b/nptl/pthread_spin_trylock.c
> @@ -23,5 +23,51 @@
>  int
>  pthread_spin_trylock (pthread_spinlock_t *lock)
>  {
> -  return atomic_exchange_acq (lock, 1) ? EBUSY : 0;
> +  /* For the spin try lock, we have the following possibilities:
> +
> +     1) If we assume that trylock will most likely succeed in practice:
> +     * We just do an exchange.
> +
> +     2) If we want to bias towards cases where trylock succeeds, but don't
> +     rule out contention:
> +     * If exchange is not implemented by a CAS loop, and exchange is faster
> +     than CAS, do an exchange.
> +     * If exchange is implemented by a CAS loop, use a weak CAS and not an
> +     exchange so we bail out after the first failed attempt to change the state.
> +
> +     3) If we expect contention to be likely:
> +     * If CAS always brings the cache line into an exclusive state even if the
> +     spinlock is already acquired, then load the value first with
> +     atomic_load_relaxed and test if lock is not acquired. Then do 2).
> +
> +     We prefer case 2) as some of the code out there that uses trylock just
> +     uses it to prevent deadlock (e.g., when trying to acquire multiple locks
> +     at once), whereas other code uses spin_trylock to implement their own
> +     back-off and bounded spinning.
> +     Our assumption is that lock acquisitions, including the former use case
> +     for trylock, are supposed to succeed in the common case.
> +     Thus we choose case 2) as it uses the faster exchange instruction if the
> +     architecture has such an instruction.  For those architectures case 2)
> +     is the same as 1).  On the other architectures we are now using one CAS
> +     with zero as expected value instead of a CAS loop to exchange to one.  */

I'd just say that we assume that 2) is the common case, and that this
won't be slower than 1) in the common case.
We'd have to rephrase the other sentences too much to make them easy to
understand, and I don't think they are really necessary.
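
(For comparison, case 3) would look roughly like this; again just a
sketch, not what the patch should do:)

  /* Case 3) sketch: check the lock with a relaxed MO load first and only
     try to acquire it if it is not in an acquired state.  */
  if (atomic_load_relaxed (lock) == 0)
    {
      int val = 0;
      if (atomic_compare_exchange_weak_acquire (lock, &val, 1))
        return 0;
    }
  return EBUSY;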

> +
> +  /* We use acquire MO to synchronize-with the release MO store in
> +     pthread_spin_unlock, and thus ensure that prior critical sections
> +     happen-before this critical section.  */
> +#if ATOMIC_EXCHANGE_USES_CAS == 0
> +  /* Try to acquire the lock with an exchange instruction as this architecture
> +     has such an instruction and we assume it is faster than a CAS.
> +     The acquisition succeeds if the lock had not been acquired before.  */

See above.

> +  if (atomic_exchange_acquire (lock, 1) == 0)
> +    return 0;
> +#elif ATOMIC_EXCHANGE_USES_CAS == 1

See above.

> +  /* Try to acquire the lock with a CAS instruction as this architecture
> +     has no exchange instruction.  The acquisition succeeds if the lock is not
> +     acquired.  */
> +  int val = 0;
> +  if (atomic_compare_exchange_weak_acquire (lock, &val, 1))
> +    return 0;

Here is where we may address the spurious failure issue:

do
  {
    /* Make sure we always expect a lock that is not acquired.  */
    int val = 0;
    if (atomic_compare_exchange_weak_acquire (lock, &val, 1))
      return 0;
  }
/* atomic_compare_exchange_weak_acquire can fail spuriously.  Whereas
   C++11 and C11 make it clear that trylock operations can fail
   spuriously, POSIX does not explicitly specify this; it only specifies
   that failing synchronization operations do not need to have
   synchronization effects themselves, but a spurious failure is
   something that could contradict a happens-before established earlier
   (e.g., that we need to observe that the lock is acquired).  Therefore,
   we emulate a strong CAS by simply checking with a relaxed MO load that
   the lock is really acquired before returning EBUSY; the additional
   overhead this may cause is on the slow path.  */
while (atomic_load_relaxed (lock) == 0);
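
(Just for context, the release MO store in pthread_spin_unlock that these
acquire MO operations synchronize with is conceptually just:)

  /* Conceptually what pthread_spin_unlock does: a release MO store that
     makes the critical section happen-before subsequent acquirers.  */
  atomic_store_release (lock, 0);
  return 0;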

> +#endif
> +
> +  return EBUSY;
>  }
> diff --git a/sysdeps/aarch64/atomic-machine.h b/sysdeps/aarch64/atomic-machine.h
> index a5d2213..eb59a5b 100644
> --- a/sysdeps/aarch64/atomic-machine.h
> +++ b/sysdeps/aarch64/atomic-machine.h
> @@ -38,6 +38,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 1
>  #define USE_ATOMIC_COMPILER_BUILTINS 1
> +#define ATOMIC_EXCHANGE_USES_CAS 0

I'll ask Szabolcs again about this.  We may need a comment here
explaining the choice.

>  
>  /* Compare and exchange.
>     For all "bool" routines, we return FALSE if exchange succesful.  */
> diff --git a/sysdeps/alpha/atomic-machine.h b/sysdeps/alpha/atomic-machine.h
> index 06e93f2..e55ecdd 100644
> --- a/sysdeps/alpha/atomic-machine.h
> +++ b/sysdeps/alpha/atomic-machine.h
> @@ -44,6 +44,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 1
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

This should be 1, I believe.

>  
>  #ifdef UP
> diff --git a/sysdeps/arm/atomic-machine.h b/sysdeps/arm/atomic-machine.h
> index eeac7f0..2556438 100644
> --- a/sysdeps/arm/atomic-machine.h
> +++ b/sysdeps/arm/atomic-machine.h
> @@ -35,6 +35,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

This should be 1.  Szabolcs?

>  
>  void __arm_link_error (void);
>  
> diff --git a/sysdeps/microblaze/atomic-machine.h b/sysdeps/microblaze/atomic-machine.h
> index dc5309c..3a9b58f 100644
> --- a/sysdeps/microblaze/atomic-machine.h
> +++ b/sysdeps/microblaze/atomic-machine.h
> @@ -37,6 +37,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

1, not 0.

Have you actually been looking at these?  The next line in the file is a
pretty obvious hint that this is an LL/SC machine, and atomic_exchange
isn't defined anywhere:

>  /* Microblaze does not have byte and halfword forms of load and reserve and
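
(In that case, the generic include/atomic.h has to fall back to
implementing atomic_exchange_acq with a CAS loop, roughly along these
lines; this is a sketch of the fallback, not the literal glibc code:)

  /* Rough sketch of the generic atomic_exchange_acq fallback that is used
     when the architecture does not provide its own definition.  */
  __typeof (*mem) oldval;
  do
    oldval = *mem;
  while (atomic_compare_and_exchange_bool_acq (mem, newvalue, oldval));
  /* oldval is the value that was previously in *mem.  */
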
> diff --git a/sysdeps/mips/atomic-machine.h b/sysdeps/mips/atomic-machine.h
> index 54c182b..3d9da0c 100644
> --- a/sysdeps/mips/atomic-machine.h
> +++ b/sysdeps/mips/atomic-machine.h
> @@ -50,6 +50,8 @@ typedef uintmax_t uatomic_max_t;
>  #define __HAVE_64B_ATOMICS 1
>  #endif
>  
> +#define ATOMIC_EXCHANGE_USES_CAS 0
> +

Please ask the MIPS maintainers to review this.

>  /* See the comments in <sys/asm.h> about the use of the sync instruction.  */
>  #ifndef MIPS_SYNC
>  # define MIPS_SYNC	sync
> diff --git a/sysdeps/powerpc/powerpc32/atomic-machine.h b/sysdeps/powerpc/powerpc32/atomic-machine.h
> index 20d5e85..8f4407b 100644
> --- a/sysdeps/powerpc/powerpc32/atomic-machine.h
> +++ b/sysdeps/powerpc/powerpc32/atomic-machine.h
> @@ -35,6 +35,7 @@
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

1

>  
>  /*
>   * The 32-bit exchange_bool is different on powerpc64 because the subf
> diff --git a/sysdeps/powerpc/powerpc64/atomic-machine.h b/sysdeps/powerpc/powerpc64/atomic-machine.h
> index 40c308e..9c4a55b 100644
> --- a/sysdeps/powerpc/powerpc64/atomic-machine.h
> +++ b/sysdeps/powerpc/powerpc64/atomic-machine.h
> @@ -35,6 +35,7 @@
>  
>  #define __HAVE_64B_ATOMICS 1
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

1

>  
>  /* The 32-bit exchange_bool is different on powerpc64 because the subf
>     does signed 64-bit arithmetic while the lwarx is 32-bit unsigned
> diff --git a/sysdeps/sparc/sparc32/atomic-machine.h b/sysdeps/sparc/sparc32/atomic-machine.h
> index acd029e..9b2eb47 100644
> --- a/sysdeps/sparc/sparc32/atomic-machine.h
> +++ b/sysdeps/sparc/sparc32/atomic-machine.h
> @@ -49,6 +49,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

I believe this should be 1.

> 
>  /* We have no compare and swap, just test and set.
> diff --git a/sysdeps/sparc/sparc32/sparcv9/atomic-machine.h b/sysdeps/sparc/sparc32/sparcv9/atomic-machine.h
> index 7f7895e..c3486b4 100644
> --- a/sysdeps/sparc/sparc32/sparcv9/atomic-machine.h
> +++ b/sysdeps/sparc/sparc32/sparcv9/atomic-machine.h
> @@ -46,6 +46,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

Likewise.

>  
>  #define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> diff --git a/sysdeps/sparc/sparc64/atomic-machine.h b/sysdeps/sparc/sparc64/atomic-machine.h
> index 44a43ff..21ef009 100644
> --- a/sysdeps/sparc/sparc64/atomic-machine.h
> +++ b/sysdeps/sparc/sparc64/atomic-machine.h
> @@ -46,6 +46,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 1
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

Likewise.

>  
>  #define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> diff --git a/sysdeps/tile/tilegx/atomic-machine.h b/sysdeps/tile/tilegx/atomic-machine.h
> index 6345251..e77f670 100644
> --- a/sysdeps/tile/tilegx/atomic-machine.h
> +++ b/sysdeps/tile/tilegx/atomic-machine.h
> @@ -31,6 +31,7 @@
>  #endif
>  
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

Ask the tile maintainers.  I assume this could be either 1 or 0, because
both go into the kernel.

>  
>  /* Pick appropriate 8- or 4-byte instruction. */
>  #define __atomic_update(mem, v, op)                                     \
> diff --git a/sysdeps/tile/tilepro/atomic-machine.h b/sysdeps/tile/tilepro/atomic-machine.h
> index 33a8b85..45e36de 100644
> --- a/sysdeps/tile/tilepro/atomic-machine.h
> +++ b/sysdeps/tile/tilepro/atomic-machine.h
> @@ -23,6 +23,7 @@
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 0

Likewise.

>  
>  /* 32-bit integer compare-and-exchange. */
>  static __inline __attribute__ ((always_inline))
> diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> index 2cd2235..9adcab7 100644
> --- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> +++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> @@ -38,6 +38,7 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
> +#define ATOMIC_EXCHANGE_USES_CAS 1

Same as for tile.

>  
>  /* prev = *addr;
>     if (prev == old)

Otherwise, this patch looks good to me.


