This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On 02/18/2017 05:57 PM, Torvald Riegel wrote:
On Wed, 2017-02-15 at 10:35 +0100, Stefan Liebler wrote:
On 02/13/2017 09:29 PM, Torvald Riegel wrote:
Thanks for working on this.  Detailed comments below.

Generally, I think we need to keep working on optimizing this (this can
happen in follow-up patches).  For example, we should have
microbenchmarks for the exchange vs. CAS choice.

Also, some of the tuning knobs such as
SPIN_TRYLOCK_USE_CMPXCHG_INSTEAD_OF_XCHG apply to atomics in general and
not just the spinlock.  I'd prefer it if these were in a state in which
we could add them as a property that's part of the atomics interface.

On Wed, 2017-02-08 at 15:49 +0100, Stefan Liebler wrote:
diff --git a/include/atomic.h b/include/atomic.h
index 7f32640..770db4a 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -54,7 +54,7 @@
    and following args.  */
 #define __atomic_val_bysize(pre, post, mem, ...)			      \
   ({									      \
-    __typeof (*mem) __atg1_result;					      \
+    __typeof ((__typeof (*(mem))) *(mem)) __atg1_result;		      \
     if (sizeof (*mem) == 1)						      \
       __atg1_result = pre##_8_##post (mem, __VA_ARGS__);		      \
     else if (sizeof (*mem) == 2)					      \
@@ -162,9 +162,9 @@
 /* Store NEWVALUE in *MEM and return the old value.  */
 #ifndef atomic_exchange_acq
 # define atomic_exchange_acq(mem, newvalue) \
-  ({ __typeof (*(mem)) __atg5_oldval;					      \
+  ({ __typeof ((__typeof (*(mem))) *(mem)) __atg5_oldval;		      \
      __typeof (mem) __atg5_memp = (mem);				      \
-     __typeof (*(mem)) __atg5_value = (newvalue);			      \
+     __typeof ((__typeof (*(mem))) *(mem)) __atg5_value = (newvalue);	      \
 									      \
      do									      \
        __atg5_oldval = *__atg5_memp;					      \
@@ -668,7 +668,7 @@ void __atomic_link_error (void);

 # ifndef atomic_load_relaxed
 #  define atomic_load_relaxed(mem) \
-   ({ __typeof (*(mem)) __atg100_val;					      \
+   ({ __typeof ((__typeof (*(mem))) *(mem)) __atg100_val;		      \
    __asm ("" : "=r" (__atg100_val) : "0" (*(mem)));			      \
    __atg100_val; })
 # endif
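
For readers unfamiliar with this idiom: the extra cast is what drops the
volatile qualification from the temporaries.  A cast never yields an
lvalue, and a cast to a qualified type behaves like a cast to its
unqualified version, so __typeof of the cast expression is the plain
type.  A hypothetical stand-alone illustration, relying only on GCC's
__typeof extension:

volatile int v;
__typeof (v) a;                   /* type: volatile int */
__typeof ((__typeof (v)) v) b;    /* type: int (qualifier dropped) */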

You could keep these changes, but it's a bit odd because you only apply
them to the functions you needed them for.  I think it would be better
to just remove the volatile qualification in the callers (i.e., cast
lock to non-volatile in pthread_spin_lock etc.).

Yes, I've only applied them to the functions needed in the spinlock code.
Removing volatile from the type pthread_spinlock_t is not an option, and
casting the volatile pointer to a non-volatile pointer is undefined.
See comment from Florian:
On 12/16/2016 09:12 PM, Florian Weimer wrote:
That's undefined:

“If an attempt is made to refer to an object defined with a
volatile-qualified type through use of an lvalue with
non-volatile-qualified type, the behavior is undefined.”

But we cannot drop the volatile qualifier from the definition of
pthread_spinlock_t because it would change the C++ mangling of
pthread_spinlock_t * and similar types.

Generally, I wouldn't agree with Florian's comment.  However, all we are
interested in is the storage behind the volatile-qualified type.  Users
aren't allowed to access the object through means other than the
pthread_spin* functions, so if we cast everywhere, we'll never have any
volatile-qualified accesses.  I believe none of the architectures we
support makes weaker requirements on alignment for volatile-qualified
than for non-volatile-qualified types, so I can't see any problem in
practice with the cast.
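
For concreteness, here is a minimal sketch of that suggestion, assuming
glibc's internal <atomic.h> macros and that pthread_spinlock_t is a
volatile int; it is only an illustration, not the actual patch:

int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  /* Cast away the volatile qualification once, up front; every atomic
     macro below then operates on a non-volatile int.  */
  int *ilock = (int *) lock;

  if (__glibc_likely (atomic_exchange_acquire (ilock, 1) == 0))
    return 0;

  do
    {
      /* Spin with relaxed MO reads until the lock looks free.  */
      while (atomic_load_relaxed (ilock) != 0)
        ;
    }
  while (atomic_exchange_acquire (ilock, 1) != 0);

  return 0;
}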



+  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
     return 0;

+  int val;
   do
     {
       /* The lock is contended and we need to wait.  Going straight back
@@ -50,20 +53,39 @@ pthread_spin_lock (pthread_spinlock_t *lock)
 	 On the other hand, we do want to update memory state on the local core
 	 once in a while to avoid spinning indefinitely until some event that
 	 will happen to update local memory as a side-effect.  */
-      if (SPIN_LOCK_READS_BETWEEN_CMPXCHG >= 0)
+
+#if SPIN_LOCK_READS_BETWEEN_CMPXCHG >= 0
+      /* Use at most SPIN_LOCK_READS_BETWEEN_CMPXCHG plain reads between the
+	 atomic compare and exchanges.  */
Also changed this comment to:
/* Use at most SPIN_LOCK_READS_BETWEEN_CMPXCHG relaxed MO reads between
    the atomic compare and exchanges.  */

Thanks.
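
To make the knob concrete, here is a sketch of the loop shape under
discussion (hypothetical, using glibc's internal atomic macros and
ignoring the volatile question above; not the actual patch), in which
SPIN_LOCK_READS_BETWEEN_CMPXCHG bounds the number of relaxed MO reads
between two CAS attempts:

int val;
do
  {
    int cnt = SPIN_LOCK_READS_BETWEEN_CMPXCHG;
    /* Spin with relaxed MO reads while the lock looks taken and the
       read budget is not yet exhausted, then retry the CAS.  */
    while (atomic_load_relaxed (lock) != 0 && cnt-- > 0)
      ;
    val = 0;
  }
while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));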

A while ago we tried hard to remove all code that would fail silently
when a macro had a typo in the name, for example.  We still have a few
of them left (e.g., in the atomics), but I think this here doesn't
warrant an exception.

Okay. You're right.

In a comment on patch 2/2, you've mentioned a header where an
architecture shall define those parameters.
Thus I've added the file nptl/pthread_spin_parameters.h.

I think the stub version should be in sysdeps/ but the wiki is missing
this information (default ENOTSUP location):
https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
In case we need a new header, shall we move it to sysdeps/nptl/ ?


I think we should just get rid of SPIN_LOCK_READS_BETWEEN_CMPXCHG, and
make some choice for SPIN_TRYLOCK_LOAD_AND_TEST_BEFORE_XCHG.  There is
no technical reason for throwing in a CAS every now and then, and so far
we have no evidence that it can improve performance (if that were the
case, we'd need to think harder about that, because we have
spin-waiting loops elsewhere that don't want to modify the value after
waiting for a change of the value).
Okay. For s390 we don't need a CAS every now and then.


With that applied you don't have to solve the header problem in this
patch.

I'll write a separate email with suggestions about how to refactor the
spinlock code project-wide.  Also, some more comments on the tuning
parameters below.

I'm still a bit undecided about how to deal with other arch's
pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
and then include the generic pthread_spin_lock.c.  Maybe it's best if we
just remove them in this patch of yours.

I think the archs currently using the generic implementation have just
copied the default value to get rid of the warning "machine-dependent
file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG".  But this is only
an assumption.  In general, removing those "wrapper" pthread_spin_lock.c
files and using information from a header like the proposed
pthread_spin_parameters.h or atomic-machine.h is a good approach.


+#if SPIN_TRYLOCK_USE_CMPXCHG_INSTEAD_OF_XCHG == 1
+  /* Load and test the spinlock and only try to lock the spinlock if it is
+     free.  */
+  int val = atomic_load_relaxed (lock);
+  if (__glibc_likely (val == 0 && atomic_exchange_acquire (lock, 1) == 0))

I think that's not quite the same as a choice between CAS and exchange.
Yes. You are right.
Doing a load first could be helpful for *both* CAS and exchange.  OTOH,
if we assume that trylock will succeed most of the time, then the load
is unnecessary and might be more costly (eg, if it causes the state of
the cache line to change twice).

E.g. on s390, there is no instruction which atomically exchanges the
value; instead, a CAS loop is used.  The CAS instruction locks the
cache line exclusively regardless of whether the value in memory equals
the expected old value or not.  Therefore the value is loaded and
compared before using the CAS instruction.
If no other CPU has accessed the lock value, the load will change the
state of the cache line to exclusive.
If the lock is not acquired, the subsequent CAS instruction does not
need to change the state of the cache line.
If another CPU has accessed the lock value, e.g. by acquiring the lock,
the load will change the state of the cache line to either read-only
or exclusive, depending on whether the other CPU has already stored a
new value or not.

As this behaviour depends on the architecture, the architecture has to
decide whether it should load and test the lock value before the atomic
operation.
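
As a rough sketch, an exchange implemented as a CAS loop with a
preceding load might look as follows, written with GCC's __atomic
builtins purely for illustration (a hypothetical helper, not s390
assembly or actual glibc code):

static inline int
exchange_via_cas (int *mem, int newval)
{
  /* Load the current value first and feed it to the CAS as the expected
     value, as described above; retry until the CAS succeeds in storing
     NEWVAL, and return the previous value.  */
  int old = __atomic_load_n (mem, __ATOMIC_RELAXED);
  while (!__atomic_compare_exchange_n (mem, &old, newval, 1 /* weak */,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    ;  /* On failure, OLD has been updated to the current value.  */
  return old;
}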

I agree that there are architecture-specific properties of atomic
operations that have a significant effect on performance.  For s390, you
list the following points:
* atomic exchange (I suppose all atomic read-modify-write ops?) are
implemented through CAS.
atomic exchange is implemented by a CAS loop due to the lack of an
exchange instruction.  For exchanging to a "0", s390 can use the
load-and-and instruction instead of a CAS loop (see my atomic-machine.h
patch; for gcc, we plan to emit such a load-and-and instruction for the
builtin __atomic_exchange_n in the future).  This also saves one
register compared to the CAS loop.  Exchanging to "0" is used e.g. in
the lll_unlock macro or in malloc_consolidate.

Starting with z196 zarch CPUs, the following instructions exist and are
used by the appropriate __atomic_fetch_xyz builtins instead of a CAS
loop: load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
As information:
I will update my atomic-machine.h patch to use at least some of the C11
atomic builtins, or all of them, depending on the gcc version.
As additional information:
E.g. RHEL 7 / SLES 12 use z196 as the architecture baseline level.
A 64-bit s390x glibc build always uses zarch instructions.
A 31-bit s390 glibc build does not use zarch instructions by default,
but one can build glibc with the gcc -mzarch option.
The latter means that a CAS loop is used for a default 31-bit s390
build.
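
As an illustration of the exchange-to-"0" case: a hypothetical snippet
using GCC's __atomic builtins (not actual glibc code; the memory order
is chosen arbitrarily here):

int
exchange_zero (int *mem)
{
  /* ANDing with 0 stores 0, and the "fetch" part returns the previous
     value, which is exactly what exchanging in a 0 does; on z196 and
     newer this can map to a load-and-and instruction instead of a CAS
     loop.  */
  return __atomic_fetch_and (mem, 0, __ATOMIC_ACQ_REL);
}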

* CAS always brings the cache line into an exclusive state.
Yes, that's true.


(Is the second point correct?  I'm not quite sure I understood you
correctly: for example, you write that a load moves a cache line to an
exclusive state, which seems odd).
Not in all cases, but only if no other CPU accesses this cache line;
otherwise it is read-only.


These properties are specific to the architecture, but they are not
specific to the synchronization code (eg, to spinlocks).  Thus, I think
such settings should be in the atomics headers (i.e., in
atomic-machine.h).
This should probably be a separate patch.  I would propose the following
macros (both are bool):
/* Nonzero iff all atomic read-modify-write operations (e.g., atomic
exchange) are implemented using a CAS loop.  */
#define ATOMIC_RMW_USES_CAS 1
Is one macro enough to describe all read-modify-write operations in include/atomic.h?
Please also see my comment above.

/* Nonzero iff CAS always assumes it will store, and thus has cache
performance effects similar to a store (e.g., it always puts the
respective cacheline into an exclusive state, even if the comparison
against the expected value fails).  */
#define ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1

I'm not sure whether there are architectures for which the second macro
would not be true.  It would be good to investigate this, maybe we don't
need to add this at all.

We plan that, in the future, gcc will emit e.g. a load-and-test
instruction in front of the CAS instruction if the old value is a
constant zero.


For the spin lock specifically, we have the following possibilities:

1) If we assume that trylock will most likely succeed in practice:
* We just do an exchange.  The properties above don't matter.

2) If we want to bias towards cases where trylock succeeds, but don't
rule out contention:
* If exchange is not a CAS loop, and exchange is faster than CAS, do an
exchange.
* If exchange is a CAS loop, use a weak CAS and not an exchange so we
bail out after the first failed attempt to change the state.

3) If we expect contention to be likely:
* If CAS always brings the cache line into an exclusive state, then load
first.  Then do 2).

Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
buys us anything over 2)?

Yes. Sounds good.
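
To make options 2) and 3) concrete, here is a sketch of how
pthread_spin_trylock could use the two proposed knobs.  Note that
ATOMIC_RMW_USES_CAS and ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE are only
proposed above and do not exist yet; glibc's internal <atomic.h> and
<errno.h> are assumed, and the volatile question discussed earlier is
ignored for brevity:

int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
#if ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE
  /* Option 3): a failing atomic RMW would still take the cache line
     exclusively, so check with a relaxed MO load first.  */
  if (atomic_load_relaxed (lock) != 0)
    return EBUSY;
#endif
#if ATOMIC_RMW_USES_CAS
  /* Option 2): exchange would itself be a CAS loop, so use a single
     weak CAS and bail out after the first failed attempt.  */
  int val = 0;
  return atomic_compare_exchange_weak_acquire (lock, &val, 1) ? 0 : EBUSY;
#else
  return atomic_exchange_acquire (lock, 1) == 0 ? 0 : EBUSY;
#endif
}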


diff --git a/nptl/pthread_spin_unlock.c b/nptl/pthread_spin_unlock.c
index 5fd73e5..f83b696 100644
--- a/nptl/pthread_spin_unlock.c
+++ b/nptl/pthread_spin_unlock.c
@@ -23,7 +23,9 @@
 int
 pthread_spin_unlock (pthread_spinlock_t *lock)
 {
-  atomic_full_barrier ();
-  *lock = 0;
+  /* The atomic_store_release synchronizes-with the atomic_exchange_acquire
+     or atomic_compare_exchange_weak_acquire in pthread_spin_lock /
+     pthread_spin_trylock.  */
+  atomic_store_release (lock, 0);
   return 0;
 }

I agree with this change.  However, depending on how one interprets
POSIX' memory model, one may expect lock releases to be sequentially
consistent.  Nonetheless, IMO, glibc should use only release MO.

But we need to announce this change.  Some of us have been considering
an additional section in the manual where we specify where we deviate
from POSIX; this change might be the first we're listing there (though
to be fair, this is a deviation only on some architectures, because on
others, such as powerpc I think, we've been using release MO forever).


As not all architectures use the generic spinlock implementation, what
about a NEWS entry like:

* The new generic spinlock implementation uses C11-like atomics.
   The pthread_spin_unlock implementation now uses release memory order
   instead of sequentially consistent memory order.

That's a good start, but I think we need to be more specific.  We also
don't need to say that we use C11-like atomics because that's an
implementation detail.

I'd add something like this instead:

The synchronization that pthread_spin_unlock performs has been changed
to now be equivalent to a C11 atomic store with release memory order to
the spin lock's memory location.  This ensures correct synchronization
for the spin lock's operations and critical sections protected by a spin
lock.  Previously, several (but not all) architectures used stronger
synchronization (e.g., containing what is often called a full barrier).
This change can improve performance, but may affect odd fringe uses of
spin locks that depend on the previous behavior (e.g., using spin locks
as atomic variables to try to implement Dekker's mutual exclusion
algorithm).

This should highlight the potential change for the weird (ab)uses of
spin locks but also make it clear that this won't affect the correctness
of the common code.
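
For the record, here is a rough illustration of the Dekker-style pattern
such fringe uses rely on (hypothetical, written with C11 atomics rather
than real spin lock calls):

#include <stdatomic.h>

atomic_int flag[2];   /* each thread announces itself in flag[self] */

int
try_enter (int self)
{
  /* Dekker-style entry: store my flag, then check the peer's flag.
     This only works if the store and the subsequent load cannot be
     reordered (e.g., both seq_cst, as below); a release store alone
     does not provide that store-load ordering, which is why the fringe
     uses mentioned above could be affected.  */
  atomic_store_explicit (&flag[self], 1, memory_order_seq_cst);
  return atomic_load_explicit (&flag[1 - self], memory_order_seq_cst) == 0;
}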

Okay. The next patch iteration will use the text above.

