This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 1/2] S390: Optimize atomic macros.


On 11/24/2016 01:51 PM, Torvald Riegel wrote:
On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
The atomic_compare_and_exchange_val_acq macro is now implemented with
gcc __sync_val_compare_and_swap instead of inline assembly with a
compare-and-swap instruction.

Can you use the new GCC atomic builtins, or are they still insufficient
on s390?

Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
Generally, this is where we want to get to, so that we have to maintain
as few atomic operations as possible.

Also note that we'll be phasing out the old-style glibc atomics
eventually in favor of the new C11-like atomics.

I've activated USE_ATOMIC_COMPILER_BUILTINS and tried an example like this (also compare it with pthread_once):
int
foo (int *lock)
{
  int val, newval;
  val = atomic_load_acquire (lock);
  do
    {
      newval = val | 123;
    }
  while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));

  return 0;
}

The assembly produced by GCC 5.4/6.2 sets up a new stack frame, and the old value is stored to and loaded from that stack frame before the cs instruction. After the cs, the old value is stored again and reloaded if a further round of cs is needed. The condition code of cs is extracted into a register and then compared to decide whether a further round of cs is needed.

The cs instruction itself already updates the old-value register, so a conditional jump could be used directly for a further round.

Current upstream gcc produces better code, but it's still not perfect.
I've already informed Andreas Krebbel.

Thus I won't set USE_ATOMIC_COMPILER_BUILTINS to 1 now.
Once gcc is fixed, I will activate it only for that version and newer.
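Once a fixed gcc is available, this could e.g. be gated in atomic-machine.h roughly like this (sketch only; the version number is just a placeholder, not a confirmed one):

#if defined __GNUC__ && __GNUC__ >= 7  /* placeholder version */
# define USE_ATOMIC_COMPILER_BUILTINS 1
#else
# define USE_ATOMIC_COMPILER_BUILTINS 0
#endif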


If OLDVAL is constant at compile time, the memory is compared against the
expected OLDVAL before the compare-and-swap instruction is used.  This is
used in various locking code.  If the lock is already acquired, another
CPU does not have to lock the memory exclusively.  If OLDVAL is not
constant, the compare-and-swap instruction is used directly, as users of
this macro usually load the current value right before invoking it.

Generally, I think the compiler should do these optimizations (under the
assumption that we drive these using builtin_expect and the like).  If
GCC is not doing these (with the new __atomic* builtins), is there a patch
about that or at least a BZ?  I'd be okay with accepting an intermediate
solution that optimizes atomics in glibc, but I'd want to make this
clear in code comments and reference the GCC BZ that makes the
improvement request; this ensures that we can drop the glibc hack once
GCC does what you want it to do.

No. This comparison will not be included in the gcc builtins, as they are intended to be only the compare-and-swap instruction itself.

For this particular case, I'm hoping we can just fix this in the
compiler.

Regarding your patch, the compiler barrier is in the wrong position for
an acquire MO operation; for the new C11-like atomics, it would be
unnecessary because we only guarantee relaxed MO in the failure case.

If *mem is equal to the constant oldval, the sync builtin acts as a full barrier after loading *mem. If *mem is not equal to the constant oldval, you are right: there is no barrier after the load, and the compiler could hoist memory accesses before the load of *mem.
Thus I can add the compiler barrier in the else path, like below:
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)\
+  ({ __asm__ __volatile__ ("" ::: "memory");			\
+    __typeof (*(mem)) __atg1_ret;				\
+    if (!__builtin_constant_p (oldval)				\
+	|| __builtin_expect ((__atg1_ret = *(mem))		\
+			     == (oldval), 1))			\
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),\
+						(newval));	\
+    else							\
+      __asm__ __volatile__ ("" ::: "memory");			\
+    __atg1_ret; })

The intention of the full barrier before the load and the if-statement is to "force" the load and compare to happen directly before the cs instruction.
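For illustration, a typical user of this macro passes a compile-time-constant oldval, e.g. a hypothetical trylock-style caller like this (not actual glibc code):

/* Returns 0 if the lock was acquired, nonzero if it was already taken.
   Because oldval (0) is constant here, the macro above first loads *futex
   and skips the cs instruction when the lock is already taken.  */
static inline int
try_acquire (int *futex)
{
  return atomic_compare_and_exchange_val_acq (futex, 1, 0) != 0;
}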

The same applies to atomic_compare_and_exchange_bool_acq, which wasn't
defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
If the macro is used as a condition in an if/while expression, the condition
code is used to e.g. jump directly to another code sequence.  Before this
change, the old value returned by the compare-and-swap instruction was
compared with the given OLDVAL to determine whether e.g. a jump is needed.

I will add the barrier here in the same way as mentioned above.
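I.e. roughly along these lines (only a sketch, not the final patch):

#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
  ({ __asm__ __volatile__ ("" ::: "memory");				\
    int __atg2_fail = 1;						\
    if (!__builtin_constant_p (oldval)					\
	|| __builtin_expect (*(mem) == (oldval), 1))			\
      /* Like before, return 0 if *mem was changed to newval.  */	\
      __atg2_fail = !__sync_bool_compare_and_swap ((mem), (oldval),	\
						   (newval));		\
    else								\
      __asm__ __volatile__ ("" ::: "memory");				\
    __atg2_fail; })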

The atomic_exchange_acq macro now uses the load-and-and instruction for a
constant zero value instead of a compare-and-swap loop.  This instruction is
available on z196 zarch and higher CPUs.  This is used e.g. in unlocking code.

See above.  This should be fixed in the compiler.

There is no __sync builtin that can be used for exchanging a value as done in atomic_exchange_acq, thus I have to do it in this glibc macro. But you are right, the compiler could do this for the C11 builtin __atomic_exchange_n. Even if that is fixed in a new version, though, builds with older compilers won't use the load-and-and instruction. In the case of e.g. lll_unlock, an additional register loaded with the old value and an additional jump would then be needed.
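Just for illustration (this is not the actual macro): exchanging a constant zero can be expressed as an atomic AND with zero, which returns the old value and which gcc can compile to the z196 load-and-and instruction:

/* Sketch: atomically store 0 into *mem and return the previous value.  */
static inline int
exchange_zero (int *mem)
{
  return __sync_fetch_and_and (mem, 0);
}

In that form, the additional register load of the old value and the additional jump of a compare-and-swap loop are not needed in e.g. lll_unlock.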


The newly defined atomic_exchange_and_add macro is implemented with the gcc
builtin __sync_fetch_and_add, which uses the load-and-add instruction on z196
zarch and higher CPUs instead of a loop with the compare-and-swap instruction.
The same applies to the atomic_or_val, atomic_and_val, ... macros, which use
the appropriate z196 instruction.

Can you just use the new __atomic builtins?  The compiler should just
give us __atomic* builtins that are optimized, and we should use them.

I've checked whether __atomic_fetch_or / __atomic_fetch_add use the
lao / laa instructions, and yes, they do.
As there are issues with the __atomic* builtins as mentioned above,
using __sync_fetch_* is currently needed here.
I don't intend to use a mix of __sync and __atomic builtins.
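E.g. with a small test like this (assuming -march=z196 or newer), both kinds of builtins end up using the interlocked-access instructions:

/* Expected to compile to laa (load-and-add).  */
int
fetch_add_one (int *counter)
{
  return __sync_fetch_and_add (counter, 1);
}

/* Expected to compile to lao (load-and-or).  */
int
fetch_or_flag (int *flags)
{
  return __atomic_fetch_or (flags, 4, __ATOMIC_ACQUIRE);
}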

With that in place, we could just implement the old-style glibc atomic
operations based on the __atomic* builtins, and have much less code to
maintain.

Yes, you are right, this is the correct way forward.
I saw that at least aarch64 is already doing it this way.
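The old-style macros could then become thin wrappers around the builtins, roughly like this (sketch only):

#define atomic_exchange_and_add_acq(mem, value) \
  __atomic_fetch_add ((mem), (value), __ATOMIC_ACQUIRE)
#define atomic_exchange_and_add_rel(mem, value) \
  __atomic_fetch_add ((mem), (value), __ATOMIC_RELEASE)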

The macros lll_trylock and lll_cond_trylock are extended by a __glibc_unlikely
hint.  With the hint, gcc on s390 emits code in e.g. pthread_mutex_trylock
that does not use jumps if the lock is free.  Without the hint it had to
jump if the lock was free.

I think it's debatable whether trylock should expect the lock to be
free.  OTOH, if it's not free, I guess that we should be in the slow
path in most use cases anyway.  Either way, I think it's a choice we
should make generically for all architectures, and I want to keep
arch-specific code to a minimum.  Thus, please split this out and
propose it for the generic lowlevellock.

Okay. I will send a separate patch to propose this change for common-code.
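The generic definition with the hint would then look roughly like this (sketch of the proposed change, not the final patch):

/* Expect the lock to be free in the trylock fast path.  */
#define lll_trylock(lock) \
  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 1, 0))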

