This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] powerpc: Spinlock optimization and cleanup


On Fri, 2015-02-27 at 11:57 -0300, Adhemerval Zanella wrote:
> On 27-02-2015 09:20, Torvald Riegel wrote:
> > On Thu, 2015-02-26 at 18:35 -0300, Adhemerval Zanella wrote:
> >> This patch optimizes the powerpc spinlock implementation by:
> >>
> >> * The current algorithm spins on a lwzx, but only after issuing a lwarx
> >>   first.  The optimization is to avoid the initial lwarx in the contended
> >>   case (it is much more costly than a normal load).
> > IOW, the previous code was a TAS-and-test-and-TAS loop, and you're
> > changing it into a test-and-TAS loop?
> 
> If by 'test-and-TAS loop' you mean it now loops on the lwarx instruction, that
> is not the case: the first test is a 'sniff' check on the value to avoid the costly
> LL/SC instruction that creates the cacheline reservation.  The loop while the
> lwarx cannot be committed is still done using normal load operations.
> 
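For illustration, the shape being described is roughly the test-and-test-and-set
loop below, sketched with the GCC __atomic builtins rather than glibc's internal
atomics; the function name is made up and this is not the patch code:

/* Sketch: spin on plain loads and only attempt the reserving LL/SC
   sequence (via the builtin CAS) once the lock looks free.  */
static void
spin_lock_sketch (int *lock)
{
  for (;;)
    {
      /* Sniff with a plain (relaxed) load: no reservation, no request
         for write ownership of the cache line.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;  /* Lock is held: keep spinning on normal loads.  */

      /* Lock looks free: now try the expensive atomic operation, which
         on powerpc compiles to a lwarx/stwcx. loop.  */
      int expected = 0;
      if (__atomic_compare_exchange_n (lock, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;  /* Acquired.  */
    }
}
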
> >
> > Are you sure that this gives enough benefit for the contended case to
> > offset potential disadvantages for the uncontended case in which the
> > lock is not yet in the cache?  That is, whether the unnecessary
> > transition to write ownership when the lock is already acquired is more
> > important to avoid than the potential transition to shared first
> > followed by a transition to ownership?
> > (I'm not familiar enough with powerpc to answer any of that; I just want
> > to make sure the reasons for our decisions are clear and documented.
> > I've done a few measurements on other architectures in the past, and
> > there the test-and-TAS wasn't necessarily better...)
> 
> In fact this hint that a test sniff should be used came directly from
> our chip designers.  However, they were not really clear about why the sniff
> is necessary, just that lwarx is indeed far more costly and can
> change contention and add internal bus traffic in the contended case.  So,
> in short, I am not 100% sure this is the best course of action.
> 
The background here is that POWER systems have traditionally run the
coherence protocol (and reservations) from the L2 (for scale-up
servers).  This scales well for large servers with many independent
processes, but performs poorly on single-threaded benchmarks.

Recent processor designs (P7+ and P8) have added improvements that
optimize the case where the lock word is already in the L1, so a simple
load as the prefetch is helpful on these processors.

Also, the LARX engages some heavy machinery that can slow down atomic
updates (while snooping) from other threads.  So it is best to spin on a
simple load, while blocked, until the lock word state change is visible
to this thread.

We are getting this information directly from the designer of the memory
nest.  This detail should be documented in the POWER8 User's Manual, which
is available from the OpenPOWER site.

> >
> > Do you have benchmarks to show the effects?  In this particular case,
> > I'd trust the machine maintainers with keeping the decisions we make
> > up-to-date wrt current hardware (ie, maintain the code so that we always
> > make the best trade-offs in the future); but it would be even better if
> > we had benchmarks.
> 
> That's the tricky part: we are discussing this w.r.t. some internal workloads and
> results are still yet to come.  However, I am trying to check some results myself
> using a subpar benchmark (basically sysbench memory using a spinlock instead of
> pthread_mutex_t).  I am seeing some improvements with high thread counts, but also
> some regressions with lower thread counts.  The lwarx hint bit is also making
> some difference, again showing both gains and regressions.  In the end, I think
> the workloads I am using are not well suited for this kind of evaluation.  Any
> suggestions?
> 
> > Have you investigated with adding back-off for the contended case?
> 
> That is something I have not investigated.
> 
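For context, back-off in the contended path would mean inserting a growing delay
between the plain loads, along the lines of the sketch below; the delay bounds
are arbitrary illustration values, not tuned numbers:

/* Sketch: exponential back-off while waiting for the lock to look free.  */
static void
spin_wait_with_backoff (int *lock)
{
  unsigned int delay = 8;              /* Initial number of idle iterations.  */
  const unsigned int max_delay = 1024; /* Cap on the back-off.  */

  while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
    {
      for (unsigned int i = 0; i < delay; i++)
        __asm__ __volatile__ ("" ::: "memory");  /* Keep the idle loop alive.  */
      if (delay < max_delay)
        delay *= 2;  /* Back off further each time the lock is still held.  */
    }
}
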
> >
> > What are your assumptions about the workload and the relative importance
> > of individual cases (e.g., likelihood of contended vs. uncontended)?  We
> > should agree on those globally (ie, across archs), and document them.
> 
> In fact I did not hold any assumptions; however, I have noted in recent powerpc
> ports I have been involved in that spinlocks are used by various DBs (and usually
> hand-crafted ones...).  However, I do not have data about what kind of case
> they assume either.
> 
> >
> >> * Use the correct EH hint bit on the larx for supported ISAs.  For lock
> >>   acquisition, the thread that acquired the lock with a successful stcx.
> >>   does not want to give away write ownership of the cacheline.  The
> >>   idea is to make the load reservation "sticky" about retaining write
> >>   authority to the line.  That way, the store that must inevitably come
> >>   to release the lock can succeed quickly and not contend with other
> >>   threads issuing lwarx.  If another thread does a store to the line
> >>   (false sharing), the winning thread must give up write authority.
> >>   The proper value of EH for the larx for a lock acquisition is 1.
> > I would strongly prefer if we could use C code for our locks, including
> > the spinlocks.
> >
> > Ignoring the cacheline ownership hint, is there any reason to use
> > assembly here (with either the current assembly implementation of
> > atomics that we have, or the new GCC __atomic builtins)?  If so, what
> > could / would we need to fix in these implementations of atomics?  Is
> > there a measurable difference?
> >
> > Regarding the ownership hint, what about extending the atomics that we
> > have with support for that?  This hint could be useful in other
> > concurrent code too, and we don't want to have to clone the, say, whole
> > rwlock implementation for powerpc just to add a few such hints.
> 
> We already have support for the EH hint in the LL/SC instructions for powerpc.  The trick
> in this algorithm is that it does not really fit the current atomic semantics we have
> so far, as '__arch_compare_and_exchange_val_32_acq' issues an 'isync' in both the
> fail and succeed cases.  The trick in this case is to emit the 'isync' only
> after a successful stcx. instruction.  Basically the idea is to loop, checking
> whether the lock is released using normal loads, 'inside' the CAS-acquire.  Suggestions?
> 
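To make the intended structure concrete, here is a rough sketch (not the patch
itself) of an acquire loop that spins on plain loads, takes the reservation with
lwarx and EH=1, and issues the isync only on the success path.  It assumes an
assembler that accepts the EH operand (ISA 2.06 or later) and a lock word where
zero means free:

/* Sketch only: plain-load spin, lwarx with EH=1, isync after a
   successful stwcx..  */
static void
spin_lock_eh_sketch (int *lock)
{
  int tmp;
  __asm__ __volatile__ (
    "0: lwz     %0,0(%2)\n\t"      /* Plain load: sniff the lock word.      */
    "   cmpwi   cr0,%0,0\n\t"
    "   bne     0b\n\t"            /* Still held: keep spinning on loads.   */
    "1: lwarx   %0,0,%2,1\n\t"     /* Take the reservation with EH=1.       */
    "   cmpwi   cr0,%0,0\n\t"
    "   bne     0b\n\t"            /* Lost the race: back to plain loads.   */
    "   stwcx.  %1,0,%2\n\t"
    "   bne     cr0,1b\n\t"        /* Reservation lost: retry the lwarx.    */
    "   isync"                     /* Acquire barrier on success only.      */
    : "=&r" (tmp)
    : "r" (1), "b" (lock)
    : "cr0", "memory");
}
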
> >
> >> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> >> new file mode 100644
> >> index 0000000..7af694f
> >> --- /dev/null
> >> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> >> @@ -0,0 +1,28 @@
> >> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
> >> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include "pthreadP.h"
> >> +#include <lowlevellock.h>
> >> +
> >> +int
> >> +pthread_spin_unlock (pthread_spinlock_t *lock)
> >> +{
> >> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
> >> +  *lock = 0;
> >> +  return 0;
> >> +}
> > Is there any reason to not use "atomic_store_release (lock)" here?  Or
> > adapt the generic pthread_spin_unlock to use just a release barrier
> > (which I'd be open to doing too), and get rid of the powerpc-specific
> > file?
> 
> atomic_store_release (lock) generates correct code; I will change it, thanks.
> 
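For reference, the suggested unlock then collapses to a single release store.
A minimal sketch using the GCC builtin (glibc itself would use its internal
atomic_store_release macro), which on powerpc typically lowers to lwsync
followed by a plain store:

#include <pthread.h>

int
my_spin_unlock (pthread_spinlock_t *lock)
{
  /* Release store: orders the critical-section accesses before the
     store that frees the lock.  */
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
  return 0;
}
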


