This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
> >>> A while ago we tried hard to remove all code that would fail silently
> >>> when a macro had a typo in the name, for example.  We still have a few
> >>> of them left (e.g., in the atomics), but I think this here doesn't
> >>> warrant an exception.
> >>>
> >> Okay. You're right.
> >>
> >> In comment of patch 2/2, you've mentioned a header where an architecture
> >> shall define those parameters.
> >> Thus I've added the file nptl/pthread_spin_parameters.h.
> >
> > I think the stub version should be in sysdeps/ but the wiki is missing
> > this information (default ENOTSUP location):
> > https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
> In case we need a new header, shall we move it to sysdeps/nptl/ ?

I would guess that this would be the right place for a stub / generic
variant of such a header, but I'm not quite sure.

> >
> > I'm still a bit undecided about how to deal with other arch's
> > pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
> > and then include the generic pthread_spin_lock.c.  Maybe it's best if we
> > just remove them in this patch of yours.
> >
> I think the archs currently using the generic implementation have just 
> copied the default value to get rid of the warning "machine-dependent 
> file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG". But this is only an 
> assumption.
> In general removing those "wrapper"-pthread_spin_lock.c files and using 
> information from a header like the proposed pthread_spin_parameters.h or 
> atomic-machine.h is a good approach.

Okay.  So let's do that.
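
For reference, such a wrapper file typically amounts to no more than this
(a sketch; the exact tuning value and include path vary by architecture):

  /* Hypothetical arch wrapper: set the tuning knob, then pull in the
     generic implementation.  */
  #define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
  #include <nptl/pthread_spin_lock.c>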

> > I agree that there are architecture-specific properties of atomic
> > operations that have a significant effect on performance.  For s390, you
> > list the following points:
> > * atomic exchange (I suppose all atomic read-modify-write ops?) is
> > implemented through CAS.
> atomic exchange is implemented by a CAS loop due to the lack of an 
> exchange instruction. For exchanging to "0", s390 can use the 
> load-and-and instruction instead of a CAS loop (see my atomic-machine.h 
> patch; for gcc, we plan to emit such a load-and-and instruction for the 
> builtin __atomic_exchange_n in the future). This also saves one register 
> compared to the CAS loop.
> Exchanging to "0" is used e.g. in the lll_unlock macro or in 
> malloc_consolidate.

I guess this may then be yet another property atomic-machine.h could
expose: whether an atomic fetch-and-and exists and is not implemented
through a CAS loop.  Something like lll_unlock would then do an
exchange or fetch-and-and, and if neither is available natively, fall
back to a CAS loop.
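
Roughly like this (a sketch only; ATOMIC_EXCHANGE_USES_CAS and
ATOMIC_FETCH_AND_USES_CAS are the hypothetical macros proposed below,
using the gcc builtins for brevity):

  /* Unlock: store a 0 with whichever RMW operation is not a CAS loop
     on this architecture.  */
  #if !ATOMIC_EXCHANGE_USES_CAS
    __atomic_exchange_n (&lock, 0, __ATOMIC_RELEASE);
  #elif !ATOMIC_FETCH_AND_USES_CAS
    /* ANDing with 0 stores 0; on s390 this is a single load-and-and.  */
    __atomic_fetch_and (&lock, 0, __ATOMIC_RELEASE);
  #else
    /* Neither is native; the exchange decays to a CAS loop anyway.  */
    __atomic_exchange_n (&lock, 0, __ATOMIC_RELEASE);
  #endif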

> Starting with z196 zarch CPUs, the following instructions are available 
> and are used by the appropriate __atomic_fetch_xyz builtins instead of a 
> CAS loop: load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
> For your information:
> I will update my atomic-machine.h patch to use at least some of the C11 
> atomic builtins, or all of them, depending on the gcc version.

Thanks.
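
For reference, that means each of these gcc builtins can then compile
to a single instruction instead of a CAS loop (a sketch; v and m are
just placeholder variables, and the mapping is per your list above):

  __atomic_fetch_add (&v, 1, __ATOMIC_ACQ_REL);  /* load-and-add */
  __atomic_fetch_and (&v, m, __ATOMIC_ACQ_REL);  /* load-and-and */
  __atomic_fetch_xor (&v, m, __ATOMIC_ACQ_REL);  /* load-and-exclusive-or */
  __atomic_fetch_or  (&v, m, __ATOMIC_ACQ_REL);  /* load-and-or */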

> > These properties are specific to the architecture, but they are not
> > specific to the synchronization code (eg, to spinlocks).  Thus, I think
> > such settings should be in the atomics headers (i.e., in
> > atomic-machine.h).
> > This should probably be a separate patch.  I would propose the following
> > macros (both are bool):
> > /* Nonzero iff all atomic read-modify-write operations (e.g., atomic
> > exchange) are implemented using a CAS loop.  */
> > #define ATOMIC_RMW_USES_CAS 1
> Is one macro enough to describe all read-modify-write operations in 
> include/atomic.h?
> Please also see my comment above.

Yes, it may not be enough (e.g., you say that s390 may have fetch_and
but not an exchange, whereas other archs may have an exchange but no
custom fetch_and).
So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
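
So for s390, based on what you describe, the stanza might look like this
(hypothetical names, matching the proposal above):

  /* Exchange has no native instruction and falls back to a CAS loop,
     but z196 has a native load-and-and.  */
  #define ATOMIC_EXCHANGE_USES_CAS  1
  #define ATOMIC_FETCH_AND_USES_CAS 0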

> > /* Nonzero iff CAS always assumes it will store, and thus has cache
> > performance effects similar to a store (e.g., it always puts the
> > respective cacheline into an exclusive state, even if the comparison
> > against the expected value fails).  */
> > #define ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1
> >
> > I'm not sure whether there are architectures for which the second macro
> > would not be true.  It would be good to investigate this, maybe we don't
> > need to add this at all.
> >
> We plan that, in the future, gcc will emit e.g. a load-and-test 
> instruction in front of the CAS instruction if the old value is a 
> constant zero.

That can be useful, but what I outlined would be about a more generic
case.  If you are going to solve this for your arch in gcc, you might
not need to address this in this patch.

> >
> >
> > For the spin lock specifically, we have the following possibilities:
> >
> > 1) If we assume that trylock will most likely succeed in practice:
> > * We just do an exchange.  The properties above don't matter.
> >
> > 2) If we want to bias towards cases where trylock succeeds, but don't
> > rule out contention:
> > * If exchange is not a CAS loop, and exchange is faster than CAS, do an
> > exchange.
> > * If exchange is a CAS loop, use a weak CAS and not an exchange so we
> > bail out after the first failed attempt to change the state.
> >
> > 3) If we expect contention to be likely:
> > * If CAS always brings the cache line into an exclusive state, then load
> > first.  Then do 2).
> >
> > Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
> > buys us anything over 2)?
> >
> Yes. Sounds good.

I read this as stating that you agree that 1) doesn't need to be
considered.  But we still need to choose between 2) and 3)  :)

I'm not quite sure what to favor because some of the code out there that
uses trylock just uses it to prevent deadlock (e.g., when trying to
acquire multiple locks at once), whereas other code uses spin_trylock to
implement its own back-off and bounded spinning.
I lean towards assuming that lock acquisitions, including the former use
case for trylock, are supposed to succeed in the common case.  That is,
I would pick 2), but I have no data to back this up.
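
To make 2) and 3) concrete, here is a trylock sketch using the
hypothetical macros from above (not actual glibc code; untested):

  #include <errno.h>
  #include <pthread.h>

  int
  pthread_spin_trylock (pthread_spinlock_t *lock)
  {
  #if ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE
    /* 3): Load first so a failed attempt does not pull the cache line
       into an exclusive state.  */
    if (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
      return EBUSY;
  #endif
  #if !ATOMIC_EXCHANGE_USES_CAS
    /* 2a): Exchange is a native instruction, so just use it.  */
    return __atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) == 0 ? 0 : EBUSY;
  #else
    /* 2b): Exchange would be a CAS loop anyway; use a single weak CAS
       and bail out after the first failed attempt.  */
    int expected = 0;
    return __atomic_compare_exchange_n (lock, &expected, 1, 1,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)
           ? 0 : EBUSY;
  #endif
  }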

Either way, whatever we choose, we should polish and document the cases
1-3 above and the reasoning behind our choice of one of them, and add
that as a comment to the spinlock code.

