This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] [RFC] nptl: use compare and exchange for lll_cond_lock


On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
> On 25-09-2014 21:20, Adhemerval Zanella wrote:
> > On 25-09-2014 15:00, Torvald Riegel wrote:
> >> On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
> >>> While checking the generated code and macros used in generic lowlevellock.h,
> >>> I noted powerpc and other arch uses uses a compare and swap instead of a plain
> >>> exchange value on lll_cond_lock.
> >>> I am not really sure which behavior would be desirable, since as far I could
> >>> they will have both the same side effects (since lll_cond_lock, different
> >>> from lll_lock, does not hold value of '1').
> >> What do you mean by "[the function] does not hold value of '1'"?
> > Bad wording in fact, I mean the 'futex' used in lll_cond_lock.
> >
> >>> So I am proposing this patch to sync default implementation for what mostly
> >>> architectures (ia64, ppc, s390, sparc, x86, hppa) uses for lll_cond_lock.  I see
> >>> that only microblaze and sh (I am not sure about this one, I not well versed in
> >>> its assembly and I'm being guided by its comment) used the atomic_exchange_acq
> >>> instead.
> >> I think both versions work from a correctness POV, but doing an
> >> unconditional exchange should be the right thing to do.
> >>
> >> The default implementation of __lll_lock_wait will test if the futex
> >> variable equals 2, and if not, do an exchange right away before running
> >> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
> >> thing that will happen is an exchange.  Thus, it seems that we should do
> >> the exchange right away.
> >>
> >> Thoughts?
> > The only 'advantage' I see on using the compare and exchange version is it might be
> > an optimization on architectures that uses LL/SC instead of CAS instruction.  For
> > instance on POWER, the exchange version is translated to:
> >
> > 	li 	r9,2
> >  1:     lwarx   10,0,3,1
> >         stwcx.  9,0,3
> >         bne-    1b
> >         isync
> >
> > And for compare and exchange:
> >
> > 	li	r10,2
> > 	li 	r9,0
> > 1:      lwarx   r8,r0,r3,1
> >         cmpw    r8,r9
> > 	bne     2f
> > 	stwcx.  r10,r0,r3
> > 	bne-    1b
> > 2:      isync
> >
> > So for contend cases if the lock is taken it avoids the store (which for POWER8 is
> > at least 10 cycles to more).
> 
> Does this analysis make sense?

I can't comment on POWER because I haven't obtained nor seen any data on
this.  Not sure what branch prediction and speculative execution would
make out of this.

However, another thing to not from a correctness perspective is that
depending on the calling code, it *can make a difference* whether you do
the exchange or the CAS: If the exchange uses a release barrier, then if
the replacement CAS' load avoids the store, you will also not get the
effect of the release barrier.  Which can matter depending on how you
build the calling algorithm.  For the POWER case, you could try to
counter that with an appropriate barrier in the failure path of the CAS.
Anyway, it's not a simple replacement.

> Also, I'm not sure which is better for x86_64 or other
> architectures.

The specifc read-modify-write should be faster.  The actual operation is
easier to see for the chip, the window where something else can
interfere should be smaller, etc.  I haven't done recent measurements,
but a few years ago a CAS increment loop scaled significantly worse than
an atomic-increment on x86.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]