This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [RFC] nptl: use compare and exchange for lll_cond_lock


On Fri, 2014-10-10 at 09:45 -0300, Adhemerval Zanella wrote:
> On 10-10-2014 09:33, Torvald Riegel wrote:
> > On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
> >> On 25-09-2014 21:20, Adhemerval Zanella wrote:
> >>> On 25-09-2014 15:00, Torvald Riegel wrote:
> >>>> On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
> >>>>> While checking the generated code and macros used in the generic lowlevellock.h,
> >>>>> I noted that powerpc and other arches use a compare and swap instead of a plain
> >>>>> exchange value on lll_cond_lock.
> >>>>> I am not really sure which behavior would be desirable, since as far as I could
> >>>>> tell they will both have the same side effects (since lll_cond_lock, unlike
> >>>>> lll_lock, does not hold value of '1').
> >>>> What do you mean by "[the function] does not hold value of '1'"?
> >>> Bad wording in fact, I mean the 'futex' used in lll_cond_lock.
> >>>
> >>>>> So I am proposing this patch to sync the default implementation with what most
> >>>>> architectures (ia64, ppc, s390, sparc, x86, hppa) use for lll_cond_lock.  I see
> >>>>> that only microblaze and sh (I am not sure about this one, since I am not well
> >>>>> versed in its assembly and am being guided by its comment) use the
> >>>>> atomic_exchange_acq instead.
> >>>> I think both versions work from a correctness POV, but doing an
> >>>> unconditional exchange should be the right thing to do.
> >>>>
> >>>> The default implementation of __lll_lock_wait will test if the futex
> >>>> variable equals 2, and if not, do an exchange right away before running
> >>>> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
> >>>> thing that will happen is an exchange.  Thus, it seems that we should do
> >>>> the exchange right away.
> >>>>
> >>>> Thoughts?
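
For concreteness, the pattern described above can be sketched with C11
atomics and the raw Linux futex syscall in place of glibc's internal
atomic_* and lll_futex_* macros.  This follows the shape of the generic
code but is not the exact glibc source; futex states are 0 (unlocked),
1 (locked, no waiters) and 2 (locked, possible waiters):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>

static void
futex_wait (atomic_int *futex, int val)
{
  /* Sleep until *futex != val (or a spurious wakeup).  */
  syscall (SYS_futex, futex, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void
lll_lock_wait_sketch (atomic_int *futex)
{
  /* If the futex is already 2, just sleep; otherwise, and after every
     wakeup, set it to 2 with an exchange before sleeping, so that the
     unlocking thread knows it has to issue a FUTEX_WAKE.  */
  if (atomic_load_explicit (futex, memory_order_relaxed) == 2)
    futex_wait (futex, 2);
  while (atomic_exchange_explicit (futex, 2, memory_order_acquire) != 0)
    futex_wait (futex, 2);
}

static void
lll_cond_lock_sketch (atomic_int *futex)
{
  /* The proposed default: an unconditional exchange to 2.  If the lock
     was not free, fall into the wait loop, whose next action is again
     an exchange to 2; this is the redundancy noted above.  */
  if (atomic_exchange_explicit (futex, 2, memory_order_acquire) != 0)
    lll_lock_wait_sketch (futex);
}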
> >>> The only 'advantage' I see in using the compare and exchange version is that it
> >>> might be an optimization on architectures that use LL/SC instead of a CAS
> >>> instruction.  For instance on POWER, the exchange version is translated to:
> >>>
> >>>         li      r9,2
> >>> 1:      lwarx   r10,r0,r3,1
> >>>         stwcx.  r9,r0,r3
> >>>         bne-    1b
> >>>         isync
> >>>
> >>> And for compare and exchange:
> >>>
> >>>         li      r10,2
> >>>         li      r9,0
> >>> 1:      lwarx   r8,r0,r3,1
> >>>         cmpw    r8,r9
> >>>         bne     2f
> >>>         stwcx.  r10,r0,r3
> >>>         bne-    1b
> >>> 2:      isync
> >>>
> >>> So for contended cases, if the lock is taken it avoids the store (which for POWER8
> >>> costs at least 10 cycles or more).
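
In the same C11 terms as the sketch earlier in the thread (and reusing
its futex_wait and lll_lock_wait_sketch helpers), the compare-and-swap
variant those architectures use looks roughly like this; the saving
described above is that on an LL/SC machine the failed comparison
branches out before reaching the store-conditional:

static void
lll_cond_lock_cas_sketch (atomic_int *futex)
{
  int expected = 0;
  /* Store 2 only if the lock is currently free (0); if it is already
     held, skip the store entirely and go to the wait loop.  */
  if (!atomic_compare_exchange_strong_explicit (futex, &expected, 2,
                                                memory_order_acquire,
                                                memory_order_relaxed))
    lll_lock_wait_sketch (futex);
}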
> >> Does this analysis make sense?
> > I can't comment on POWER because I haven't obtained nor seen any data on
> > this.  Not sure what branch prediction and speculative execution would
> > make out of this.
> >
> > However, another thing to note from a correctness perspective is that,
> > depending on the calling code, it *can make a difference* whether you do
> > the exchange or the CAS: if the exchange uses a release barrier, then if
> > the replacement CAS's load avoids the store, you will also not get the
> > effect of the release barrier.  Which can matter depending on how you
> > build the calling algorithm.  For the POWER case, you could try to
> > counter that with an appropriate barrier in the failure path of the CAS.
> > Anyway, it's not a simple replacement.
> 
> But this change is specific to 'acquire' semantics and, at least for POWER,
> it won't generate a memory barrier (isync is not a memory barrier).  I don't see
> any correctness issue with this change for this specific usage.

Well, it currently is acquire, I agree, and in that case whether we store
or not does not make a difference.  However, depending on how we interpret
trylock semantics, lock acquisition (not in the lowlevellock per se, but
when lowlevellock is used for pthread_mutex_lock) might have to have
release+store semantics.  So, while this doesn't matter for how we have
lowlevellock implemented now (at least on Power and in the generic
version), it might in the future.

So, yes, this was a general comment.  But it could apply to lowlevellock
in the future, so we should keep this in mind (preferably in a comment
in the code, if we decide to keep the CAS).
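
In C11 terms, and again building on the earlier sketches, the concern is
roughly the following; the release ordering here is hypothetical, since
the current operation is acquire-only, and the compensating fence is
just one possible shape of the barrier mentioned above:

static void
cond_lock_xchg_rel_sketch (atomic_int *futex)
{
  /* The exchange always performs the store, so under a hypothetical
     release requirement the release part always takes effect.  */
  if (atomic_exchange_explicit (futex, 2, memory_order_acq_rel) != 0)
    lll_lock_wait_sketch (futex);
}

static void
cond_lock_cas_rel_sketch (atomic_int *futex)
{
  int expected = 0;
  if (!atomic_compare_exchange_strong_explicit (futex, &expected, 2,
                                                memory_order_acq_rel,
                                                memory_order_relaxed))
    {
      /* The failed CAS performed no store, so no release operation took
         place; on POWER, the lwsync a release store would imply has
         been skipped.  An explicit barrier on this path is one way to
         try to compensate.  */
      atomic_thread_fence (memory_order_release);
      lll_lock_wait_sketch (futex);
    }
}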

> >
> >> Also, I'm not sure which is better for x86_64 or other
> >> architectures.
> > The specific read-modify-write should be faster.  The actual operation is
> > easier for the chip to see, the window where something else can
> > interfere should be smaller, etc.  I haven't done recent measurements,
> > but a few years ago a CAS increment loop scaled significantly worse than
> > an atomic increment on x86.
> >
> Right, and I also don't have numbers for powerpc to check whether the read-modify-write
> scales better.  Do you recall how you evaluated the CAS loop scaling on x86_64?  I would
> like to check on powerpc.

IIRC, I just measured throughput on a shared integer counter.  So,
several threads increment, and you measure how throughput of increments
scales with an increasing number of threads.
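
A minimal sketch of such a benchmark (thread and iteration counts are
made up; build with something like 'gcc -O2 -pthread'): each variant
increments a shared counter from several threads, and the wall-clock
time per thread count gives the scaling curve:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
#define MAX_THREADS 8

static atomic_long counter;

static void *
inc_fetch_add (void *arg)
{
  (void) arg;
  for (long i = 0; i < ITERS; i++)
    atomic_fetch_add_explicit (&counter, 1, memory_order_relaxed);
  return NULL;
}

static void *
inc_cas_loop (void *arg)
{
  (void) arg;
  for (long i = 0; i < ITERS; i++)
    {
      long old = atomic_load_explicit (&counter, memory_order_relaxed);
      /* On failure the CAS refreshes 'old' with the current value, so
         the loop retries with old + 1 recomputed.  */
      while (!atomic_compare_exchange_weak_explicit (&counter, &old,
                                                     old + 1,
                                                     memory_order_relaxed,
                                                     memory_order_relaxed))
        ;
    }
  return NULL;
}

static double
run (void *(*fn) (void *), int nthreads)
{
  pthread_t t[MAX_THREADS];
  struct timespec a, b;
  atomic_store (&counter, 0);
  clock_gettime (CLOCK_MONOTONIC, &a);
  for (int i = 0; i < nthreads; i++)
    pthread_create (&t[i], NULL, fn, NULL);
  for (int i = 0; i < nthreads; i++)
    pthread_join (t[i], NULL);
  clock_gettime (CLOCK_MONOTONIC, &b);
  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int
main (void)
{
  for (int n = 1; n <= MAX_THREADS; n *= 2)
    printf ("%d thread(s): fetch_add %.3fs, cas-loop %.3fs\n",
            n, run (inc_fetch_add, n), run (inc_cas_loop, n));
  return 0;
}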

