This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [RFC] nptl: use compare and exchange for lll_cond_lock


On 10-10-2014 09:33, Torvald Riegel wrote:
> On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
>> On 25-09-2014 21:20, Adhemerval Zanella wrote:
>>> On 25-09-2014 15:00, Torvald Riegel wrote:
>>>> On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
>>>>> While checking the generated code and macros used in the generic lowlevellock.h,
>>>>> I noted powerpc and other arches use a compare and swap instead of a plain
>>>>> exchange on lll_cond_lock.
>>>>> I am not really sure which behavior would be desirable, since as far as I could
>>>>> tell they will both have the same side effects (since lll_cond_lock, different
>>>>> from lll_lock, does not hold value of '1').
>>>> What do you mean by "[the function] does not hold value of '1'"?
>>> Bad wording in fact, I mean the 'futex' used in lll_cond_lock.
>>>
>>>>> So I am proposing this patch to sync the default implementation with what most
>>>>> architectures (ia64, ppc, s390, sparc, x86, hppa) use for lll_cond_lock.  I see
>>>>> that only microblaze and sh (I am not sure about this one, I am not well versed in
>>>>> its assembly and am being guided by its comment) use atomic_exchange_acq
>>>>> instead.
>>>> I think both versions work from a correctness POV, but doing an
>>>> unconditional exchange should be the right thing to do.
>>>>
>>>> The default implementation of __lll_lock_wait will test if the futex
>>>> variable equals 2, and if not, do an exchange right away before running
>>>> the FUTEX_WAIT syscall.  So if the CAS that you propose fails, the next
>>>> thing that will happen is an exchange.  Thus, it seems that we should do
>>>> the exchange right away.
>>>>
>>>> Thoughts?
>>> The only 'advantage' I see in using the compare-and-exchange version is that it might
>>> be an optimization on architectures that use LL/SC instead of a CAS instruction.  For
>>> instance, on POWER the exchange version is translated to:
>>>
>>> 	li      r9,2
>>> 1:	lwarx   r10,0,r3,1
>>> 	stwcx.  r9,0,r3
>>> 	bne-    1b
>>> 	isync
>>>
>>> And for compare and exchange:
>>>
>>> 	li      r10,2
>>> 	li      r9,0
>>> 1:	lwarx   r8,0,r3,1
>>> 	cmpw    r8,r9
>>> 	bne     2f
>>> 	stwcx.  r10,0,r3
>>> 	bne-    1b
>>> 2:	isync
>>>
>>> So in contended cases, if the lock is already taken, it avoids the store (which on
>>> POWER8 costs at least 10 cycles, often more).
>> Does this analysis make sense?
> I can't comment on POWER because I haven't obtained nor seen any data on
> this.  Not sure what branch prediction and speculative execution would
> make out of this.
>
> However, another thing to note from a correctness perspective is that
> depending on the calling code, it *can make a difference* whether you do
> the exchange or the CAS: If the exchange uses a release barrier, then if
> the replacement CAS' load avoids the store, you will also not get the
> effect of the release barrier.  Which can matter depending on how you
> build the calling algorithm.  For the POWER case, you could try to
> counter that with an appropriate barrier in the failure path of the CAS.
> Anyway, it's not a simple replacement.

But this change is specific to 'acquire' semantics and, at least on POWER,
it won't generate a memory barrier (isync is not a memory barrier).  I don't see
any correctness issue with this change for this specific usage.

>
>> Also, I'm not sure which is better for x86_64 or other
>> architectures.
> The specific read-modify-write should be faster.  The actual operation is
> easier for the chip to see, the window where something else can
> interfere should be smaller, etc.  I haven't done recent measurements,
> but a few years ago a CAS increment loop scaled significantly worse than
> an atomic-increment on x86.
>
Right, and I also don't have numbers for powerpc to check whether the read-modify-write
scales better.  Do you recall how you evaluated the CAS loop scaling on x86_64?  I would
like to check on powerpc.

