This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: PowerPC: libc single-thread lock optimization
- From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 29 Apr 2014 13:49:57 -0300
- Subject: Re: PowerPC: libc single-thread lock optimization
- Authentication-results: sourceware.org; auth=none
- References: <5343F8F1 dot 4000400 at linux dot vnet dot ibm dot com> <535ECADE dot 2050004 at linux dot vnet dot ibm dot com> <20140428214938 dot 3B10F2C3A13 at topped-with-meat dot com> <535ED72A dot 5060203 at linux dot vnet dot ibm dot com> <1398788543 dot 32485 dot 1139 dot camel at triegel dot csb>
On 29-04-2014 13:22, Torvald Riegel wrote:
> On Mon, 2014-04-28 at 19:33 -0300, Adhemerval Zanella wrote:
>> On 28-04-2014 18:49, Roland McGrath wrote:
>>> Heretofore sysdeps/CPU/bits/atomic.h is for pure CPU-based implementations.
>>> In a few cases there exists a sysdeps/unix/sysv/linux/CPU/bits/atomic.h as
>>> well because it needs to use kernel support.
>>>
>>> This is something somewhere in between: you are not depending directly on
>>> specific facilities outside the pure CPU facilities; but you are depending
>>> on library infrastructure and associated assumptions that do not hold in
>>> the general case of using the atomic macros in arbitrary contexts.
>>> Furthermore, you are defining SINGLE_THREAD_P to depend on NPTL
>>> implementation details. IMHO neither of these things belong in a
>>> sysdeps/CPU/bits/atomic.h file.
>>>
>>> The lowlevellock.h change doesn't have those issues, so I'd suggest you
>>> send that separately and it should go in easily.
>>>
>>>
>>> Thanks,
>>> Roland
>>>
>> I tend to agree with you, however the x86_64 implementation (sysdeps/x86_64/bits/atomic.h)
>> itself relies on NPTL definitions (the check using (offsetof (tcbhead_t, multiple_threads)))).
>> And the idea of changing the atomic.h it to simplify the logic to add this optimization:
>> instead of changing the macros in lowlevellock.h and other atomic usage, it implements
>> the optimization on the atomic itself.
> I agree with Roland that atomic.h shouldn't have the optimization; to
> me, the strongest reason is that we might need atomics that actually
> synchronize independently of whether we have spawned a thread or used
> cancellation. Also, having this optimization in the atomics will make
> it harder to move to, say, C11 atomics; we'd have to keep the wrappers.
It does not solve the case for internal usage of atomic operations (for instance,
atomic_compare_and_exchange_val_acq in malloc code). The idea of this patch is to
mimic x86 optimization, not to refactor atomic case. So the general idea I got so
far is: this should not be in atomic primitives, but x8_64 does it anyway.
>
>> I bring this about x86 because usually it is the reference implementation and sometimes puzzles
>> me that copying the same idea in another platform raise architectural question. But I concede
>> that the reference itself maybe had not opted for best solution in first place.
>>
>> So if I have understand correctly, is the optimization to check for single-thread and opt to
>> use locks is to focused on lowlevellock solely? If so, how do you suggest to other archs to
>> mimic x86 optimization on atomic.h primitives? Should other arch follow the x86_64 and
>> check for __libc_multiple_threads value instead? This could be a way, however it is redundant
>> in mostly way: the TCP definition already contains the information required, so there it no
>> need to keep track of it in another memory reference. Also, following x86_64 idea, it check
>> for TCB header information for sysdeps/CPU/bits/atomic.h, but for __libc_multiple_threads
>> in lowlevellock.h. Which is correct guideline for other archs?
> >From a synchronization perspective, I think any single-thread
> optimizations belong into the specific concurrent algorithms (e.g.,
> mutexes, condvars, ...)
> * Doing the optimization at the lowest level (ie, the atomics) might be
> insufficient because if there's indeed just one thread, then lots of
> synchronization code can be a lot more simpler than just avoiding
> atomics (e.g., avoiding loops, checks, ...).
> * The mutexes, condvars, etc. are what's exposed to the user, so the
> assumptions of whether there really no concurrency or not just make
> sense there. For example, a single-thread program can still have a
> process-shared condvar, so the condvar would need to use
> synchronization.
Follow x86_64 idea, this optimization is only for internal atomic usage for
libc itself: for a process-shared condvar, one will use pthread code, which
is *not* built with this optimization.
> * We currently don't have other intra-process sources of concurrency
> than NPTL threads, but if we get another source (e.g., due to trying to
> support accelerators), we'd likely need low-level synchronization to
> communicate with the other thing -- and this would be independent of
> whether we have threads.
This optimization does not apply to pthread code.
> I'm wondering whether we should move the x86 atomics special case into
> the concurrent algorithms. Otherwise, C implementations of
> synchronization operations that using atomics for cross-process
> synchronization won't work I guess. We currently don't have those on
> x86 I believe (use of just lowlevellock should be fine), but I suppose
> things won't stay this way.
>