This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PowerPC: Sync pthread_once with default implementation


On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
> This patch removes the arch specific powerpc implementation and instead
> uses the linux default one.  Although the current powerpc implementation
> already contains the required memory barriers for correct
> initialization, the default implementation shows a better performance on
> newer chips.
> 
> I checked the default implementations with some other tests for powerpc
> and everything looks ok.  The only nit that was puzzling me was code
> difference I saw checking if a load acquire instruction sequence would
> yield better performance than the relaxed load plus acquire memory fence.
> From Torvald Riegel comment:
> 
> > Okay.  If lwsync is preferable for an acquire load, it might be best to
> > check the GCC atomics implementation because I believe it's following
> > http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
> > isync for acquire loads.
> 
> I rechecked ISA documentation and some literature and the only inhibitor
> about using an 'lwsync' as an acquire barrier is because it should not
> be used on 'Write Through Required and Caching Inhibited' storage location,
> which AFAIK are used only in restricted kernel areas.  In fact, load/store
> with reservations does not work on such kind of memory attributes,  so for
> general userland lock mechanism 'lwsync' is safe to be used as acquire
> memory fence.
> 
> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
> 'lwsync', so a relaxed load plus memory acquire fence will be translated
> to 'ld; lwsync'.  Two advantage by using 'ld; cmp; bc; isync' for
> acquire loads is 1) it is more strict, thus safer (although still not
> really required) 2) is it performs much better some chips (POWER6), while
> showing very similar performance on other POWER platforms.

Thanks for looking into this.

> I will see if it is worth to change the current core for acquire load/
> release store, at least for POWER.  The only chip that I saw a big improvement
> of doing it is on POWER6.

Sounds good.  I think that in the long term, having atomic.h provide
acquire loads and release stores wouldn't be bad.  I'd prefer doing that
instead of running with a POWER6-specific pthread_once variant.

> Also, this testcase itself is single-thread and think it would be worth to
> extend it to check contention on multithread cases as well.

Probably.  Although given that the contented case happens once per
pthread_once instance, it could be rather hard to trigger; to increase
the chance that one hits it often, one would need to involve other
synchronization, which could skew the results.

The patch is OK in my opinion.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]