Since Alan is asleep I will try to answer. =M should imply the
memory clobber, but it would not hurt to be explicit. lwarx/stwcx. are
indexed forms, and r0 as the 2nd parm means no index, just a base address
in the 3rd register parm. The 4th lwarx parm (MUTEX_HINT_ACQ) is a cache
line optimization.
But the important part is replacing the atomic_increment macro, which
does not include any memory barrier, with an explicit atomic add with a
leading release (___lll_rel_instr) barrier.
With this patch pthread_once implements the required acquire / release
semantics.
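
To illustrate the difference in portable terms, here is a rough sketch
using C11 atomics rather than the PowerPC inline asm from the patch. The
names (once_state, payload, publish, try_read) are made up for this
example and are not glibc's; the point is only that a relaxed increment
gives atomicity without ordering, while a release increment paired with
an acquire load makes the initializer's earlier stores visible to the
reader.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for illustration; not glibc's actual data layout. */
static atomic_int once_state;   /* 0 = init not yet published */
static int payload;             /* plain data written by the initializer */

/* Atomic, but imposes no ordering on surrounding loads and stores.
   This is roughly what an increment without any barrier provides. */
static int increment_relaxed(atomic_int *p)
{
    return atomic_fetch_add_explicit(p, 1, memory_order_relaxed) + 1;
}

/* Atomic add with release ordering: stores made before this call
   cannot become visible after the increment does. */
static int increment_release(atomic_int *p)
{
    return atomic_fetch_add_explicit(p, 1, memory_order_release) + 1;
}

/* Initializer side: write the data, then publish with release. */
static void publish(int value)
{
    payload = value;
    increment_release(&once_state);
}

/* Reader side: the acquire load pairs with the release increment, so a
   nonzero state guarantees the payload store is visible.
   Returns 1 and fills *out if published, 0 otherwise. */
static int try_read(int *out)
{
    if (atomic_load_explicit(&once_state, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

If publish() used increment_relaxed() instead, a reader on a weakly
ordered machine such as POWER could see the nonzero state and still read
a stale payload, which is the kind of failure the leading release
barrier is there to prevent.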