This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][BZ 18743] PowerPC: Fix a race condition when eliding a lock


On Sun, 2015-08-23 at 14:16 +0200, Torvald Riegel wrote:
> On Thu, 2015-08-20 at 15:48 -0500, Peter Bergner wrote:
> > Let's look at the following test case:
> > 
> >   long
> >   foo (long dest, long src0, long src1, long tries)
> >   {
> >     long i;
> >     for (i = 0; i < tries; i++)
> >       {
> >         if (__builtin_tbegin (0))
> > 	  {
> > 	    dest = src0 + src1;
> > 	    if (dest == 13)
> > 	      __builtin_tabort(0);
> > 	    __builtin_tend (0);
> > 	  }
> >       }
> >     return dest;
> >   }
> > 
> > If I compile this with -O3 -mcpu=power8 -S, I see the "src0 + src1"
> > expression hoisted out of the loop, which I can accept as "ok", since
> > both src0 and src1 are constant for the whole function.

Slight update to the above.  I had too may test cases laying around and
I seem to have confused the output of one test case with the source of
another.  The actual source for what I saw above did not have the if ()
around the __builtin_tbegin (0), so we always executed the transaction
body.  That said, looking at the asm for the source above, the add of
src0 and src1 is still hoisted out of the loop.


> Yes.  src0 and src1 are not observable to other threads nor the tbegin,
> so we're back at sequential code for these.  So, no matter what tbegin
> returns, the result for these will be the same.

Agreed.


> > However, I
> > am also seeing the assignment of dest as well as the compare of dest
> > with 13 also being moved outside the tbegin/tend pair.  This doesn't
> > seem correct to me, since the update to dest should only ever be
> > observable if one or more of the transactions succeeds.

Looking at the real asm for the test case above (and not the one with
out the if ()), we get basically the following C code:

long
foo (long dest, long src0, long src1, long tries)
{
  long i;

  if (tries < 0)
    return dest;

  tmp = src0 + src1;  // expression hoisted out of loop
  cond = tmp == 13;   // comparison and branch moved out of loop
  if (cond)           // cloned loop bodies for dest == and != 13
    {
      // src0 + src1 == 13
      for (i=0; i < tries; i++)
        {
          if (!__builtin_tbegin (0))
            continue;
          __builtin_tabort (0);
          __builtin_tend (0);
          dest = 13;
        }
        return dest;
    }
  else
    {
      // src0 + src1 != 13
      for (i=0; i < tries; i++)
        {
          if (!__builtin_tbegin (0))
            continue;
          __builtin_tend (0);
          dest = tmp;
        }
        return dest;
    }
}


> I think doing the compare outside is fine.

Agreed.


> The assignment to dest could be done outside the txn if this does not
> mess with the other code path for when tbegin returns false.  

Looking at the two loops the optimizer created, I have to agree with
you on this one too.  In the first loop (src0 + src1 == 13), the tbegin.
is guaranteed to fail, since either it will fail due to some internal
conflict or we'll unconditionally execute the tabort.  This will cause
us skip the dest = 13 statement.  In the second loop, if the tbegin.
fails, we'll skip the dest = tmp statement.  If it succeeds, we'll
execute the tend. and dest = tmp statements which is what we expect.
So again, I agree with you.  When I was first looking at the generated
asm, I was thinking that the update of dest was somehow always being
performed, but I see that is not the case.

That said, if I change the code so that src0 and src1 are pointers
and the setting of dest is changed to:

    dest = *src0 + *src1;

I'm seeing the loads of *src0 and *src1 hoisted out of the loop,
which is clearly bad!  This happens because the TM hardware instructions
are not marked as memory barriers.  This is true for POWER, Intel and
S390!



> To summarize, I think tbegin needs to have lock acquisition semantics
> and an unknown return value, and tend needs to have lock release
> semantics.
> 
> Does everyone agree with this reasoning?

I agree.  Since POWER's and I also believe Intel's and S390's TM
hardware instructions have at least those semantics, I believe the
only thing the compiler needs to do is enforce a memory barrier on
those instruction patterns, so the compiler won't schedule loads and
stores across those instructions, correct?

My patch to do this on POWER fixes the unit test cases I have and I
assume the TLE issue we ran into.  Intel and S390 will also need a
similar patch.  I'll mention that as part of my patch, I created a
__TM_FENCE__ macro so the user can tell whether the TM insn patterns
have been fixed to act as fences or not.


> Second, beyond concurrency, we have to consider whether we get
> additional constraints on code generation because of txn aborts.

The only additional constraint I can think of that might be useful,
is to mark the tabort. as noreturn.  However, noreturn is a function
attribute, so we can't really attach it to a TM insn pattern.
Even if we could, it wouldn't make sense to attach it to our
conditional transaction abort instructions.  Probably users should just
follow their __builtin_tabort (0) uses with __builtin_unreachable().

I will mention that I did end up adding memory barriers to our tabort
and conditional tabort insn patterns.  It's "safe" and I assume not
much of a performance issue, since aborting a transaction should be
an uncommon occurrence and the tabort is also probably slow too.
I also had to add it to our tsuspend and tresume patterns too, since
we don't want loads/stores moving from a transaction area to a
non-transaction area and vice versa, so it made sense to just
attach the memory barrier to all our transaction insn patterns.



> Peter (or someone else), could you take care of documenting this
> reasoning (or whatever reasoning we agree on) in the GCC sources for the
> HTM builtins (ie, Power, s390, and Intel)?  I can review such a patch,
> and can also help with the wording if necessary.  Thanks!

I'll take a stab at adding this for power and we'll see how generic
it is and whether we can just add it as is or not for Intel and S390.

Thanks for your input!

Peter





Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]