This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Lock elision test results


On Wed, 2013-07-03 at 14:07 +0200, Dominik Vogt wrote:
> On Wed, Jul 03, 2013 at 01:03:52PM +0200, Torvald Riegel wrote:
> > On Wed, 2013-07-03 at 08:53 +0200, Dominik Vogt wrote: 
> > > On Tue, Jul 02, 2013 at 02:18:04PM +0200, Torvald Riegel wrote:
> > > > On Fri, 2013-06-14 at 12:26 +0200, Dominik Vogt wrote:
> > > > >   barrier
> > > > > 
> > > > > thread 2:
> > > > > 
> > > > >   barrier
> > > > >   get start timestamp
> > > > >   while thread 1 has not finished
> > > > >     lock m1
> > > > >     increment c3
> > > > >     unlock m2
> > > > >   get end timestamp
> > > > > 
> > > > > Performance is measured in loops of thread 2 divided by the time
> > > > > taken.
> > > > 
> > > > Why just thread 2's time?  Without looking at thread 1's performance too
> > > > we can't distinguish throughput from fairness effects.
> > > 
> > > As the threads are synchronized they take the same amount of time.
> > 
> > Right, I missed that.  Nonetheless, thread 2 can do more work depending
> > on whether it gets to acquire m1 more frequently than thread 1 or not.
> > If thread 2 does acquire m1 only infrequently for some reason, this will
> > decrease thread 1's time but it doesn't mean that we're now necessarily
> > faster -- we might just do less work overall, so less throughput.  Or if
> > thread 2 holds m1 most of the time (or, acquires it frequently), then
> > thread 1's time will be larger but we also do more work.
> > 
> > IOW, I think we need to report the iterations of thread 2 too, or let
> > thread 2 do a fixed number of iterations and measure the time it needs
> > for those.
> 
> I don't see what difference that makes.

Well, if thread 1 runs n iterations and thread 2 runs n2 iterations,
then the overall throughput is, simplified, n+n2.  Roughly, if we take
longer to run the n iterations, yet n2 is bigger at the same time, we may
still have the same throughput.  Or if we need less time, but n2 is lower,
we may end up with the same throughput as well.  That's why I think we
need to consider the
iterations of thread 2 as well.
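
To make that concrete, here's a minimal sketch of that kind of
measurement (compile with -pthread).  It is not the test program under
discussion; the fixed two-second run and all of the names are made up,
it only shows reporting both threads' counts and the combined
throughput (n1 + n2) / elapsed:

  /* Two threads hammer the same mutex for a fixed wall-clock interval;
     report both per-thread iteration counts and the combined throughput. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <time.h>

  static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
  static atomic_int stop;
  static long counts[2];
  static long shared_counter;

  static void *
  worker (void *arg)
  {
    long *my_count = arg;
    while (!atomic_load_explicit (&stop, memory_order_relaxed))
      {
        pthread_mutex_lock (&m1);
        shared_counter++;               /* the protected work */
        pthread_mutex_unlock (&m1);
        (*my_count)++;
      }
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t[2];
    struct timespec ts0, ts1, delay = { 2, 0 };

    clock_gettime (CLOCK_MONOTONIC, &ts0);
    for (int i = 0; i < 2; i++)
      pthread_create (&t[i], NULL, worker, &counts[i]);
    nanosleep (&delay, NULL);
    atomic_store (&stop, 1);
    for (int i = 0; i < 2; i++)
      pthread_join (t[i], NULL);
    clock_gettime (CLOCK_MONOTONIC, &ts1);

    double elapsed = (ts1.tv_sec - ts0.tv_sec)
                     + (ts1.tv_nsec - ts0.tv_nsec) * 1e-9;
    printf ("n1=%ld n2=%ld combined=%.0f iterations/s\n",
            counts[0], counts[1], (counts[0] + counts[1]) / elapsed);
    return 0;
  }

Whether one thread starves the other then shows up directly in n1 vs. n2
instead of being hidden in a single number.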

> But regardless of whether
> <n> is 2000, 5000, 10000 or 20000, thread 2 always completes
> roughly the same number of iterations per second.

Do you mean the same number of iterations for each value of <n>, or the
same number of iterations across all these values of <n>?

> Also, if I let
> thread 2 execute a fixed number of iterations and terminate the
> loop in thread 1 when thread 2 is done, the number of iterations
> per second is the same.

Sorry, the same as what?

> > > Speaking about the actual test result:
> > > 
> > > 1. In (4), thread 1 has a high abort ratio, i.e. it needs lots of
> > >    additional cpu cycles and the whole test runs much longer than
> > >    it would without aborts.
> > > 2. Thread 2 does not benefit much from the additional run time,
> > >    the number of iterations is roughly the same as without
> > >    elision.
> > 
> > Does it have a high abort rate too?  If not, it should benefit from the
> > additional runtime that thread 1 effectively gives to thread 2.
> 
> I don't think it has lots of aborts other than the ones that are
> caused by the mutex being locked.  To me it looks like the aborts
> caused by the write-write conflict always hit thread 1, but I don't
> know for sure at the moment.  When thread 2 aborts because the
> mutex was locked, it blocks on the lock and waits for thread 1 to
> complete before it can continue.
> 
> > What's the ratio between the increase in thread 2's iterations and the
> > increase in thread 1's iterations?
> 
> See above.
> 
> A quick printf shows that thread 2 does about
> 
> (1) 14000*n
> (2) 13500*n
> (3) 14700*n
> (4)  6800*n
> 
> iterations in the time it takes thread 1 to do one outer iteration.

The outer iteration is one iteration in the "repeat <n>" loop?  If so,
this would be surprising, I believe.  The current parameters of the
adaptation algorithms should lead to thread 1 acquiring m1 without
elision after a few retries (same applies to m2, so worst case two times
a few retries).  Once it does that, it won't ever touch m1, so thread 2
can work without disturbances (unless there is some false sharing
somewhere).  At that point we should effectively be in the same situation
as in case (3); thus, the difference is surprising to me.  Do you have any
explanations or other guesses?
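
For reference, the kind of per-lock adaptation I mean looks roughly like
the sketch below.  This is not the actual glibc code; adapt_count,
SKIP_AFTER_ABORT, the back-off value of 3 and try_elide are invented for
illustration, and try_elide just stands in for the TBEGIN-based fast
path:

  /* After an aborted elision attempt the lock is acquired normally and
     elision is skipped for the next few acquisitions, so a lock that
     keeps aborting degrades to the plain lock path. */
  #include <pthread.h>
  #include <stdbool.h>

  struct elided_mutex
  {
    pthread_mutex_t m;   /* the normal lock */
    int adapt_count;     /* > 0: skip elision for this many acquisitions */
  };

  enum { SKIP_AFTER_ABORT = 3 };   /* illustrative tunable, not the real one */

  /* Stand-in for the transactional fast path: returns true if the
     critical section ran elided; false models an abort or a held lock. */
  static bool
  try_elide (struct elided_mutex *l)
  {
    (void) l;
    return false;
  }

  void
  elided_lock (struct elided_mutex *l)
  {
    if (l->adapt_count > 0)
      {
        l->adapt_count--;               /* still backing off after an abort */
        pthread_mutex_lock (&l->m);
        return;
      }
    if (try_elide (l))
      return;                           /* running elided */
    l->adapt_count = SKIP_AFTER_ABORT;  /* abort: avoid elision for a while */
    pthread_mutex_lock (&l->m);
  }

The point is just that after a couple of aborts thread 1 should stay on
the normal path for a while, which is why I'd expect (4) to behave like
(3) once the adaptation has kicked in.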

> > > > > Test execution
> > > > > --------------
> > > > > 
> > > > > The test is run ten times each with four different versions and
> > > > > setups of glibc:
> > > > > 
> > > > > (1) current glibc without elision patches (2506109403de)
> > > > > (2) glibc-2.15
> > > > > (3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
> > > > > (4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision
> > > > > 
> > > > > The best results of all runs for each glibc setup are compared.
> > > > 
> > > > What's the variance between the results from different test runs?  At
> > > > least having min/max/avg would be good, or perhaps the median, or
> > > > something like that.
> > > 
> > > It's a bit difficult to answer this without posting absolute
> > > numbers.  The range of measurements is large enough to be
> > > irritating in all test setups, even in the early morning when
> > > nobody but me works on the machine.
> > 
> > What do you mean by "irritating"?
> 
> By irritating I mean that I have no explanation for them.
> 
> In test 2 the range is much larger, and sometimes the performance
> of a single test setup is roughly 50% or 200% of the "normal"
> performance.  However, I never see 30%, 60%, 80% or 120%.
> I guess that once a test program runs into unfavourable timing
> conditions between the threads, it has trouble getting out of them
> again.
> 
> > > > If it did, you'd almost never abort and would just measure
> > > > thread 2's performance, so why the overhead?
> > > 
> > > > Are aborts extremely costly on your architecture?
> > > 
> > > Yes.
> > 
> > That's interesting.  Can you give a rough range of how costly they are?
> > Or what is happening internally?
> 
> A full list of the steps during transaction abort handling can be
> found on page 5-100 of the "z/Architecture, Principles of
> Operation" (SA22-7832-09).  It's available somewhere on the web.
> In short, the cpu cleans up the cache (rolling back transactional
> stores and committing nontransactional stores), restores the PC
> and the general purpose registers (specified by TBEGIN) and writes
> a transaction diagnostic block which is a block of 256 bytes that
> contains the abort address, abort reason, a copy of the general
> purpose registers at the point of the abort (the caller
> provides a pointer to the tdb if he wants one).  All this is done
> by millicode.
> 
> There's a paper describing the HTM hardware implementation on z:
> 
>   http://dl.acm.org/citation.cfm?id=2457483
> 
> There are details about the implementation of the instructions on
> pages 32 and 33.  TBEGIN, TBEGINC and TEND are implemented in
> hardware, TABORT, ETND and PPA are implemented as millicode and
> thus slower.  I have access to the internal document that lists
> the cycles these instructions may take, but I must not quote from
> it.
> 
> Maybe I can say that TABORT (and probably a normal abort too)
> takes an order of magnitude more time than a TBEGIN, which in
> turn takes an order of magnitude more time than TEND.  Roughly:
> 
>   TABORT: very expensive
>    >>
>   TBEGIN: expensive
>   ETND:   expensive (extract transaction nesting depth)
>    >>
>   TEND:   cheap

Thanks, that's helpful for understanding what's going on.

> 
> I heard that ETND execution time is going to be "fixed" in the
> future.
> 
> Also, an abort does not restore floating point registers, so there
> is an additional overhead before TBEGIN to save the non-clobbered
> floating point registers and to restore them after TEND when
> necessary.
> 
> Overall, I'd say a simple TBEGIN-TEND pair around a single
> instruction is expensive enough that it is only worthwhile if it can
> prevent a cache miss (cache ping-pong).

So, elision can be more efficient if it succeeds on the first attempt
and the lock is contended (so elision can avoid the cache miss on the
lock)?  (This doesn't consider that our current mutexes with elision
don't do any spinning, so if the lock is contended, it's likely costlier
on the non-elision path too because we block on the futex).
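
For completeness, below is a rough sketch of the transactional fast path
we're talking about, using GCC's S/390 HTM builtins (needs -mhtm and a
zEC12 or later).  This is not the glibc elision code; run_elided and the
surrounding policy are invented for illustration.  If I read the GCC
docs right, __builtin_tbegin also emits the FP-register save/restore you
mention, and __builtin_tbegin_nofloat omits it.

  /* Read the lock word inside the transaction so that a real lock
     acquisition by the other thread aborts us; on success we never
     write the lock word, so there is no cache-line ping-pong on it. */
  #if defined (__s390__) && defined (__HTM__)
  #include <htmintrin.h>
  #include <stdbool.h>

  /* Returns true if the critical section ran and committed
     transactionally; false if the caller should take the lock normally. */
  static bool
  run_elided (volatile int *lock_word, long *counter)
  {
    if (__builtin_tbegin ((void *) 0) == _HTM_TBEGIN_STARTED)
      {
        if (*lock_word != 0)
          {
            /* Lock is held.  A real implementation might abort here so
               the adaptation logic can react; given the abort costs you
               describe, just ending the empty transaction may be
               cheaper on z.  */
            __builtin_tend ();
            return false;
          }
        ++*counter;                     /* the critical section */
        __builtin_tend ();
        return true;
      }
    /* Any abort lands here with a nonzero condition code.  */
    return false;
  }
  #endif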


