Re: Lock elision test results


On Thu, 2013-07-04 at 11:29 +0200, Dominik Vogt wrote:
> On Wed, Jul 03, 2013 at 03:18:04PM +0200, Torvald Riegel wrote:
> > On Wed, 2013-07-03 at 14:07 +0200, Dominik Vogt wrote:
> > > On Wed, Jul 03, 2013 at 01:03:52PM +0200, Torvald Riegel wrote:
> > > I don't see what difference that makes.
> > 
> > Well, if thread 1 runs n iterations and thread 2 runs n2 iterations,
> > then the overall throughput is, simplified, n+n2.  Roughly, if we take
> > longer to run n iterations, yet n2 is bigger at the same time, we have
> > the same throughput.  Or if we need less time, but n2 is lower, then
> > same throughput as well.  That's why I think we need to consider the
> > iterations of thread 2 as well.
> 
> > > But regardless of whether
> > > <n> is 2000, 5000, 10000 or 20000, thread 2 always completes
> > > roughly the same number of iterations per second.
> > 
> > Do you mean the same number of iterations for each value of <n>, or the
> > same number of iterations across all these values of <n>?
> 
> The same number of iterations per second, no matter what value
> <n> is.

So no matter how much time thread 2 has before it's signaled to stop by
thread 1, it always gets the same amount of work done?  That doesn't
seem right.  Eventually, it should get more work done.

> > > > > Speaking about the actual test result:
> > > > > 
> > > > > 1. In (4), thread 1 has a high abort ratio, i.e. it needs lots of
> > > > >    additional cpu cycles and the whole test runs much longer than
> > > > >    it would without aborts.
> > > > > 2. Thread 2 does not benefit much from the additional run time,
> > > > >    the number of iterations is roughly the same as without
> > > > >    elision.
> > > > 
> > > > Does it have a high abort rate too?  If not, it should benefit from the
> > > > additional runtime that thread 1 effectively gives to thread 2.
> > > 
> > > I don't think it has lots of aborts other than the ones that are
> > > caused by the mutex being locked.  To me it looks like the aborts
> > > caused by the write-write conflict always hit thread 1, but I don't
> > > know for sure at the moment.  When thread 2 aborts because the
> > > mutex was locked, it blocks on the lock and waits for thread 1 to
> > > complete before it can continue.
> > > 
> > > > What's the ratio between the increase in thread 2's iterations and the
> > > > increase in thread 1's iterations?
> > > 
> > > See above.
> > > 
> > > A quick printf shows that thread 2 does about
> > > 
> > > (1) 14000*n
> > > (2) 13500*n
> > > (3) 14700*n
> > > (4)  6800*n
> > > 
> > > iterations at the same time thread 1 does one outer iteration.
> > 
> > The outer iteration is one iteration in the "repeat <n>" loop?
> 
> Yes.
> 
> > If so, this would be surprising, I believe.
> 
> > The current parameters of the
> > adaptation algorithms should lead to thread 1 acquiring m1 without
> > elision after a few retries (same applies to m2, so worst case two times
> > a few retries).
> 
> Agreed.
> 
> > Once it does that, it won't ever touch m1, so thread 2
> > can work without disturbances (unless there is some false sharing
> > somewhere).
> 
> Yes, because both threads use real locks.
> 
> So, while thread 1 does one iteration of 
> 
>   waste a minimal amount of cpu
> 
>   waste some cpu
> 
> thread 2 does
> 
>   lock m1
>   increment c3
>   unlock m1
>  
> It looks like one iteration of thread 2 takes about six times more
> cpu than an iteration of thread 1 (without elision)

But it doesn't look like it should because it's not doing more work; do
you know the reason for this?

> and with
> elision it takes about 14 times more cpu.

It might be useful to have a microbenchmark that tests at which length
of critical sections the overhead of the transactional execution is
amortized (e.g., if we assume that the lock is contended, or is not, or
with a certain probability).  We need to model performance in some way
to be able to find robust tuning parameters, and find out which kind of
tuning and tuning input we actually need.  Right now we're just looking
at the aborts; it seems that for z at least, the critical section length
should also be considered.
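
Something like the following sketch is what I have in mind for the
uncontended case (the loop counts and the timing are just illustrative
choices, nothing taken from glibc); running it once with elision enabled
and once with it disabled should show at which critical section length
the transactional overhead is amortized:

/* Sketch only: measure how the per-lock cost changes with the length
   of the critical section, for the uncontended case.  Run it once
   with elision enabled and once with it disabled and compare; the
   loop counts and the clock_gettime timing are illustrative choices,
   not anything taken from glibc.  */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000UL

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static volatile unsigned long sink;

/* Busy work of configurable length inside the critical section.  */
static void
critical_section (unsigned long len)
{
  for (unsigned long i = 0; i < len; i++)
    sink++;
}

static double
run (unsigned long len)
{
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (unsigned long i = 0; i < ITERS; i++)
    {
      pthread_mutex_lock (&m);
      critical_section (len);
      pthread_mutex_unlock (&m);
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int
main (void)
{
  for (unsigned long len = 1; len <= 4096; len *= 2)
    printf ("len=%lu total=%fs\n", len, run (len));
  return 0;
}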

> > At which point we should have the option to do the same as
> > in case (3);  thus, the difference is surprising to me.  Do you have any
> > explanations or other guesses?
> 
> Hm, with the default tuning values (three attempts with elision
> and three with locks), *if* thread 1 starts using locks, thread 2
> would
> 
>   try to elide m1 <------------------\
>     begin a transaction              |
>     *futex != 0 ==> forced abort     |
>   acquire lock on m1                 |
>     increment counter                |
>   release lock on m1                 |
>   acquire lock on m1                 |
>     increment counter                |
>   release lock on m1                 |
>   acquire lock on m1                 |
>     increment counter                |
>   release lock on m1  ---------------/
> 
> I.e. for three successful locks you get one aborted transaction.
> This slows down thread 2 considerably.  Actually I'm surprised
> that thread 2 does not lose more.
> 
> What I do not understand is why thread 1 starts aborting
> transactions at all.  After all there is no conflict in the
> write sets of both threads.  The only aborts should occur because
> of interrupts.  If, once the lock is used, unfortunate timing
> conditions force the code to not switch back to elision (because
> one of the threads always uses the lock and forces the other one
> to lock too), that would explain the observed behaviour.  But
> frankly that looks too unlikely to me, unless I'm missing some
> important facts.

Yes, maybe it's some kind of convoying issue.  Perhaps add some
performance counters to your glibc locally to see which kind of aborts
you get?  If using TLS and doing the observations in the
non-transactional path, it shouldn't interfere with the experiment too
much.
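
Roughly like this (the names are just made up for illustration, not
existing glibc symbols; the point is that the counters are thread-local
and only updated outside of transactions):

/* Sketch only: per-thread abort counters.  The names are invented for
   illustration and are not existing glibc symbols.  The counters live
   in TLS so the two threads never write to shared cache lines, and
   they are updated only outside of transactions so the transactional
   fast path is left untouched.  */
#include <stdio.h>

static __thread unsigned long elision_tries;
static __thread unsigned long aborts_lock_busy;  /* lock word was set  */
static __thread unsigned long aborts_other;      /* conflict, interrupt, ...  */

/* Call before attempting to start a transaction.  */
static void
note_elision_try (void)
{
  elision_tries++;
}

/* Call from the fallback path after an aborted transaction.  */
static void
note_abort (int lock_was_busy)
{
  if (lock_was_busy)
    aborts_lock_busy++;
  else
    aborts_other++;
}

/* Dump at thread exit (or from the test after joining).  */
static void
dump_abort_counters (const char *who)
{
  fprintf (stderr, "%s: tries=%lu busy-aborts=%lu other-aborts=%lu\n",
           who, elision_tries, aborts_lock_busy, aborts_other);
}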

> > > > > > What's the variance between the results from different test runs?  At
> > > > > > least having min/max/avg would be good, or perhaps the median, or
> > > > > > something like that.
> > > > > 
> > > > > It's a bit difficult to answer this without posting absolute
> > > > > numbers.  The range of measurements is large enough to be
> > > > > irritating in all test setups, even in the early morning when
> > > > > nobody but me works on the machine.
> > > > 
> > > > What do you mean by "irritating"?
> > > 
> > > By irritating I mean that I have no explanation for them.
> 
> The explanation for that is probably the machine architecture.
> Our system partition has eight cpus.  Six (or five?) cpus are on
> the same chip.  So, if some thread is executed by a cpu on a
> different chip, memory latency is much higher.  This effect could
> explain the constant factor I see.

Yes.  Can you try pinning the threads to CPUs?
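
For example, with pthread_setaffinity_np (the CPU numbers below are
arbitrary; pick ones you know to be on the same chip, respectively on
different chips, on your machine):

/* Sketch only: pin the calling thread to one CPU.  Which CPU numbers
   map to which chip is machine-specific; the numbers used here are
   arbitrary examples.  */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int
pin_self_to_cpu (int cpu)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  return pthread_setaffinity_np (pthread_self (), sizeof (set), &set);
}

/* E.g. call pin_self_to_cpu (0) at the start of thread 1 and
   pin_self_to_cpu (1) at the start of thread 2, then repeat with CPUs
   on different chips to see how much the latency matters.  */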

