This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Lock elision test results
- From: Dominik Vogt <vogt@linux.vnet.ibm.com>
- To: libc-alpha@sourceware.org
- Date: Tue, 9 Jul 2013 08:03:10 +0200
- Subject: Re: Lock elision test results
- References: <20130614102653.GA21917@linux.vnet.ibm.com> <1372767484.22198.4505.camel@triegel.csb> <20130703065355.GA4522@linux.vnet.ibm.com> <1372849432.14172.20.camel@triegel.csb> <20130703120753.GA8416@linux.vnet.ibm.com> <1372857484.14172.195.camel@triegel.csb> <20130704092941.GA12864@linux.vnet.ibm.com> <1372937162.14172.1030.camel@triegel.csb>
- Reply-to: libc-alpha@sourceware.org
On Thu, Jul 04, 2013 at 01:26:02PM +0200, Torvald Riegel wrote:
> On Thu, 2013-07-04 at 11:29 +0200, Dominik Vogt wrote:
> > The same number of iterations per second, no matter what value
> > <n> is.
>
> So no matter how long thread 2 has time before it's signaled to stop by
> thread 1, it always gets the same amount of work done? That doesn't
> seem right. Eventually, it should get more work done.
Not the same amount of work but the same amount of work *per
second*.
> > So, while thread 1 does one iteration of
> >
> > waste a minimal amount of cpu
...
> >
> > thread 2 does
> >
> > lock m1
> > increment c3
> > unlock m1
> >
> > It looks like one iteration of thread 2 takes about six times more
> > cpu than an iteration of thread 1 (without elision)
>
> But it doesn't look like it should because it's not doing more work; do
> you know the reason for this?
Well, thread 1 just increments a counter in a loop while thread 2
additionally calls pthread_mutex_lock() and -_unlock(). That could
certainly explain the factor.
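The two loops being compared can be sketched like this (a minimal sketch with made-up names; the real test harness, counters c1/c3 and the stop signal from thread 1 are reduced to the essentials):

```c
/* Thread 1 spins on a plain counter; thread 2 takes a mutex around its
   increment, so each of its iterations additionally pays the
   pthread_mutex_lock()/-_unlock() overhead. */
#include <pthread.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static volatile int stop;
static volatile unsigned long c1, c3;

static void *thread1(void *arg)
{
    while (!stop)
        c1++;                      /* waste a minimal amount of cpu */
    return NULL;
}

static void *thread2(void *arg)
{
    while (!stop) {
        pthread_mutex_lock(&m1);   /* per-iteration overhead ... */
        c3++;
        pthread_mutex_unlock(&m1); /* ... that thread 1 does not pay */
    }
    return NULL;
}
```

Comparing the growth rates of c1 and c3 over a fixed wall-clock interval gives the per-iteration cost ratio discussed above.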
> > and with
> > elision it takes about 14 times more cpu.
>
> It might be useful to have a microbenchmark that tests at which length
> of critical sections the overhead of the transactional execution is
> amortized (e.g., if we assume that the lock is contended, or is not, or
> with a certain probability).
That always depends on the abort ratio of the transaction and thus
on what other threads are doing.
> We need to model performance in some way
> to be able to find robust tuning parameters, and find out which kind of
> tuning and tuning input we actually need. Right now we're just looking
> at the aborts; it seems that for z at least, the critical section length
> should also be considered.
Make a suggestion for another test and I'll happily hack and run
it.
> > > At which point we should have the option to do the same as
> > > in case (3); thus, the difference is surprising to me. Do you have any
> > > explanations or other guesses?
> >
> > Hm, with the default tuning values (three attempts with elision
> > and three with locks), *if* thread 1 starts using locks, thread 2
> > would
> >
> > try to elide m1 <------------------\
> > begin a transaction |
> > *futex != 0 ==> forced abort |
> > acquire lock on m1 |
> > increment counter |
> > release lock on m1 |
> > acquire lock on m1 |
> > increment counter |
> > release lock on m1 |
> > acquire lock on m1 |
> > increment counter |
> > release lock on m1 ---------------/
> >
> > I.e. for three successful locks you get one aborted transaction.
> > This slows down thread 2 considerably. Actually I'm surprised
> > that thread 2 does not lose more.
> >
> > What I do not understand is why thread 1 starts aborting
> > transactions at all. After all there is no conflict in the
> > write sets of both threads. The only aborts should occur because
> > of interrupts. If, once the lock is used, unfortunate timing
> > conditions force the code to not switch back to elision (because
> > one of the threads always uses the lock and forces the other one
> > to lock too), that would explain the observed behaviour. But
> > frankly that looks too unlikely to me, unless I'm missing some
> > important facts.
>
> Yes, maybe it's some kind of convoying issue. Perhaps add some
> performance counters to your glibc locally to see which kind of aborts
> you get? If using TLS and doing the observations in the non-txnal path,
> it shouldn't interfere with the experiment too much.
My local gcc patch will soon be in a useable state so that I can
get some decent profiling information, but as instrumentation is
somewhat invasive (at least outside transactions), there's no
guarantee that it can pin down what's happening.
As a side note, using TLS in the lock elision patch (for thread
debugging) considerably slows down the pthread_mutex_... functions,
as fetching the TLS pointer is slow.
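The kind of per-thread abort counters suggested above could look roughly like this (a hedged sketch: all names and the status encoding are made up, and this is not glibc's actual elision code; the point is that the counters are bumped only in the fallback path, outside any transaction, so the instrumentation cannot itself cause aborts):

```c
#include <stdio.h>

/* One per-thread counter per abort cause of interest. */
static __thread unsigned long aborts_conflict;
static __thread unsigned long aborts_capacity;
static __thread unsigned long aborts_other;

/* Called from the non-txnal fallback path with the abort status the
   hardware reported (the encoding is machine specific; illustrative
   values are used here). */
static void count_abort(int status)
{
    if (status == 1)
        aborts_conflict++;
    else if (status == 2)
        aborts_capacity++;
    else
        aborts_other++;
}

/* Dump this thread's counters, e.g. at thread exit. */
static void dump_abort_counters(void)
{
    fprintf(stderr, "aborts: conflict=%lu capacity=%lu other=%lu\n",
            aborts_conflict, aborts_capacity, aborts_other);
}
```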
> > The explanation for that is probably the machine architecture.
> > Our system partition has eight cpus. Six (or five?) cpus are on
> > the same chip. So, if some thread is executed by a cpu on a
> > different chip, memory latency is much higher. This effect could
> > explain the constant factor I see.
>
> Yes. Can you try with pinning the threads to CPUs?
Not on the shared machine, but I can get a slot on a dedicated
testing machine where this can be controlled. This will take some
time and effort for preparation though, and before I get the slot,
I need to know what I want to test exactly.
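For the pinning itself, the GNU extension pthread_setaffinity_np should be enough; a small helper might look like this (CPU numbers are illustrative):

```c
/* Pin a thread to a single CPU so that both threads of the test run on
   known cores (same chip vs. different chips).  Linux/GNU specific. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the given thread's affinity mask to one CPU; returns 0 on
   success, an error number otherwise. */
static int pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```

E.g. pin_to_cpu(pthread_self(), 0) for thread 1 and pin_to_cpu(t2, 1) for thread 2, choosing CPU numbers once it is known which CPUs share a chip.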
Ciao
Dominik ^_^ ^_^
--
Dominik Vogt
IBM Germany