
Re: Lock elision test results


On Tue, Jul 02, 2013 at 02:18:04PM +0200, Torvald Riegel wrote:
> On Fri, 2013-06-14 at 12:26 +0200, Dominik Vogt wrote:
> > thread 1:
> > 
> >   barrier
> >   repeat <n> times
> 
> We need to know how large n and m (below) are.  At least a rough range.

<n> is a number large enough to guarantee a total run time of
several seconds.  Its exact value does not matter as long as it is
sufficiently large to level out one-time effects in the test.  The
test was run with <n> = 10000.

<m> is large enough that serious work is done in the loop and
small enough that only a marginal number of aborts is caused by
scheduling interrupts.  The actual choice for <m> in the test was
100000.

> >     lock m1
> >     lock m2
> >     increment c1
> >     unlock m1
> >     increment c2
> >     repeat <m> times
> >       waste a minimal amount of cpu
> >     unlock m2
> >     signal that thread 1 has finished its work
> 
> I suppose that this isn't part of the loop (but it's indented the same
> way as the loop body)?

Of course.  I hope to get clearance to post the source code soon;
then there will be no room for typing errors when describing the
test cases.
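Until then, here is a minimal C sketch of thread 1 as described
above.  This is not the actual test source; N, M, the mutex and
counter names, and the "finished" flag are invented for
illustration:

  /* Sketch only, not the real test.  N and M match the values
     given above; all names are illustrative.  */
  #include <pthread.h>

  #define N 10000    /* outer iterations */
  #define M 100000   /* inner busy loop */

  pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
  pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
  pthread_barrier_t barrier;   /* initialized for both threads */
  volatile unsigned long c1, c2;
  int t1_done;

  static void *
  thread1 (void *arg)
  {
    pthread_barrier_wait (&barrier);
    for (int i = 0; i < N; i++)
      {
        pthread_mutex_lock (&m1);
        pthread_mutex_lock (&m2);
        c1++;
        pthread_mutex_unlock (&m1);
        c2++;
        /* Waste a minimal amount of cpu.  */
        for (volatile int j = 0; j < M; j++)
          ;
        pthread_mutex_unlock (&m2);
      }
    /* Signal that thread 1 has finished (not part of the loop).  */
    __atomic_store_n (&t1_done, 1, __ATOMIC_RELEASE);
    pthread_barrier_wait (&barrier);
    return NULL;
  }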

> >   barrier
> > 
> > thread 2:
> > 
> >   barrier
> >   get start timestamp
> >   while thread 1 has not finished
> >     lock m1
> >     increment c3
> >     unlock m1
> >   get end timestamp
> > 
> > Performance is measured in loops of thread 2 divided by the time
> > taken.
> 
> Why just thread 2's time?  Without looking at thread 1's performance too
> we can't distinguish throughput from fairness effects.

As the threads are synchronized, they take the same amount of time.
Speaking of the actual test results:

1. In (4), thread 1 has a high abort ratio, i.e. it needs lots of
   additional cpu cycles, and the whole test runs much longer than
   it would without aborts.
2. Thread 2 does not benefit much from the additional run time;
   the number of iterations is roughly the same as without
   elision.
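For completeness, here is the matching sketch of thread 2 plus a
driver, continuing the sketch of thread 1 above (the clock choice
and the printed metric are again illustrative):

  #include <stdio.h>
  #include <time.h>

  volatile unsigned long c3;

  static void *
  thread2 (void *arg)
  {
    struct timespec ts0, ts1;
    unsigned long loops = 0;

    pthread_barrier_wait (&barrier);
    clock_gettime (CLOCK_MONOTONIC, &ts0);
    while (!__atomic_load_n (&t1_done, __ATOMIC_ACQUIRE))
      {
        pthread_mutex_lock (&m1);
        c3++;
        pthread_mutex_unlock (&m1);
        loops++;
      }
    clock_gettime (CLOCK_MONOTONIC, &ts1);
    pthread_barrier_wait (&barrier);   /* matching end barrier */

    /* The reported number: loops of thread 2 divided by the time
       taken.  */
    double secs = (ts1.tv_sec - ts0.tv_sec)
                  + (ts1.tv_nsec - ts0.tv_nsec) / 1e9;
    printf ("%.0f loops/s\n", loops / secs);
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t1, t2;
    pthread_barrier_init (&barrier, NULL, 2);
    pthread_create (&t1, NULL, thread1, NULL);
    pthread_create (&t2, NULL, thread2, NULL);
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    return 0;
  }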

> > Test execution
> > --------------
> > 
> > The test is run ten times each with four different versions and
> > setups of glibc:
> > 
> > (1) current glibc without elision patches (2506109403de)
> > (2) glibc-2.15
> > (3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
> > (4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision
> > 
> > The best results of all runs for each glibc setup are compared.
> 
> What's the variance between the results from different test runs?  At
> least having min/max/avg would be good, or perhaps the median, or
> something like that.

It's a bit difficult to answer this without posting absolute
numbers.  The range of measurements is large enough to be
irritating in all test setups, even in the early morning when
nobody but me works on the machine.  The measurements differ by
about 5% with glibc-2.15 and by less than that in the other tests.
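The relative spread can be reported without absolute numbers,
e.g. with a trivial helper like this (illustrative, not part of
the test), fed with the results of the ten runs of one setup:

  #include <stdio.h>

  /* Print min/max/avg and the relative spread of n results.  */
  static void
  report (const double r[], int n)
  {
    double min = r[0], max = r[0], sum = 0.0;
    for (int i = 0; i < n; i++)
      {
        if (r[i] < min) min = r[i];
        if (r[i] > max) max = r[i];
        sum += r[i];
      }
    printf ("min %.2f  max %.2f  avg %.2f  spread %.1f%%\n",
            min, max, sum / n, 100.0 * (max - min) / min);
  }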

> > The result for (1) is the reference (i.e. 100%).  Higher values
> > mean higher relative performance.
> > 
> > Result
> > ------
> > 
> > (1) unpatched  : 100.00%
> > (2) old glibc  : 101.83%
> > (3) elision off:  77.87%
> 
> What causes this 25% overhead?

Not known yet.

> Does it also occur with just a single thread,

Test 1 is meaningless with just one thread, but I have another
test case (test 5) on my disk that measures single-thread
performance.  With the elision patches but elision turned off,
performance is about 94% to 95% of the unpatched performance, and
with elision turned on it's about 35% to 48%.  (The critical
section is minimal.)
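For illustration, a single-thread test of this kind boils down to
something like the following sketch (not the real test 5 source;
the iteration count and names are invented):

  /* Single-thread lock/unlock throughput, minimal critical
     section.  */
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  int
  main (void)
  {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    volatile unsigned long c = 0;
    struct timespec ts0, ts1;
    const long iters = 10000000;

    clock_gettime (CLOCK_MONOTONIC, &ts0);
    for (long i = 0; i < iters; i++)
      {
        pthread_mutex_lock (&m);
        c++;                    /* minimal critical section */
        pthread_mutex_unlock (&m);
      }
    clock_gettime (CLOCK_MONOTONIC, &ts1);

    double secs = (ts1.tv_sec - ts0.tv_sec)
                  + (ts1.tv_nsec - ts0.tv_nsec) / 1e9;
    printf ("%.0f lock/unlock pairs per second\n", iters / secs);
    return 0;
  }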

> or is it a difference in cache misses?

To be investigated.

> > (4) elision on :  29.37%
> > 
> > The abort ratio in (4) is >= 75% on thread 1
> 
> Why does it abort?

That remains to be investigated further.  With the hardware
sampling on z it's difficult to get precise results about
transactions because a sample needs to hit an abort at exactly
the right cycle.  I see only a couple of aborts in the trace (too
few to draw conclusions), but that does not mean they are not
there, only that they are difficult to sample.

I'm working on transaction profiling patches for gcc, but these
will introduce a big overhead for the given test because the
routines to trace tbegin, tend and aborts take much more time than
the actual test code.
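As a stopgap, aborts can be counted in-line with gcc's s390 HTM
builtins (requires -mhtm and a zEC12 or later).  A sketch of the
idea, with invented counter names; like the profiling patches,
this is of course intrusive as well:

  #include <htmintrin.h>

  static unsigned long tx_tries, tx_aborts;

  /* Elide one counter increment and count how often the
     transaction aborts.  Sketch only: a real elision path would
     retry and fall back to taking the lock.  */
  static inline void
  counted_increment (volatile unsigned long *counter)
  {
    tx_tries++;
    if (__builtin_tbegin ((void *) 0) == _HTM_TBEGIN_STARTED)
      {
        (*counter)++;
        __builtin_tend ();
        return;
      }
    /* Nonzero condition code: the transaction aborted.  */
    tx_aborts++;
  }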

> Did you experiment with different tuning params (ie,
> the fields in elision_conf)?

No.

> >  and < 1% on thread 2.
> 
> Did thread 2 use elision often, or not?

That's difficult to count without disturbing the test.

> If it did, you almost never
> abort, and you just measure thread 2's performance, then why the
> overhead?

> Are aborts extremely costly on your architecture?

Yes.

> Do you get a lot more cache misses somewhere else?

See above; I need better profiling tools to answer that.

> What's the performance of running just thread 2 without thread 1?  IOW,
> how large are the single-thread overheads of elision on your arch/impl?

See above.

> Do you have tests with larger critical sections?  Right now your
> critical sections, esp. the one in thread 2, are extremely small.

The current set of tests is meant to check corner cases of the
implementation.  It's completely legitimate that you ask for
results with larger critical sections, and of course I have run
such tests.  The results are that elision basically works, and
that synthetic micro-benchmarks with large critical sections tell
you absolutely nothing about real-life performance.  It's possible
to write code that benefits a lot from elision, and it's also
possible to write code that is harmed equally.

If you can make a sketch of a relevant test case with large
critical sections, I'm happy to implement and run it, but at the
moment I've no idea what a relevant test would look like.

> > Test 2 (nested locks)
> > ...

I'll split the discussion about the other tests into separate
messages.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

