This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] Lock elision implementation guidelines
- From: Dominik Vogt <vogt at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Mon, 10 Jun 2013 13:43:46 +0200
- Subject: Re: [RFC] Lock elision implementation guidelines
- References: <1360527652 dot 3065 dot 11521 dot camel at triegel dot csb> <1370618459 dot 16968 dot 11300 dot camel at triegel dot csb>
- Reply-to: vogt at linux dot vnet dot ibm dot com
> Focus on PNT, allow E
>
> Enable LE conservatively. Provide tuning knobs that are not part of the
> ABI.
> Pro: Hopefully little harm done by LE implementation deficiencies (need
> to be conservative with semantics, but performance tuning doesn't need
> to be perfect right away).
> Con: No stable way to tune LE until a later time.
The attached program suffers a massive performance loss caused by
lock elision. It runs two threads: the first has a repeated, big
workload, while the second just increments a counter repeatedly.
Two mutexes are used.
(The test was run on a zEC12 with the elision patches ported to z
architecture.)
thread 1:
---------
for (...)
    lock A
    lock B
    increment counter 1
    unlock A
    handle big workload
    increment counter 3
    unlock B

thread 2:
---------
forever
    lock A
    increment counter 2
    unlock A
(For completeness: there might be a third thread that runs at very
low frequency and uses lock B to query counter 3. That thread need
not be very responsive, i.e. lock B is not superfluous.)
In a real-life test, the runtime of the program with elision is
almost identical to the runtime without elision, using plain locks.
But in the same amount of time I get 1247 million increments of
counter 2 without elision and only 660 million _with_ elision,
i.e. roughly 47% fewer.
The explanation is that, because of lock collisions, mutex A is
eventually switched to a real lock instead of an elided one.
Thread 1 then elides lock B. From this point on I'm not entirely
sure what happens, as I'm new to the nptl code: either thread 2
waits on lock A until thread 1 releases lock B, or, as thread 2
tries to lock A, it aborts the transaction in thread 1.
The point is: For certain program logic, lock elision can be
very bad for performance.
P.S.: The test was run with the default tuning values; I think it
switches to real locks after three failed transactions, and back
to transactions after three uses of the real lock. I'm quite sure
that, given _any_ set of tuning parameters, I can write a test
program that shows similarly bad performance with elision.
Ciao
Dominik ^_^ ^_^
--
Dominik Vogt
IBM Germany
/* ---------------------------- included header files ---------------------- */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* ---------------------------- local definitions -------------------------- */
#define NUM_PASSES 100000
#define NUM_PASSES_2 100000
/* ---------------------------- local macros ------------------------------- */
/* ---------------------------- imports ------------------------------------ */
/* ---------------------------- included code files ------------------------ */
/* ---------------------------- local types -------------------------------- */
/* Align the mutexes so they do not share a cache line.  */
__attribute__ ((aligned(256))) pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
__attribute__ ((aligned(256))) pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;
long c1 = 0;      /* counter 1: incremented by thread 1 under both locks */
long c1b = 0;     /* counter 3: incremented by thread 1 under lock B only */
long c2 = 0;      /* counter 2: incremented by thread 2 under lock A */
int t1_done = 0;
volatile long j;  /* volatile so the busy loop is not optimized away */
/* ---------------------------- forward declarations ----------------------- */
/* ---------------------------- local variables ---------------------------- */
/* ---------------------------- exported variables (globals) --------------- */
/* ---------------------------- local functions ---------------------------- */
static void *_thread_do_work_1(void *ptr)
{
    long i;

    pthread_barrier_wait(&barrier);
    for (i = 0; i < NUM_PASSES; i++)
    {
        pthread_mutex_lock(&m1);
        pthread_mutex_lock(&m2);
        c1++;
        pthread_mutex_unlock(&m1);
        c1b++;
        /* big workload, executed while still holding m2 */
        for (j = 0; j < NUM_PASSES_2; j++)
        {
        }
        pthread_mutex_unlock(&m2);
    }
    t1_done = 1;
    pthread_barrier_wait(&barrier);
    pthread_exit((void *)0);
}
static void *_thread_do_work_2(void *ptr)
{
    pthread_barrier_wait(&barrier);
    while (t1_done == 0)
    {
        pthread_mutex_lock(&m1);
        c2++;
        pthread_mutex_unlock(&m1);
    }
    pthread_barrier_wait(&barrier);
    pthread_exit((void *)0);
}
/* ---------------------------- interface functions ------------------------ */
int main(int argc, char **argv)
{
    int rc;

    rc = pthread_barrier_init(&barrier, NULL, 2);
    if (rc != 0)
    {
        perror("pthread_barrier_init");
        exit(rc);
    }
    {
        pthread_t tid[2];
        int i;

        pthread_create(&tid[0], NULL, _thread_do_work_1, NULL);
        pthread_create(&tid[1], NULL, _thread_do_work_2, NULL);
        for (i = 0; i < 2; i++)
        {
            void *retval;

            pthread_join(tid[i], &retval);
        }
    }
    printf("c1 %ld/%ld, c2 %ld, j %ld\n", c1, c1b, c2, j);

    return 0;
}