RFC: shared-memory synchronization benchmarking in glibc


I'd like to get feedback on how we build glibc's synchronization-related
microbenchmarks, in particular regarding two aspects: (1) how to best
emulate real-world workloads and (2) how to best fit them into the
existing microbenchmark suite.

Attached is the patch I've been playing with, which uses this approach
for rwlock benchmarking.


=== Workloads

As you can see in the attached patch, I'm using random accesses to both
a thread-shared and a thread-private memory region to simulate work done
between calls to synchronization primitives.  For example, in the rwlock
case, threads access thread-private data between reader critical sections
and the shared data within critical sections.  "Work" is thus represented
as invoking a Marsaglia xorshift RNG followed by a memory access,
repeated a certain number of times.

This gives the memory system more to do than if we were just, for
example, spinning until a certain number of CPU cycles has passed.  It
also allows one to introduce cache misses by varying the parameters: for
instance, the 16 MiB thread-private region in the patch is larger than
many last-level caches, whereas the 1 KiB shared region stays
cache-resident.

Is this a reasonable way of emulating real-world workloads for
concurrent code, from the perspective of your hardware?
Any suggestions for how to improve this?
Any suggestions regarding scenarios that should be covered?
I'm looking both for scenarios that can significantly affect
synchronization performance (e.g., stressing the memory system more so
that cache misses are harder to hide performance-wise) and for scenarios
that are representative of common real-world workloads (e.g., should
there be more dependencies, such as would arise when navigating linked
data structures?  A sketch of such a workload follows below).
What do you use in your own testing?
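
To illustrate the linked-data-structures question, here is a minimal
sketch (not part of the patch; the function name is made up) of a
pointer-chasing variant of the "work" loop.  Every load depends on the
value of the previous one, so the hardware cannot overlap the resulting
cache misses:

  /* Hypothetical pointer-chasing work.  MEM is assumed to hold a random
     permutation of the indices 0 .. SIZE-1 (SIZE a power of two), so
     each loaded value is the next index to visit.  */
  static size_t
  chase_pointers (unsigned int *mem, size_t size, size_t steps,
      uint32_t *rng)
  {
    size_t idx = pseudo_rng (rng) & (size - 1);
    for (size_t i = 0; i < steps; i++)
      /* The address of the next load depends on the value loaded here,
         which serializes the whole chain of accesses.  */
      idx = __atomic_load_n (mem + idx, __ATOMIC_RELAXED) & (size - 1);
    return idx;
  }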

I'd also be interested in seeing microbenchmarks that back up the
assumptions we make.  For example, it would be nice to show cases in
which the acquire hints on Power result in a performance gain, and cases
in which they do not (so the rest of the glibc developers get a better
idea of where to use the hint and where not to use it).
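
As a strawman for such an assumption-checking benchmark, one could time
two otherwise identical loops that differ only in the memory order used
(a sketch for the benchmark's main function, using the bench-timing.h
macros; the iteration count is arbitrary, and the Power hint itself
would be exercised through the respective synchronization operation
rather than a plain load):

  unsigned int flag = 0;
  unsigned int sum = 0;
  timing_t start, stop, elapsed;

  TIMING_NOW (start);
  for (long int i = 0; i < 100000000; i++)
    sum += __atomic_load_n (&flag, __ATOMIC_ACQUIRE);
  TIMING_NOW (stop);
  TIMING_DIFF (elapsed, start, stop);
  /* Print SUM so the loop cannot be optimized away, then repeat with
     __ATOMIC_RELAXED and compare the two ELAPSED values.  */
  printf ("%u %.0f\n", sum, (double) elapsed);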

A first target for improved benchmarking would be lock elision
performance, I believe.


=== Making it fit into the microbenchmark framework

I'd like to hear ideas and suggestions for how to best integrate this
into the microbenchmark suite.  The difference from many of the current
tests is that it is not sufficient to just run a function in a tight
loop.  We need to look at many more workloads (e.g., long vs. short
critical sections, different numbers of threads, ...).

That means we need to collect a lot more data, and present it in a
better way.  In the past (for another project), I've used a simple
MongoDB database to store JSON objects representing benchmark results
and then ran queries across them that fed into visualization (e.g.,
using gnuplot).  Is this something that somebody else is already working
on?
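
For the storage side, the patch below already contains commented-out
calls into benchtests' json-lib; a minimal sketch of emitting one result
object per run (the attribute names are illustrative only, and t and
iter refer to the values in the patch's measurement loop) could look
like this:

  json_ctx_t json_ctx;
  json_init (&json_ctx, 0, stdout);
  json_document_begin (&json_ctx);
  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
  json_attr_object_begin (&json_ctx, "rwlock");
  json_attr_double (&json_ctx, "threads", t);
  json_attr_double (&json_ctx, "iterations_per_second",
                    (double) iter / duration);
  json_attr_object_end (&json_ctx);
  json_document_end (&json_ctx);

Such objects could be inserted into a document store as-is and queried
later.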

Doing that would mean collecting more information about when/where/... a
benchmark run happened.  Any thoughts on that?

It would also be useful if benchmarks were easier to run manually so
that developers (and users) could experiment more easily with different
workloads.  However, that would mean we need to be able to set benchmark
parameters when invoking a benchmark, instead of only producing
benchmark binaries that have them baked in.  Any thoughts or preferences
on how to best do that?
commit 7554bd4a042d6dda0ec8d34224682c4b9124a8bd
Author: Torvald Riegel <triegel@redhat.com>
Date:   Tue Jan 10 13:27:44 2017 +0100

    Add synchronization microbenchmark.

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 81edf8a..60aef3d 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -25,7 +25,7 @@ bench-math := acos acosh asin asinh atan atanh cos cosh exp exp2 log log2 \
 	      modf pow rint sin sincos sinh sqrt tan tanh fmin fmax fminf \
 	      fmaxf
 
-bench-pthread := pthread_once
+bench-pthread := pthread_once multithread
 
 bench-string := ffs ffsll
 
diff --git a/benchtests/bench-multithread.c b/benchtests/bench-multithread.c
new file mode 100644
index 0000000..633deb0
--- /dev/null
+++ b/benchtests/bench-multithread.c
@@ -0,0 +1,319 @@
+/* Benchmark multithreaded operations.
+   Copyright (C) 2013-2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define __USE_POSIX
+#include <errno.h>
+#include <math.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <unistd.h>
+#include <semaphore.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <signal.h>
+
+#include "bench-timing.h"
+//#include "json-lib.h"
+
+/* Benchmark parameters.  */
+static const int duration = 5;
+static const int max_threads = 4;
+/* Must be power-of-two values.  */
+static const size_t mem_shared_kb = 1;
+static const size_t mem_private_mb = 16;
+
+/* Thread data.  */
+typedef struct {
+  bool primary;
+  sem_t work_ready;
+  size_t mem_private_elems;
+  unsigned int *mem_private;
+  size_t mem_shared_elems;
+  unsigned int *mem_shared;
+  long int iter;
+  uint32_t rng_state;
+  bool work_done;
+} thread_data_t;
+
+/* Shared data.  */
+typedef struct {
+  size_t mem_elems;
+  unsigned int *mem;
+  char pad1[128];
+  pthread_rwlock_t rwlock;
+  char pad2[128];
+} shared_data_t;
+
+shared_data_t shared_data;
+sem_t threads_ready;
+bool finished;
+
+
+typedef struct {
+  pthread_t thread;
+  thread_data_t* td;
+} thread_handle_t;
+thread_handle_t* threads;
+
+void* xmalloc (size_t s)
+{
+  void *ptr = malloc (s);
+  if (ptr == NULL)
+    {
+      fprintf (stderr, "malloc failed\n");
+      exit (1);
+    }
+  return ptr;
+}
+
+static uint32_t pseudo_rng (uint32_t *state)
+{
+  *state ^= *state << 13;
+  *state ^= *state >> 17;
+  *state ^= *state << 5;
+  return *state;
+}
+
+static void shared_work_start (void)
+{
+  pthread_rwlockattr_t attr;
+  pthread_rwlockattr_init (&attr);
+  pthread_rwlockattr_setpshared (&attr, PTHREAD_PROCESS_PRIVATE);
+  pthread_rwlockattr_setkind_np (&attr, PTHREAD_RWLOCK_PREFER_READER_NP);
+  pthread_rwlock_init (&shared_data.rwlock, &attr);
+  pthread_rwlockattr_destroy (&attr);
+}
+
+static void shared_work_stop (void)
+{
+  pthread_rwlock_destroy (&shared_data.rwlock);
+}
+
+/* If writes is 0, reads_per_write reads are run once nonetheless.  */
+static void access_mem (unsigned int *mem, size_t size,
+    size_t writes, size_t reads_per_write, uint32_t *rng)
+{
+  do
+    {
+      unsigned int sum = 0;
+      for (int j = 0; j < reads_per_write; j++)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  sum += __atomic_load_n (mem + idx, __ATOMIC_RELAXED);
+	}
+      if (writes > 0)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  __atomic_store_n (mem + idx, sum, __ATOMIC_RELAXED);
+	  writes--;
+	}
+    }
+  while (writes > 0);
+}
+
+static void do_work (thread_data_t *td)
+{
+  while (!__atomic_load_n (&td->work_done, __ATOMIC_RELAXED))
+    {
+      access_mem (td->mem_private, td->mem_private_elems, 0, 60,
+	  &td->rng_state);
+
+      pthread_rwlock_rdlock (&shared_data.rwlock);
+      access_mem (td->mem_shared, td->mem_shared_elems, 3, 20,
+	  &td->rng_state);
+      pthread_rwlock_unlock (&shared_data.rwlock);
+
+      td->iter++;
+    }
+}
+
+static void* thread_func (void* tdptr)
+{
+  /* Allocate data in the thread so that we use local memory.  */
+  thread_data_t *td = xmalloc (sizeof (thread_data_t));
+  *((thread_data_t**) tdptr) = td;
+  td->primary = (&threads[0].td == tdptr);
+  sem_init (&td->work_ready, 0, 0);
+  td->mem_private_elems = mem_private_mb * 1024 * 1024
+      / sizeof (*td->mem_private);
+  td->mem_private = xmalloc (sizeof (*td->mem_private)
+      * td->mem_private_elems);
+  td->mem_shared_elems = shared_data.mem_elems;
+  td->mem_shared = shared_data.mem;
+  td->rng_state = 1692732839;
+
+  if (td->primary)
+    {
+      /* Wait for all threads to complete initialization.  */
+      for (int i = 1; i < max_threads; i++)
+	while (sem_wait (&threads_ready) != 0) ;
+
+      /* Run all the work.  */
+      for (int t = 1; t <= max_threads; t++)
+	{
+	  // TODO update work description
+	  shared_work_start ();
+	  timing_t start, stop, real_duration;
+
+	  /* Tell participating threads to start and start our own work.  */
+	  TIMING_NOW (start);
+	  for (int i = 1; i < t; i++)
+	    {
+	      threads[i].td->work_done = false;
+	      sem_post (&threads[i].td->work_ready);
+	    }
+
+	  /* Start the alarm and our own work.  */
+	  td->work_done = false;
+	  td->iter = 0;
+	  alarm (duration);
+	  do_work (td);
+	  TIMING_NOW (stop);
+
+	  /* Wait till we're done, and gather results.  */
+	  long int iter = td->iter;
+	  for (int i = 1; i < t; i++)
+	    {
+	      while (sem_wait (&threads_ready) != 0) ;
+	      iter += threads[i].td->iter;
+	    }
+
+	  TIMING_DIFF (real_duration, start, stop);
+	  printf ("threads=%d  iter/s=% 15ld  timing_t=%.0lf\n", t,
+	      iter / duration, (double)real_duration);
+
+	  shared_work_stop ();
+	}
+      /* We're done.  Tell all others.  */
+      finished = true;
+      for (int i = 1; i < max_threads; i++)
+	sem_post (&threads[i].td->work_ready);
+    }
+  else
+    {
+      /* We are ready.  Do what we're told to do.  */
+      while (1)
+	{
+	  sem_post (&threads_ready);
+	  while (sem_wait (&td->work_ready) != 0) ;
+	  if (finished)
+	    break;
+	  td->iter = 0;
+	  do_work (td);
+	}
+    }
+  return NULL;
+}
+
+static void setup_threads (void)
+{
+  threads = xmalloc (sizeof (thread_handle_t) * max_threads);
+  sem_init (&threads_ready, 0, 0);
+  threads[0].thread = pthread_self ();
+  for (int i = 1; i < max_threads; i++)
+    {
+      if (pthread_create (&threads[i].thread, NULL, thread_func,
+	  &threads[i].td) != 0)
+	{
+	  fprintf (stderr, "pthread_create failed\n");
+	  exit (1);
+	}
+    }
+}
+
+
+static void
+alarm_handler (int signum)
+{
+  for (int i = 0; i < max_threads; i++)
+    __atomic_store_n (&threads[i].td->work_done, true, __ATOMIC_RELAXED);
+}
+
+int
+main (int argc, char **argv)
+{
+//  json_ctx_t json_ctx;
+  struct sigaction act;
+
+  if (!__atomic_always_lock_free (sizeof (unsigned int), 0))
+    {
+      printf ("no fast atomics available\n");
+      exit (1);
+    }
+  if (((mem_shared_kb & (mem_shared_kb - 1)) != 0)
+      || ((mem_private_mb & (mem_private_mb - 1)) != 0))
+    {
+      printf ("mem sizes are not power-of-two values\n");
+      exit (1);
+    }
+
+//  json_init (&json_ctx, 0, stdout);
+//
+//  json_document_begin (&json_ctx);
+//
+//  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+//
+//  json_attr_object_begin (&json_ctx, "functions");
+//
+//  json_attr_object_begin (&json_ctx, "malloc");
+//
+//  json_attr_object_begin (&json_ctx, "");
+
+
+  memset (&act, 0, sizeof (act));
+  act.sa_handler = &alarm_handler;
+
+  sigaction (SIGALRM, &act, NULL);
+
+  /* TIMING_INIT reports the timer resolution; it is not needed here.  */
+  unsigned long res;
+  TIMING_INIT (res);
+  (void) res;
+
+  shared_data.mem_elems = mem_shared_kb * 1024 / sizeof (*shared_data.mem);
+  shared_data.mem = xmalloc (sizeof (*shared_data.mem)
+      * shared_data.mem_elems);
+
+  setup_threads ();
+  thread_func (&threads[0].td);
+
+//  json_attr_double (&json_ctx, "duration", d_total_s);
+//  json_attr_double (&json_ctx, "iterations", d_total_i);
+//  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+//  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+//
+//  json_attr_double (&json_ctx, "threads", num_threads);
+//  json_attr_double (&json_ctx, "min_size", MIN_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "max_size", MAX_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "random_seed", RAND_SEED);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_document_end (&json_ctx);
+
+  return 0;
+}
