Re: RFC: shared-memory synchronization benchmarking in glibc


On 01/27/2017 08:49 PM, Adhemerval Zanella wrote:
Reviving this thread because I think we can really benefit from this kind
of workload.

On 10/01/2017 13:17, Torvald Riegel wrote:
I'd like to get feedback on how we build glibc's synchronization-related
microbenchmarks, in particular regarding two aspects: (1) How to best
emulate real-world workloads and (2) how to best fit them into the
existing microbenchmark suite.
Ondřej Bílka wrote the dryrun framework (see the post "Moving dryrun as separate project.", https://sourceware.org/ml/libc-alpha/2015-08/msg00400.html), which collects data about string-function calls in real applications. Perhaps an extended version could collect timing information for calls to synchronization functions. That information could be used to extract parameters for existing benchmarks, or to replay the recorded calls.
This way, somebody could contribute real workload information.
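For instance, such an extended dryrun might record one entry per synchronization call; a purely hypothetical record layout:

/* Purely hypothetical layout for an extended dryrun trace: one
   record per synchronization call, enough to derive parameters
   for existing benchmarks or to replay the calls later.  */
struct sync_trace_entry
{
  void *object;            /* Address of the lock/semaphore.  */
  int op;                  /* Operation, e.g. lock/unlock/rdlock.  */
  unsigned long tid;       /* Calling thread.  */
  unsigned long long ts;   /* Timestamp, e.g. from clock_gettime.  */
};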

I think we will have multiple different workloads, and some architectures
will even focus on specific aspects of performance. So IMHO, focusing
on creating a workload generator, in the same way fio [1] does for I/O,
would let us create different kinds of synthetic synchronization workloads.

[1] https://github.com/axboe/fio

A workload generator sounds really good.
But perhaps we can start with a common framework that makes it easy to create a concrete benchmark without having to set up threads, an alarm timer, and so on.
I've found an older post from Paul E. Murphy:
"[RFC] benchtest: Add locking microbenchmarks"
(https://sourceware.org/ml/libc-alpha/2015-12/msg00540.html)
"It runs
a workload for a fixed amount of time and counts the number
of iterations achieved (throughput). Test workloads (TFUNC)
are tested against each locking function (LFUNC)."

In the future, a workload generator could use those existing "LFUNC"/"TFUNC" pairs to generate a specific workload. Somebody could describe benchmarks where one thread uses a specific LFUNC/TFUNC and other threads use different ones. This could be useful for creating benchmarks that mix rwlock_rdlock/rwlock_wrlock or lock/trylock. A rough sketch of such a harness is below.
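This is what one LFUNC/TFUNC combination could look like (a sketch only; the LFUNC/TFUNC naming follows Paul's RFC, and the alarm-timer and thread setup are omitted):

/* Sketch of a throughput harness in the spirit of Paul's RFC: run a
   TFUNC under an LFUNC for a fixed wall-clock time and count
   iterations.  'timeout_expired' would be set by an alarm handler.  */
#include <pthread.h>
#include <signal.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static volatile sig_atomic_t timeout_expired;

static void do_work (void);  /* The TFUNC: simulated critical section.  */

static unsigned long
run_reader (void)  /* One LFUNC/TFUNC combination: rdlock + work.  */
{
  unsigned long iters = 0;
  while (!timeout_expired)
    {
      pthread_rwlock_rdlock (&lock);
      do_work ();
      pthread_rwlock_unlock (&lock);
      iters++;
    }
  return iters;  /* Throughput: iterations per fixed time slice.  */
}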

Attached is what I've been playing with, which uses this for rwlock
benchmarking.

The struct shared_data_t contains pad arrays of 128 bytes each.
Do you want to force the rwlock member into a separate cache line?
If so, there should be a way for an architecture to specify its
cache-line size; e.g. on s390 the cache line size is 256 bytes.
Perhaps this information can be read at runtime via sysconf(_SC_LEVEL1_DCACHE_LINESIZE), if available.

Could max_threads be determined dynamically, too?
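A minimal sketch of both runtime queries (note that a runtime value can only size dynamically allocated padding; static member alignment still needs a compile-time constant):

/* Query cache-line size and CPU count at runtime instead of baking
   them in.  _SC_LEVEL1_DCACHE_LINESIZE may be unavailable (sysconf
   returns -1 or 0), so fall back to a guess.  */
#include <unistd.h>

static long
cache_line_size (void)
{
  long sz = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
  return sz > 0 ? sz : 128;  /* Fallback; s390 would need 256.  */
}

static long
default_max_threads (void)
{
  long n = sysconf (_SC_NPROCESSORS_ONLN);
  return n > 0 ? n : 1;
}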

I saw the usage of __atomic_load_n and __atomic_store_n.
Shall we use the macros from include/atomic.h instead?
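For example (a minimal sketch, assuming the benchmark is built inside the glibc tree so that the internal include/atomic.h is picked up):

/* Sketch: glibc-internal atomics instead of raw compiler builtins.  */
#include <atomic.h>

static int shared_counter;

static void
example (void)
{
  int v = atomic_load_relaxed (&shared_counter);
  atomic_store_release (&shared_counter, v + 1);
}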

Would it make sense to call the synchronization function under test once before timing starts, or to link with -z now? Then we could avoid the one-time overhead of _dl_runtime_resolve when the function is called for the first time.
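A minimal illustration of the warm-up idea, using pthread_rwlock as an example:

#include <pthread.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void
warm_up (void)
{
  /* The first call goes through _dl_runtime_resolve; keep it outside
     the timed region.  Linking with -Wl,-z,now would avoid the lazy
     resolution altogether.  */
  pthread_rwlock_rdlock (&lock);
  pthread_rwlock_unlock (&lock);
}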

=== Workloads

As you can see in the attached patch, I'm using random accesses to both
a thread-shared and a thread-private memory region to simulate work done
between calls to synchronization primitives.  For example, in the rwlock
case, accesses to thread-private data between reader critical sections
and accesses to the shared data within critical sections.  So, "work" is
represented as invoking a Marsaglia XOR RNG followed by a memory
access, repeated a certain number of times.
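For illustration, the work loop could look roughly like this (a sketch only; xorshift32 is the classic 32-bit Marsaglia RNG, and the parameter names are placeholders):

#include <stddef.h>
#include <stdint.h>

/* Marsaglia xorshift32: cheap, deterministic "work" generator.
   The seed must be nonzero.  */
static inline uint32_t
xorshift32 (uint32_t *state)
{
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return *state = x;
}

/* Touch one random byte of REGION per iteration; SIZE controls how
   many of the accesses miss the cache.  */
static void
do_work (unsigned char *region, size_t size, uint32_t *rng, int iters)
{
  for (int i = 0; i < iters; i++)
    region[xorshift32 (rng) % size]++;
}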

This aims at giving the memory system some more stuff to do than if we
were just, for example, spinning until a certain number of CPU cycles
has passed.  Also, it allows one to introduce (cache) misses by varying
the parameters.

I think as an initial synthetic workload it should give us some meaningful
data. Ideally, I would like to have a way to select among more kinds of
workloads, with or without different memory-access patterns.


Is this a reasonable way of emulating real-world workloads for
concurrent code, from the perspective of your hardware?
Any suggestions for how to improve this?
Any suggestions regarding scenarios that should be covered?
I'm looking for both scenarios that can significantly affect
synchronization performance (eg, stressing the memory system more so
that cache misses are harder to hide performance-wise) as well as
scenarios that are representative of common real-world workloads (eg,
should there be more dependencies such as would arise when navigating
through linked data structures?).
What do you use in your own testing?

I think we should first define exactly what kind of metric we are aiming
for here. Are we measuring synchronization latency for a specific implementation
(semaphores, pthread r{d,w}lock) or just atomic operations? Are we aiming
for high or low contention? How many threads compared to the cpuset? Critical
sections that are CPU-intensive or memory-intensive? What about memory
topology and its different latencies?

That's why I think that by creating a configurable workload generator
we can then obtain different metrics, such as a workload with a high
thread-creation rate that updates atomic variables, or a pool of threads
with different thread-set sizes.


I'd also be interested in seeing microbenchmarks that show the
assumptions we make.  For example, it would be nice to show cases in
which the acquire hints on Power result in a performance gain, and where
not (so the rest of the glibc developers get a better idea where to use
the hint and where not to use it).

A first target for improved benchmarking would be lock elision
performance, I believe.


=== Making it fit into the microbenchmark framework

I'd like to hear about ideas and suggestions for how to best integrate
this into the microbenchmark suite.  The difference to many of the
current tests is that it's not sufficient to just run a function in a
tight loop.  We need to look at many more workloads (eg, long vs. short
critical sections, different numbers of threads, ...).

That means we need to collect a lot more data, and present it in a
better way.  In the past (for another project), I've used a simple
mongodb database to store json objects representing benchmark results,
and then ran queries across that fed into visualization (eg, using
gnuplot).  Is this something that somebody else is already working on?
This means you could visualize the data of one benchmark run and plot a graph of long vs. short critical sections, etc., and/or such graphs for multiple benchmark runs (e.g. different glibc versions)?
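For illustration, a result stored as a JSON object might look like this (all field names are invented):

{
  "benchmark": "bench-rwlock",
  "glibc-commit": "abc1234",
  "kernel": "4.9.0",
  "cpu": "POWER8",
  "threads": 8,
  "critical-section": "short",
  "duration-sec": 10,
  "iterations": 123456789
}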

I do not have much experience with this kind of program, but wouldn't a
simpler solution such as tinydb [1] suffice to store and query the required
data? I think that requiring a full database deployment to run these
benchmarks is too complex.

[1] https://pypi.python.org/pypi/tinydb


Doing that would mean collecting more information about when/where/... a
benchmark run happened.  Any thoughts on that?
Do you plan to collect the benchmark data on a public server, or on somebody's local machine?

I think collecting the expected default information (kernel, compiler,
cpu) plus the topology from lstopo should suffice.

Perhaps we could add the commit ID of the glibc build that was used.

It would also be useful if benchmarks were easier to run manually so
that developers (and users) can experiment more easily with different
workloads.  However, that would mean that we would need to be able to
set benchmark parameters when invoking a benchmark, and not just be able
to produce a benchmark binary that has them baked in.  Any thoughts or
preferences how to best do that?

I think configurable workloads in the form of either an ini or a json file
should be good enough.
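For example, a hypothetical workload description in JSON (all keys invented for illustration):

{
  "name": "rwlock-read-mostly",
  "threads": 16,
  "duration-sec": 10,
  "shared-region-bytes": 65536,
  "private-region-bytes": 4096,
  "thread-groups": [
    { "count": 14, "lfunc": "rwlock_rdlock", "work-iters": 100 },
    { "count": 2,  "lfunc": "rwlock_wrlock", "work-iters": 500 }
  ]
}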


