This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: RFC: shared-memory synchronization benchmarking in glibc


Reviving this thread because I think we can really benefit from this kind
of workload.

On 10/01/2017 13:17, Torvald Riegel wrote:
> I'd like to get feedback on how we build glibc's synchronization-related
> microbenchmarks, in particular regarding two aspects: (1) How to best
> emulate real-world workloads and (2) how to best fit them into the
> existing microbenchmark suite.

I think we will have multiple different workloads, and some architectures
will want to focus on specific aspects of performance. So imho focusing
on creating a workload generator, in the same way fio [1] does for I/O,
would let us create different kinds of synthetic synchronization workloads.

[1] https://github.com/axboe/fio
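
For reference, fio describes an I/O workload declaratively in a small
job file, roughly like the one below; the idea would be an analogous
description format for synchronization workloads (number of threads,
critical-section length, contention level, and so on):

[global]
ioengine=libaio
direct=1
runtime=60
time_based

[randread-4k]
rw=randread
bs=4k
iodepth=32
numjobs=4
size=1g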

> 
> Attached is what I've been playing with, which uses this for rwlock
> benchmarking.
> 
> 
> === Workloads
> 
> As you can see in the attached patch, I'm using random accesses to both
> a thread-shared and a thread-private memory region to simulate work done
> between calls to synchronization primitives.  For example, in the rwlock
> case, accesses to thread-private data between reader critical sections
> and accesses to the shared data within critical sections.  So, "work" is
> represented as invoking a Marsaglia-XOR-RNG followed by a memory
> access, a certain number of times.
> 
> This aims at giving the memory system some more stuff to do than if we
> were just, for example, spinning until a certain number of CPU cycles
> has passed.  Also, it allows one to introduce (cache) misses by varying
> the parameters.

I think as an initial synthetic workload it should give us some meaningful
data. Ideally I would like a way to select among different kinds of
workloads, with or without the extra memory accesses.
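
For concreteness, here is a rough sketch of the kind of "work" step
described above, i.e. a Marsaglia xorshift RNG picking slots in a memory
region. This is only illustrative (it is not the attached patch); the
function names and parameters are made up, and varying the region size
and access count is what would control the cache-miss rate:

#include <stdint.h>
#include <stddef.h>

/* Marsaglia xorshift32 PRNG.  */
static inline uint32_t
xorshift32 (uint32_t *state)
{
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return *state = x;
}

/* Touch NACCESSES randomly chosen words in MEM (NWORDS words long).
   The idea would be to call this on a thread-private region between
   critical sections and on the shared region inside them.  */
static void
do_work (uint32_t *rng_state, uint32_t *mem, size_t nwords,
         size_t naccesses)
{
  for (size_t i = 0; i < naccesses; i++)
    {
      size_t idx = xorshift32 (rng_state) % nwords;
      /* A read-modify-write keeps the access from being optimized
         away and exercises the cache line for both loads and stores.  */
      mem[idx]++;
    }
}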

> 
> Is this a reasonable way of emulating real-world workloads for
> concurrent code, from the perspective of your hardware?
> Any suggestions for how to improve this?
> Any suggestions regarding scenarios that should be covered?
> I'm looking for both scenarios that can significantly affect
> synchronization performance (eg, stressing the memory system more so
> that cache misses are harder to hide performance-wise) as well as
> scenarios that are representative of common real-world workloads (eg,
> should there be more dependencies such as would arise when navigating
> through linked data structures?).
> What do you use in your own testing?

I think we first should define exactly what kind of metric we are aiming
for here. Are we measuring synchronization latency for a specific
implementation (semaphores, pthread r{d,w}lock) or just atomic operations?
Are we aiming for high or low contention? How many threads compared to the
cpuset? Critical sections that are cpu-intensive or memory-intensive? What
about memory topology and its different latencies?

That's why I think that by aiming to create a configurable workload
generator we can then get different metrics, such as a workload with high
thread creation that updates atomic variables, or a pool of threads with
different thread-set sizes.

> 
> I'd also be interested in seeing microbenchmarks that show the
> assumptions we make.  For example, it would be nice to show cases in
> which the acquire hints on Power result in a performance gain, and where
> not (so the rest of the glibc developers get a better idea where to use
> the hint and where not to use it).
> 
> A first target for improved benchmarking would be lock elision
> performance, I believe.
> 
> 
> === Making it fit into the microbenchmark framework
> 
> I'd like to hear about ideas and suggestions for how to best integrate
> this into the microbenchmark suite.  The difference to many of the
> current tests is that it's not sufficient to just run a function in a
> tight loop.  We need to look at many more workloads (eg, long vs. short
> critical sections, different numbers of threads, ...).
> 
> That means we need to collect a lot more data, and present it in a
> better way.  In the past (for another project), I've used a simple
> mongodb database to store json objects representing benchmark results,
> and then ran queries across that fed into visualization (eg, using
> gnuplot).  Is this something that somebody else is already working on?

I do not have much experience with this kind of program, but wouldn't a
simpler solution such as tinydb [1] suffice to store and query the required
data? I think that requiring a full db deployment to run these benchmarks
is too complex.

[1] https://pypi.python.org/pypi/tinydb

> 
> Doing that would mean collecting more information about when/where/... a
> benchmark run happened.  Any thoughts on that?

I think collecting the expected default information (kernel, compiler,
cpu) plus the topology from lstopo should suffice.
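
As an illustration, a per-run record along these lines (the field names
and values here are invented, with placeholders where real data would go)
would be enough to reconstruct when and where a run happened:

{
  "benchmark": "rwlock-read",
  "params": { "threads": 16, "shared-bytes": 262144, "accesses": 100 },
  "environment": {
    "timestamp": "...",
    "kernel": "...",
    "compiler": "...",
    "cpu": "...",
    "lstopo": "..."
  },
  "results": { "mean-time-ns": "...", "stddev-ns": "..." }
}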

> 
> It would also be useful if benchmarks were easier to run manually so
> that developers (and users) can experiment more easily with different
> workloads.  However, that would mean that we would need to be able to
> set benchmark parameters when invoking a benchmark, and not just be able
> to produce a benchmark binary that has them baked in.  Any thoughts or
> preferences how to best do that?

I think configurable workloads in the form of either an ini or a json file
should be good enough.
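
For example, something along these lines (the keys are made up for
illustration) would let a developer rerun a scenario without rebuilding
the benchmark binary:

{
  "benchmark": "rwlock",
  "threads": 16,
  "duration-sec": 10,
  "critical-section": { "shared-bytes": 262144, "accesses": 50 },
  "non-critical-section": { "private-bytes": 65536, "accesses": 200 }
}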

