
Re: [PATCH v2] benchtests: Add malloc microbenchmark


On Tue, Jun 10, 2014 at 05:30:15AM +0530, Siddhesh Poyarekar wrote:
> On Mon, Jun 09, 2014 at 10:33:26PM +0200, Ondřej Bílka wrote:
> > The problem is that this benchmark does not measure multithreaded
> > performance well. Just spawning many threads does not say much; my
> > guess is that locking will quickly cause convergence to a state
> > where each core runs a thread with its own separate arena.
> 
> How is that a bad thing?
> 
> > Also it does not measure the hard case when you allocate memory in
> > one thread.
> 
> It does that in bench-malloc.  Or maybe I don't understand what you
> mean.
>
Thread A allocates memory and thread B deallocates it. In the current
implementation both will contend on the same lock.
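
To make that hard case concrete, here is a minimal sketch of the
pattern (the queue, block size and iteration count are made up for
illustration, not taken from the proposed benchmark):

#include <pthread.h>
#include <stdlib.h>

#define NITERS  100000
#define NBLOCKS 1024

static void *queue[NBLOCKS];
static size_t head, tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

/* Thread A: every block is allocated here, in this thread's arena.  */
static void *
producer (void *arg)
{
  for (int i = 0; i < NITERS; i++)
    {
      void *p = malloc (64);
      pthread_mutex_lock (&qlock);
      while (head - tail == NBLOCKS)     /* Queue full.  */
        pthread_cond_wait (&qcond, &qlock);
      queue[head++ % NBLOCKS] = p;
      pthread_cond_signal (&qcond);
      pthread_mutex_unlock (&qlock);
    }
  return NULL;
}

/* Thread B: every block is freed here, back into A's arena, so both
   threads keep hitting that arena's lock.  */
static void *
consumer (void *arg)
{
  for (int i = 0; i < NITERS; i++)
    {
      pthread_mutex_lock (&qlock);
      while (head == tail)               /* Queue empty.  */
        pthread_cond_wait (&qcond, &qlock);
      void *p = queue[tail++ % NBLOCKS];
      pthread_cond_signal (&qcond);
      pthread_mutex_unlock (&qlock);
      free (p);
    }
  return NULL;
}

int
main (void)
{
  pthread_t a, b;
  pthread_create (&a, NULL, producer, NULL);
  pthread_create (&b, NULL, consumer, NULL);
  pthread_join (a, NULL);
  pthread_join (b, NULL);
  return 0;
}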

> > I looked on multithread benchmark and it has additional flaws:
> > 
> > Big variance: the running time varies by around 10% across
> > iterations, depending on how the kernel schedules the threads.
> 
> Kernel scheduling may not be the most important factor in the
> variance.  The major factor would be the points at which the arena
> has to be extended, and the performance of those syscalls.
>
It would be, if you measured the correct thing, which you do not. Did
you profile this benchmark to be sure of that? Either way, you need a
smaller variance here.
 
> > Running threads and measuring the time after you join them measures
> > the slowest thread, so at the end some cores are idle.
> 
> How does that matter?
>
Because scheduling could make a difference. Simple case: you have
three threads and two cores, and each thread takes one unit of time.
If you run two threads in parallel and then the third alone, the total
time is two units. If instead you run half of A and half of B, then
half of B and half of C, then half of A and half of C, you can finish
in 1.5 units.
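
One way around this is to have each thread time its own loop and
report per-thread numbers instead of one wall-clock figure taken after
the joins.  A rough sketch; do_allocations here is a made-up
placeholder for the benchmark's per-thread workload:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder workload; the real benchmark's allocation loop would
   go here.  */
static void
do_allocations (void)
{
  for (int i = 0; i < 100000; i++)
    free (malloc (64));
}

/* Each thread records its own elapsed time, so the result does not
   depend on how long the slowest thread keeps the other cores idle
   before pthread_join returns.  */
static void *
bench_thread (void *arg)
{
  double *elapsed = arg;
  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  do_allocations ();
  clock_gettime (CLOCK_MONOTONIC, &end);
  *elapsed = (end.tv_sec - start.tv_sec)
             + (end.tv_nsec - start.tv_nsec) * 1e-9;
  return NULL;
}

int
main (void)
{
  enum { NTHREADS = 4 };
  pthread_t t[NTHREADS];
  double times[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, bench_thread, &times[i]);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);
  /* Report per-thread times rather than a single joined figure.  */
  for (int i = 0; i < NTHREADS; i++)
    printf ("thread %d: %g s\n", i, times[i]);
  return 0;
}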

> > Bad units: when I run the benchmark with a single thread, the mean
> > is:
> > "mean": 91.605,
> > However, when we run 32 threads, it looks as if malloc speeds up
> > around three times:
> >  "mean": 28.5883,
> 
> Why do you think the units are bad?  The mean time to allocate a
> single block being slower in a single thread than across multiple
> threads may have something to do with the difference between
> performance on the main arena vs non-main arenas.  The performance
> difference between mprotect and brk, their relative frequency, the
> difference in the logic to extend heaps, or finally the main arena
> defaulting to mmap when extension fails could all be factors.
> 
> That said, it may be useful to see how each thread performs
> separately.  For all we know, the pattern of allocation may somehow be
> favouring the multithreaded scenario.
> 
No, there is a simple reason for that: if you run a multithreaded
program, you need to take the number of cores into account.
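
(To spell out the arithmetic: if the reported "mean" is wall-clock
time divided by the total number of allocations across all threads,
which is a guess about how the figure is computed, then with 32
threads the denominator grows 32-fold while the wall time does not,
because several allocations make progress in parallel.  91.605 /
28.5883 is about 3.2, consistent with roughly three allocations
proceeding in parallel on this machine, not with malloc itself being
three times faster.)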


> > No, that was a benchmark that I posted which measured exactly what
> > happens at given sizes.
> 
> Post it again and we can discuss it?  IIRC it was similar to this
> benchmark with random sizes, but maybe I misremember.
> 
> > > However if you do want to show resource usage, then address space
> > > usage (VSZ) might show scary numbers due to the per-thread arenas, but
> > > they would be much more representative.  Also, it might be useful to
> > > see how address space usage scales with threads, especially for
> > > 32-bit.
> > >
> > Still, this would be worse than useless, as it would differ wildly
> > from real behaviour (for example, it is typical that allocations
> > made in quick succession will also be deallocated in quick
> > succession), and that would cause us to implement something that
> > actually increases memory usage.
> 
> It would be a concern if we were measuring memory usage over time.
> Looking at just maximum usage does not have that problem.
>
No, it is a problem even with maximum usage; why do you think it is
different?

When you do a hundred allocations of size A, then a hundred of size B,
then free all of A and do a hundred allocations of size C, it is more
memory friendly than if you mixed the allocations of A and B with the
frees of A.
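
As a concrete sketch of the two patterns (the sizes and counts are
arbitrary):

#include <stdlib.h>

#define N 100

/* Phased pattern: blocks of one size are allocated and freed
   together, so the freed A blocks coalesce into a contiguous run
   that the later C allocations can reuse.  */
static void
phased (void)
{
  void *a[N], *b[N], *c[N];
  for (int i = 0; i < N; i++)
    a[i] = malloc (32);                /* Size A.  */
  for (int i = 0; i < N; i++)
    b[i] = malloc (48);                /* Size B.  */
  for (int i = 0; i < N; i++)
    free (a[i]);                       /* All of A freed at once.  */
  for (int i = 0; i < N; i++)
    c[i] = malloc (64);                /* C can reuse A's space.  */
  for (int i = 0; i < N; i++)
    {
      free (b[i]);
      free (c[i]);
    }
}

/* Interleaved pattern: each freed A block leaves a hole between live
   B blocks, fragmenting the heap and making the space harder to
   reuse for differently sized requests.  */
static void
interleaved (void)
{
  void *b[N];
  for (int i = 0; i < N; i++)
    {
      void *a = malloc (32);
      b[i] = malloc (48);
      free (a);
    }
  for (int i = 0; i < N; i++)
    free (b[i]);
}

int
main (void)
{
  phased ();
  interleaved ();
  return 0;
}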

