This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] benchtests: Add malloc microbenchmark


On Tue, Jun 10, 2014 at 06:25:44AM +0530, Siddhesh Poyarekar wrote:
> On Tue, Jun 10, 2014 at 02:25:39AM +0200, Ondřej Bílka wrote:
> > The problem is that this benchmark does not measure multithreaded
> > performance well.  Just spawning many threads does not say much; my
> > guess is that locking will quickly cause convergence to a state where
> > each core runs a thread with a separate arena.
>
> How is that a bad thing?

It does not happen in the wild: if you have many threads that mostly wait
for IO, then you will not get that synchronization and there could be
conflicts.

> > > It does that in bench-malloc.  Or maybe I don't understand what you
> > > mean.
> > >
> > Thread A allocates memory and thread B deallocates it.  In the current
> > implementation both will contend for the same lock.
> 
> OK, yes this is not measured, but it could be added on top of the
> current benchmark.
> 
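For illustration, a minimal sketch (not part of the patch; the names and
the hand-rolled handoff are made up) of the pattern described above, where
one thread only allocates and the other only frees:

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define N 1000000

static void *ptrs[N];
static atomic_int produced = -1;   /* index of the last slot filled */

/* Thread A: allocate into a shared array.  */
static void *alloc_thread (void *arg)
{
  for (int i = 0; i < N; i++)
    {
      ptrs[i] = malloc (42);
      atomic_store (&produced, i);
    }
  return NULL;
}

/* Thread B: free what A allocated.  Each chunk goes back to the arena it
   came from, so both threads keep taking the same arena lock.  */
static void *free_thread (void *arg)
{
  for (int i = 0; i < N; i++)
    {
      while (atomic_load (&produced) < i)
        ;                        /* busy-wait until slot i is filled */
      free (ptrs[i]);
    }
  return NULL;
}

int main (void)
{
  pthread_t a, b;
  pthread_create (&a, NULL, alloc_thread, NULL);
  pthread_create (&b, NULL, free_thread, NULL);
  pthread_join (a, NULL);
  pthread_join (b, NULL);
  return 0;
}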
There is a limited set of criteria that you can consider for comparison;
some are more important than others, and this one is not very important.

A more pressing criterion is the running time of a gcc compilation.
Currently, if you replace the allocator with tcmalloc, the running time
improves by 10%, which is huge.  Closing gaps like that should be the
top priority.

Then there are various pathological cases; for example, the following is
twice as slow when the two realloc loops run in parallel as when they run
one after the other.

#include <pthread.h>
#include <stdlib.h>

/* Repeatedly resize the same allocation.  */
void *reloc (void *x)
{
  int i;
  for (i = 0; i < 1000000; i++)
    x = realloc (x, 42);
  return x;
}

pthread_t thread;

int main (int argc, char **argv)
{
  void *x = malloc (42);
  void *y = malloc (42);

  pthread_create (&thread, NULL, reloc, x);
#ifdef CONTENDED
  /* Run both realloc loops at the same time.  */
  reloc (y);
  pthread_join (thread, NULL);
#else
  /* Run the second loop only after the first thread has finished.  */
  pthread_join (thread, NULL);
  reloc (y);
#endif
  return 0;
}
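
(It can be built both ways with something like gcc -O2 -pthread
[-DCONTENDED] and timed with time; only the -DCONTENDED build runs the
two loops concurrently.)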

Then there are possible optimizations of cache locality, prefetching and
so on, and evaluating those needs a separate benchmark.

For patches there will almost always be an overriding concern like this,
and numbers for uncontended performance would be rightfully ignored.


> > > > Running threads and measuring time after you join them measures the
> > > > slowest thread, so at the end some cores are idle.
> > > 
> > > How does that matter?
> > >
> > Because scheduling could make a difference.  Simple case: you have three
> > threads and two cores, and each thread takes one unit of time.  If you
> > run two threads in parallel and then the third, the total time is two
> > units.  If you run half of A and B, then half of B and C, and then half
> > of A and C, you could finish in 1.5 units.
> 
> OK, measuring at individual threads should alleviate this.
> 
> > No, there is a simple reason for that.  If you run a multithreaded
> > program you need to take the number of cores into account.
> 
> Yes, the number of cores is in fact very important, thanks for pointing
> out.  Simply recording it in the results should be a good start.
>
The point is that you should report the correct mean time.  This should
raise a red flag, as 30 instructions are barely enough: a single compare
and swap takes at least 20 cycles, and malloc now does three of them for
fastbins.
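
As a sketch of what measuring at individual threads could look like (the
names, thread count and sizes here are made up, not from the patch): each
thread takes its own CLOCK_MONOTONIC timestamps around its loop, and the
harness reports the mean of the per-thread times instead of the join time
of the slowest thread:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4
#define ITERS    1000000

static double elapsed[NTHREADS];        /* per-thread wall time, seconds */

static void *worker (void *arg)
{
  int id = (int) (long) arg;
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < ITERS; i++)
    free (malloc (42));
  clock_gettime (CLOCK_MONOTONIC, &end);

  elapsed[id] = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
  return NULL;
}

int main (void)
{
  pthread_t threads[NTHREADS];
  double sum = 0.0;

  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&threads[i], NULL, worker, (void *) (long) i);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);

  for (int i = 0; i < NTHREADS; i++)
    sum += elapsed[i];

  /* Mean of the per-thread times, not the time of the slowest thread.  */
  printf ("mean per-thread time: %f s\n", sum / NTHREADS);
  return 0;
}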

> > > It would be a concern if we were measuring memory usage over time.
> > > Looking at just maximum usage does not have that problem.
> > >
> > No, it is a problem even with maximum usage; why do you think it is
> > different?
> > 
> > When you do a hundred allocations of size A, then a hundred of size B,
> > then free all of A and do a hundred allocations of size C, it is more
> > memory friendly than if you mixed the allocations of A and B with the
> > frees of A.
> 
> OK, I misunderstood your point again, or maybe you can tell me if I
> did.  I was referring to maximum usage as a general measure given that
> allocation pattern is fixed while you're referring to comparison of
> maximum usage given different allocation patterns.  Given that the
> random number is not seeded, the test should create a constant
> allocation pattern.
> 
You do benchmarks to compare memory usage.  When the allocation pattern
stays fixed and the algorithm that determines the addresses is also fixed,
then you will get a constant, as you said.  But then you should omit it,
as it does not matter.

And the natural question is: when you change the algorithm in a way that
changes the allocation pattern, how would you check that?  Specialized
benchmarks are necessary here, rather than one that does the comparison
badly and would be ignored in most cases.
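
As an illustration of the kind of specialized benchmark meant here (a
sketch with made-up sizes and counts, not the proposed benchmark): run
the same set of allocations either grouped or interleaved, one pattern
per process run, and compare the peak RSS that getrusage reports:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

#define COUNT  100000
#define SIZE_A 100
#define SIZE_B 200
#define SIZE_C 300

static void *a[COUNT], *b[COUNT], *c[COUNT];

/* Touch the memory so it is actually counted in the RSS.  */
static void *alloc_touched (size_t size)
{
  return memset (malloc (size), 0, size);
}

int main (int argc, char **argv)
{
  int interleaved = argc > 1 && strcmp (argv[1], "interleaved") == 0;

  if (!interleaved)
    {
      /* Grouped: all of A, then all of B, then free A, then all of C.  */
      for (int i = 0; i < COUNT; i++) a[i] = alloc_touched (SIZE_A);
      for (int i = 0; i < COUNT; i++) b[i] = alloc_touched (SIZE_B);
      for (int i = 0; i < COUNT; i++) free (a[i]);
      for (int i = 0; i < COUNT; i++) c[i] = alloc_touched (SIZE_C);
    }
  else
    {
      /* Interleaved: allocations of A and B mixed with frees of A, so the
         freed A chunks stay as holes between the live B chunks.  */
      for (int i = 0; i < COUNT; i++)
        {
          a[i] = alloc_touched (SIZE_A);
          b[i] = alloc_touched (SIZE_B);
          free (a[i]);
        }
      for (int i = 0; i < COUNT; i++) c[i] = alloc_touched (SIZE_C);
    }

  struct rusage ru;
  getrusage (RUSAGE_SELF, &ru);
  /* On Linux ru_maxrss is in kilobytes.  */
  printf ("%s peak RSS: %ld kB\n",
          interleaved ? "interleaved" : "grouped", ru.ru_maxrss);
  return 0;
}

Running it twice, once without and once with the "interleaved" argument,
gives two peaks to compare for the same total amount of allocated memory.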

