This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Add malloc micro benchmark


On 12/01/2017 05:51 AM, Wilco Dijkstra wrote:
> Add a malloc micro benchmark to enable accurate testing of the
> various paths in malloc and free.  The benchmark does a varying
> number of allocations of a given block size, then frees them again.
> It does so for several block sizes and number of allocated blocks.
> Although the test is single-threaded, it also tests what happens
> when you disable single-threaded fast paths (ie. SINGLE_THREAD_P
> is false).
> 
> OK for commit?

High level:

This test is a long time coming and is a great idea.

My "big" question here is: What are we trying to model?

Do we want to prove that the single threaded optimizations you
added are helping a given size class of allocations?

You are currently modeling a workload that has increasing
memory size requests and in some ways this is an odd workload
that has high external fragmentation characteristics. For example
after allocating lots of 256 byte blocks we move on to 1024 byte
blocks, with the latter being unusable unless we coalesce.

We need to discuss what we want to model here, and I mention that
below in my larger comment.

I have no objection to modeling what you have here, but if we do
this then you need a big paragraph of text explaining why this
particular case is important to you at ARM, or to yourself as a
developer.

Design:

Overall this looks good.

I like that you test with just one thread and then again after
having created a thread.

I *wish* we could test main_arena vs. threaded arena, since they
have different code and behave differently e.g. sbrk vs. mmap'd
heap.

Implementation:

You need to make this robust against env vars changing malloc
behaviour. You should use mallopt to change some parameters.

> Typical output on an x64 box:
> {
>  "timing_type": "hp_timing",
>  "functions": {
>   "malloc": {
>    "malloc_block_size_0016": {
>     "st_num_allocs_0025_time": 53.5486,
>     "st_num_allocs_0100_time": 57.2553,
>     "st_num_allocs_0400_time": 57.3204,
>     "st_num_allocs_1000_time": 57.2059,
>     "mt_num_allocs_0025_time": 87.7903,
>     "mt_num_allocs_0100_time": 100.772,
>     "mt_num_allocs_0400_time": 103.827,
>     "mt_num_allocs_1000_time": 104.812
>    },
>    "malloc_block_size_0256": {
>     "st_num_allocs_0025_time": 78.3056,
>     "st_num_allocs_0100_time": 85.6392,
>     "st_num_allocs_0400_time": 91.5187,
>     "st_num_allocs_1000_time": 163.458,
>     "mt_num_allocs_0025_time": 115.925,
>     "mt_num_allocs_0100_time": 140.735,
>     "mt_num_allocs_0400_time": 152.044,
>     "mt_num_allocs_1000_time": 225.118
>    },
>    "malloc_block_size_1024": {
>     "st_num_allocs_0025_time": 113.705,
>     "st_num_allocs_0100_time": 103.79,
>     "st_num_allocs_0400_time": 479.029,
>     "st_num_allocs_1000_time": 634.228,
>     "mt_num_allocs_0025_time": 145.807,
>     "mt_num_allocs_0100_time": 151.157,
>     "mt_num_allocs_0400_time": 526.499,
>     "mt_num_allocs_1000_time": 687.357
>    },
>    "malloc_block_size_4096": {
>     "st_num_allocs_0025_time": 105.101,
>     "st_num_allocs_0100_time": 1640.23,
>     "st_num_allocs_0400_time": 2411.26,
>     "st_num_allocs_1000_time": 2641.56,
>     "mt_num_allocs_0025_time": 156.323,
>     "mt_num_allocs_0100_time": 1702.94,
>     "mt_num_allocs_0400_time": 2453,
>     "mt_num_allocs_1000_time": 2676.75
>    }
>   }
>  }
> }
> 
> Note something very bad happens for the larger allocations, there
> is a 25x slowdown from 25 to 400 allocations of 4KB blocks...

Keep in mind you are testing the performance of sbrk here. In a threaded
arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then
draws allocations from it. So in some ways main_arena is expenseive,
but both have to pay a page-touch cost...

For each 4KiB block you touch the block to write the co-located metadata
and that forces the kernel to give you a blank page, which you then do 
nothing with. Then you repeat the above again.

For all other sizes you amortize the cost of the new page among
several allocations.

Do you have any other explanation?

At some point you will hit the mmap threshold and the cost of the
allocation will skyrocket as you have to call mmap.

> ChangeLog:
> 2017-12-01  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* benchtests/Makefile: Add malloc-simple benchmark.
> 	* benchtests/bench-malloc-simple.c: New benchmark.
> --
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index d8681fce8cf399bc655f3f6a7717897eb9c30619..a4b2573cfa706bd6369063a995d512e0947c7bd5 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -67,8 +67,10 @@ stdio-common-benchset := sprintf
>  
>  math-benchset := math-inlines
>  
> +malloc-benchset := malloc-simple
> +

OK.

>  benchset := $(string-benchset-all) $(stdlib-benchset) $(stdio-common-benchset) \
> -	    $(math-benchset)
> +	    $(math-benchset) $(malloc-benchset)

OK.

>  
>  CFLAGS-bench-ffs.c += -fno-builtin
>  CFLAGS-bench-ffsll.c += -fno-builtin
> @@ -86,6 +88,7 @@ $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
>  $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
>  $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
>  $(objpfx)bench-malloc-thread: $(shared-thread-library)
> +$(objpfx)bench-malloc-simple: $(shared-thread-library)

OK.

>  
>  
>  
> diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e786ddd9635f835b2f01b00a80f3cf0d2de82d48
> --- /dev/null
> +++ b/benchtests/bench-malloc-simple.c
> @@ -0,0 +1,152 @@
> +/* Benchmark malloc and free functions.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include "bench-timing.h"
> +#include "json-lib.h"
> +
> +#define NUM_ITERS 1000000
> +#define NUM_ALLOCS 4
> +#define NUM_SIZES  4
> +#define MAX_ALLOCS 1000
> +
> +typedef struct
> +{
> +  size_t iters;
> +  size_t size;
> +  int n;
> +  timing_t elapsed;
> +} malloc_args;
> +
> +static void
> +do_benchmark (malloc_args *args, int **arr)
> +{
> +  timing_t start, stop;
> +  size_t iters = args->iters;
> +  size_t size = args->size;
> +  int n = args->n;
> +
> +  TIMING_NOW (start);
> +
> +  for (int j = 0; j < iters; j++)
> +    {
> +      for (int i = 0; i < n; i++)
> +        arr[i] = malloc (size);
> +
> +      for (int i = 0; i < n; i++)
> +        free (arr[i]);
> +    }
> +
> +  TIMING_NOW (stop);
> +
> +  TIMING_DIFF (args->elapsed, start, stop);
> +}
> +
> +static malloc_args tests[2][NUM_SIZES][NUM_ALLOCS];
> +static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS };
> +static size_t sizes[NUM_SIZES] = { 16, 256, 1024, 4096 };

In glibc we have:

tcache -> fastbins -> smallbins -> largbing -> unordered -> mmap

If you proceed through from small allocations to larger allocations
you will create chunks that cannot be used by future allocations.
In many cases this is a worst case performance bottleneck. The
heap will contain many 256 byte allocations but these cannot service
the 1024 bytes, that is unless consolidation has been run. So this
tests the consolidation as much as anything else, which might not
trigger because of the free thresholds required.

So what are we trying to model here?

If we want to look at the cost of independent size class allocations
then we need a clean process and allocate only a given size, and look
at performance across the number of allocations.

In which case we need to do:

* Spawn new process with size as an argument.
* Have the new process track performance at N allocations of the
  same size.
* Record result.
* Increase size.
* Repeat.

This way each new spawned subprocess is "clean" and we exercise
a particular size class of allocations.

I would also have much finer grained allocations by powers of 2.
2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4092 etc. You want
to see what happens for the allocations which are:

* Less than a chunk size.
* Fit in the tcache?
* Fit in the fastbins?
* Fit in the smallbins?
* Fit in the fastbins?
* Fit only in mmap? (allocation specifically larger than a
  MALLOC_MMAP_THRESHOLD_ you set).

Would this serve better to show that your single threaded malloc
changes were helpful for a given size class?

> +
> +static void *
> +dummy (void *p)
> +{
> +  return p;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  size_t iters = NUM_ITERS;
> +  int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*));
> +  unsigned long res;
> +

You need to use mallopt to make sure the user's environment
did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your
maximum allocation size.

Similarly any other mallopt parameter you think is important
needs to be set.

If you spawned a clean subprocess you could clean the env var,
and that might actually be easier from a maintenance perspective.

> +  TIMING_INIT (res);
> +  (void) res;
> +
> +  for (int t = 0; t < 2; t++)
> +    for (int j = 0; j < NUM_SIZES; j++)
> +      for (int i = 0; i < NUM_ALLOCS; i++)
> +	{
> +          tests[t][j][i].n = allocs[i];
> +	  tests[t][j][i].size = sizes[j];
> +	  tests[t][j][i].iters = iters / allocs[i];
> +
> +	  /* Do a quick warmup run.  */
> +	  if (t == 0)
> +	    do_benchmark (&tests[0][j][i], arr);
> +	}
> +
> +  /* Run benchmark single threaded.  */
> +  for (int j = 0; j < NUM_SIZES; j++)
> +    for (int i = 0; i < NUM_ALLOCS; i++)
> +      do_benchmark (&tests[0][j][i], arr);
> +
> +  /* Create an empty thread so SINGLE_THREAD_P becomes false.  */
> +  pthread_t t;
> +  pthread_create(&t, NULL, dummy, NULL);
> +  pthread_join(t, NULL);
> +
> +  /* Repeat benchmark with SINGLE_THREAD_P == false.  */
> +  for (int j = 0; j < NUM_SIZES; j++)
> +    for (int i = 0; i < NUM_ALLOCS; i++)
> +      do_benchmark (&tests[1][j][i], arr);
> +
> +  free (arr);
> +
> +  json_ctx_t json_ctx;
> +
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  json_attr_object_begin (&json_ctx, "functions");
> +
> +  json_attr_object_begin (&json_ctx, "malloc");
> +
> +  for (int j = 0; j < NUM_SIZES; j++)
> +    {
> +      char s[100];
> +      double iters2 = iters;
> +      sprintf (s, "malloc_block_size_%04ld", sizes[j]);
> +      json_attr_object_begin (&json_ctx, s);
> +
> +      for (int i = 0; i < NUM_ALLOCS; i++)
> +	{
> +	  sprintf (s, "st_num_allocs_%04d_time", allocs[i]);
> +	  json_attr_double (&json_ctx, s, tests[0][j][i].elapsed / iters2);
> +	}
> +
> +      for (int i = 0; i < NUM_ALLOCS; i++)
> +        {
> +          sprintf (s, "mt_num_allocs_%04d_time", allocs[i]);
> +          json_attr_double (&json_ctx, s, tests[1][j][i].elapsed / iters2);
> +        }
> +
> +      json_attr_object_end (&json_ctx);
> +    }
> +
> +  json_attr_object_end (&json_ctx);
> +
> +  json_attr_object_end (&json_ctx);
> +
> +  json_document_end (&json_ctx);
> +  return 0;
> +}
> 


-- 
Cheers,
Carlos.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]