
Re: [PATCH v2] Add malloc micro benchmark


On 01/02/2018 10:20 AM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
> 
>> If you have a pattern of malloc/free of *similarly* sized blocks, then
>> it overflows that size's bin in the tcache, with the other size bins
>> remaining empty. The cache does not dynamically reconfigure itself to
>> consume X MiB or Y % of RSS; instead it uses a simple data structure to
>> contain a fixed number of fixed-size blocks.
>>
>> Therefore I agree that enhancing the core data structure in tcache may
>> result in better overall performance, particularly if we got rid of the
>> fixed bin sizes and instead found a way to be performant *and* keep a
>> running total of consumption.
> 
> Well, it could keep track of the sum of all block sizes and limit that. That
> would be better than a fixed limit on the number of blocks in each bin.

Yes. In practice what we want to enforce is a percentage of total RSS, or a
fixed X MiB of RSS, held in the thread caches, and we want that limit to be a
per-thread variable that can be changed for each thread depending on its
workload.

e.g.

Threads 1, 2, and 3 each need 5% of RSS for their workloads.

Threads 4 and 5 each need 25% of RSS for their workloads.
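
To make that concrete, here is a minimal sketch (all names hypothetical, not
glibc's actual tcache code, which lives in malloc/malloc.c) of a per-thread
cache capped by total cached bytes rather than a per-bin count:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-list cache; blocks are assumed large enough to
   hold a next pointer, as real malloc chunks are.  */
typedef struct cache_entry { struct cache_entry *next; } cache_entry;

static __thread cache_entry *cache_head;
static __thread size_t cache_bytes;                 /* running total */
static __thread size_t cache_max_bytes = 64 * 1024; /* per-thread cap */

/* Accept a freed block into the cache only while under the byte
   budget; otherwise the caller frees it to the arena as usual.  */
static bool
cache_try_put (void *block, size_t usable_size)
{
  if (cache_bytes + usable_size > cache_max_bytes)
    return false;
  cache_entry *e = block;
  e->next = cache_head;
  cache_head = e;
  cache_bytes += usable_size;
  return true;
}

A tunable or mallopt hook could then set cache_max_bytes differently for each
thread.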

Obviously, if RSS is very low, then our 32 MiB/64 MiB starting heaps may be
unusable. They could also be tuned, but that is a decision that can only be
made once, at startup, because we embed the heap size in the pointer
arithmetic used to find the arena structure's address. It could be changed,
but doing so would increase costs.
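
For reference, the arithmetic in question is roughly the alignment trick in
malloc/arena.c: non-main heaps are mmapped aligned to HEAP_MAX_SIZE, so the
owning heap_info (and from it the arena) is found by masking the address:

/* From malloc/arena.c: breaks if the heap size can change at runtime.  */
#define heap_for_ptr(ptr) \
  ((heap_info *) ((unsigned long) (ptr) & ~(HEAP_MAX_SIZE - 1)))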

>> Likewise *all* of malloc needs to be moved to a better data structure than
>> just linked lists. I would like to see glibc's malloc offer a caching
>> footprint of no more than Y % of RSS available, and let the user tweak that.
>> Currently we just consume RSS without much regard for overhead. Though this
>> is a different case than what you are talking about, the changes are
>> related via data-structure enhancements that would benefit both cases IMO.
> 
> What kind of data structure do you have in mind? Small blocks could be
> allocated together in pages. This would avoid the per-block overhead and
> change the linked list into a bitmap scan. However, there aren't that many
> alternatives. I've written first-fit allocators using autobalancing trees,
> but walking the tree is expensive due to not having good locality.

Keep in mind we are deviating from the topic at hand now, but I'll lay out
four improvements, one of which you already touched upon.

(1) Small blocks cost too much RSS.

I have had customers hit terrible corner cases with C++ and small blocks,
seeing ~50% waste due to co-located metadata, e.g. new of 13-byte objects.

Packing small allocations together in a shared block would be a big win, and
using some kind of bitmap would be my suggested solution (see the sketch
after the list below). This requires a data structure to track the parent
allocation, the bitmaps, and the sub-allocations. We have some of these data
structures in place, but no easy, generic way to use them (I think Florian's
push for more general data structure use in glibc, e.g. dynarray, is the
right way to go).

I think that for blocks smaller than the fundamental language types (which
are what require malloc to return 16-byte-aligned memory) we do not have to
return such strictly aligned memory. For example, if you allocate a 3-byte or
a 13-byte block, you cannot possibly put a 16-byte long double there, nor can
you use it for a stack block, so guaranteeing the alignment is a waste.

* Group small allocations.
* Violate the ABI and use alignments smaller than MALLOC_ALIGNMENT for
  sub-group members.
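
A minimal sketch of the bitmap idea (all names hypothetical; a real version
also needs a way to find the parent page from a pointer, e.g. by aligning the
page and masking the address like heap_for_ptr above):

#include <stddef.h>
#include <stdint.h>

enum { SLOT_SIZE = 13, NSLOTS = 64 };

/* One parent allocation holding 64 packed 13-byte sub-allocations:
   no per-block header, slots deliberately under-aligned.  A fresh
   page starts with free_mask == ~(uint64_t) 0 (all slots free).  */
typedef struct small_page
{
  uint64_t free_mask;                      /* bit i set: slot i free */
  unsigned char slots[NSLOTS][SLOT_SIZE];
} small_page;

static void *
small_alloc (small_page *p)
{
  if (p->free_mask == 0)
    return NULL;                           /* page is full */
  int i = __builtin_ctzll (p->free_mask);  /* first free slot */
  p->free_mask &= p->free_mask - 1;        /* clear that bit */
  return p->slots[i];
}

static void
small_free (small_page *p, void *ptr)
{
  size_t i = ((unsigned char *) ptr - &p->slots[0][0]) / SLOT_SIZE;
  p->free_mask |= (uint64_t) 1 << i;
}

With this layout the metadata cost is 8 bytes per 64 blocks, and a 13-byte
object occupies 13 bytes rather than the 32-byte chunk it gets today on a
typical 64-bit build.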

(2) Track total memory used, and free memory back to the OS based on more
consistent heuristics.

Again, we have customers who complain that glibc's malloc is bad for container
or VM workloads with tight packing because it just keeps allocating more heaps.

We could do better at tracking runs of free pages, and when we exceed some
%RSS threshold, free the larger runs back to the OS; malloc_trim() already
does this, but in a more costly manner, by walking every arena's heaps from
start to end.
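
As a strawman for the heuristic (hypothetical helper; the real win would come
from tracked free-page runs replacing the full walk inside malloc_trim):

#include <malloc.h>
#include <stdio.h>
#include <unistd.h>

/* Release cached memory once RSS crosses a limit.  Reads resident
   pages from /proc/self/statm (second field) on Linux.  */
static void
trim_if_over (size_t rss_limit_bytes)
{
  long pages = 0;
  FILE *f = fopen ("/proc/self/statm", "r");
  if (f == NULL)
    return;
  if (fscanf (f, "%*ld %ld", &pages) != 1)
    pages = 0;
  fclose (f);
  if ((size_t) pages * (size_t) sysconf (_SC_PAGESIZE) > rss_limit_bytes)
    malloc_trim (0);   /* today: walks the arena heaps end to end */
}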

This isn't explicitly about performance.

(3) Make arenas truly per-CPU data structures.

We do not want per-thread arenas.

We want per-thread caches (avoids locks).

We want per-CPU arenas (avoids NUMA issues and gets better locality).

The per-thread cache does the job of caching the thread-local requests
and avoiding locks.

The per-cpu arenas do the job of containing cpu-local memory, and handling
requests for the threads that reside on that CPU, either pinned, or there
temporarily.

Memory being returned should go back to the arena of the CPU it came from.

The best we have done to date is to have the arenas scale with the number of
CPUs, let them fall where they may across the cores, and let the NUMA page
scheduler move the memory around to minimize cost.

The best way would be to use something like restartable sequences to get the
CPU number, select the matching arena, and allocate from there, perhaps
choosing another arena next time.
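
A sketch of the selection step, using sched_getcpu as a stand-in for
restartable sequences (the arena type and arenas array are hypothetical; with
rseq the cpu-to-arena step becomes race-free, while with sched_getcpu a
migration between the call and the lock merely costs locality, not
correctness, since the arena is still locked normally):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

typedef struct arena { pthread_mutex_t lock; /* bins, top, ... */ } arena;

extern arena arenas[];        /* hypothetical: one arena per CPU */

static arena *
arena_for_this_cpu (void)
{
  int cpu = sched_getcpu ();  /* may be stale by the time we lock */
  if (cpu < 0)
    cpu = 0;                  /* getcpu failed: fall back to arena 0 */
  return &arenas[cpu];
}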

(4) Distribute threads among arenas.

We have no data structure for balancing arena usage across threads.

We have seen workloads where the maximum number of arenas is allocated, but
threads don't move smoothly across arenas; instead the first unlocked arena
is chosen, and that might not be the best one from a memory locality
perspective (see (3)).
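
For contrast, the current policy amounts to something like this sketch
(continuing the hypothetical arena type above; the real logic is
reused_arena in malloc/arena.c):

static arena *
pick_first_unlocked (arena *list, int narenas)
{
  /* First arena whose lock we can grab wins, regardless of which
     CPU or NUMA node its memory lives on.  */
  for (int i = 0; i < narenas; i++)
    if (pthread_mutex_trylock (&list[i].lock) == 0)
      return &list[i];
  /* All contended: block on one instead of growing the arena list.  */
  pthread_mutex_lock (&list[0].lock);
  return &list[0];
}

Starting that scan at the current CPU's arena (per (3)) would fold locality
into the same loop.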

> Add a malloc micro benchmark to enable accurate testing of the
> various paths in malloc and free.  The benchmark does a varying
> number of allocations of a given block size, then frees them again.
> 
> It tests 4 different scenarios: single-threaded using main arena,
> multi-threaded using thread-arena, main arena with SINGLE_THREAD_P
> false and main arena with the tcache count set larger than the
> default.  To enable this, add support for M_TCACHE_COUNT in mallopt.
> 
> OK for commit?
> 
> ChangeLog:
> 2018-01-02  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* benchtests/Makefile: Add malloc-simple benchmark.
> 	* benchtests/bench-malloc-simple.c: New benchmark.
> 	* malloc/malloc.h (M_TCACHE_COUNT): Add new define.
> 	* malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT.

I don't like the creation of a new public ABI via M_TCACHE_COUNT through mallopt.

I suggest splitting these tests apart and using the tunable env var to touch this.
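
For example (assuming a tunables-enabled build), running the benchmark under
GLIBC_TUNABLES=glibc.malloc.tcache_count=1000 would exercise the larger cache
without committing to any new mallopt ABI.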

See my other email.

-- 
Cheers,
Carlos.

