This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Possible inline malloc alternatives: bitmap


On 03/01/2018 01:33 PM, Ondřej Bílka wrote:
> Like I mentioned in another thread, I plan to make an improved malloc+friends.
> It will be split into several aspects.

Great to hear that!

Please keep in mind that at Red Hat we are working on 3 relatively new
things for malloc:

* Trace functionality (part of our whole-system trace work)
* Replay functionality (our simulator)
* Heap dumping functionality

Only the heap dumper will be impacted by any work you do, and that is OK.
All we ask is that you consider including us in your work so we can stay
abreast of the changes you are thinking about and comment on how the
functionality might impact accurate dumping.

> This one is about an option to inline small-size allocations, which would
> give the same performance as using a memory pool.

What you describe below is a page-based allocator with a bit-map of the
in-use allocations. I have no objection to such enhancements for glibc's
malloc. However, these enhancements, like the tcache ones, must be driven
by collected workloads which can be run by others to verify the work yields
performance benefits across their machines.

When we did the tcache work we collected dozens of workloads, ran them
through trace, and re-simulated them on various architectures to look at
the performance characteristics of tcache.

Similarly, any future work should be driven by workloads and performance
measurements against those workloads. We can provide the workload data for
the workloads we used for tcache if you need them.

It was the workloads, for example, that showed that even with tcache, fastbins
were *still* a win from a performance perspective.

The harder work that I face today is not about performance, though. The
harder work, and I hope you can help us with this, is *limiting* RSS
consumption and improving the core algorithms used in malloc so that RSS
can be limited without an immense cost to performance.

Developers want to deploy memory-limited systems, and glibc's malloc does not
always consider RSS costs. Some issues around this are:

* fastbins prevent freeing down of the heap from the top.
* fastbins can grow unbounded.
* heaps can only be freed from the top down (unless you call malloc_trim(),
  which can find sub-pages in the heap to free back; a short sketch of using
  it follows this list).
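
For reference, about the best an application can do today with the existing
API is something like the following minimal sketch, using the documented
mallopt() and malloc_trim() interfaces:

#include <malloc.h>   /* glibc: mallopt, malloc_trim */

int
main (void)
{
  /* Return free memory at the top of the heap to the kernel sooner by
     lowering the trim threshold (the default is 128 KiB).  */
  mallopt (M_TRIM_THRESHOLD, 64 * 1024);

  /* ... allocation-heavy phase of the application ... */

  /* After a phase that freed a lot of memory, explicitly release free
     pages back to the kernel, including free runs inside the heap, not
     just at the top.  */
  malloc_trim (0);
  return 0;
}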

In summary:

Application mixing of chunks of disparate lifetimes results in fragmented
heaps for a heap-based allocator, while page-based allocators lose at most a
page for those long-lived chunks. However, lifetime information is not at all
available in the current API, and we need novel research into ways to mitigate
this issue. Segregating allocations of similar size classes seems to be the
best practice for this, and nothing stops glibc from taking the same approach,
e.g. multiple heaps for segregated size classes.
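
To illustrate that last idea, here is a purely hypothetical sketch; the names
size_class, class_heaps and heap_alloc are made up for illustration and are
not existing glibc interfaces:

#include <stddef.h>

#define NCLASSES 4

struct heap;                                      /* one independent heap per class */
extern struct heap *class_heaps[NCLASSES];        /* hypothetical per-class heaps */
extern void *heap_alloc (struct heap *, size_t);  /* hypothetical per-heap allocator */

/* Map a request size to a class so that chunks of very different sizes
   (and typically different lifetimes) do not share a heap.  */
static inline int
size_class (size_t size)
{
  if (size <= 64)   return 0;
  if (size <= 512)  return 1;
  if (size <= 4096) return 2;
  return 3;
}

static inline void *
segregated_alloc (size_t size)
{
  return heap_alloc (class_heaps[size_class (size)], size);
}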

While I am not dissuading you from working on a page-based extension to glibc's
malloc, I think tackling RSS reduction, perhaps using page-based methods, is
where the community should focus their attention. It would yield the largest
net win for our users based on the bugs and feedback we've been getting.

Again, I reiterate that it must be driven by workloads, to create a data-driven
optimization process.

> I planned to make this generic, but that isn't the case now, as I realized
> the possibility of using bit tricks which lead to a more efficient free,
> since it doesn't need to determine the size of the allocation.
>
> Representation of small sizes is simple: for each 1024-byte page there would
> be metadata with a 64-bit integer whose i-th bit is 1 if and only if there
> is a free chunk at offset 16*i of the page (64 slots of 16 bytes cover the
> 1024 bytes), which turns both allocation and free into simple bit
> manipulation.
>
> Sample code for how the inline path would look is below. To handle space
> usage, one would set up the bitmap pages in refill_buckets.
> 
> For free, the case that calls noninline_free when the bitmap is zero serves
> to register that there are some free chunks on that page.
> 
> For returning memory to the thread that created it from a different thread,
> one would use atomic operations with basically the same algorithm.
>
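
Just to be concrete about that remote-free path, here is a rough sketch using
C11 atomics; remote_free_bit is a made-up helper and it assumes atomic_map is
declared _Atomic, so treat it as an illustration rather than your code:

#include <stdint.h>
#include <stdatomic.h>

/* Set the chunk's bit in the owning page's atomic_map with a CAS loop.
   The owning thread can later fold atomic_map back into its private map.  */
static void
remote_free_bit (_Atomic uint64_t *atomic_map, unsigned slot)
{
  uint64_t old = atomic_load_explicit (atomic_map, memory_order_relaxed);
  uint64_t new;
  do
    new = old | ((uint64_t) 1 << slot);
  while (!atomic_compare_exchange_weak_explicit (atomic_map, &old, new,
                                                 memory_order_release,
                                                 memory_order_relaxed));
}
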
> For medium sizes the same idea of recording frees in a bitmap could work
> with the same code; with bit operations I could quickly merge all adjacent
> free chunks into one and delete the old entries.
> 
> This one is at the idea stage; I don't know how effective it will be to
> merge multiple adjacent frees. I could do it with the following bit tricks:
> 
> You need three bitmaps:
> 
> allocated - 1 where an allocated chunk starts
> old_free - 1 where a free chunk that is already in the data structure starts
> new_free - 1 where a freed chunk that isn't in the data structure yet
> starts.
> To handle a free area, first construct a mask with ones up to the position
> of the lowest 1 in new_free:
> 
> low_mask = new_free ^ (new_free - 1)
> 
> To get the lower end you must compare
> 
> old_free & low_mask > allocated & low_mask
> 
> If this is true then the area starts at the highest 1 bit of old_free &
> low_mask; otherwise it starts at the lowest bit of new_free. In the first
> case we need to remove that old_free element from the data structure.
> 
> Then it's easy to see that the free area would end at the position of the
> lowest 1 in the formula
> 
> allocated & (~low_mask)
> 
> Additional elements that need to be deleted from the data structure are
> determined by the following, but we accumulate them into a bitmap to_delete
> to handle them all together at the end:
> 
> to_delete |= old_free & high_end_mask & ~low_mask;
> 
> Then we mask off the handled elements of new_free and repeat while new_free
> is nonzero.
> 
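
To make sure I follow the bit tricks, here is a rough, untested sketch of that
merge step for one newly freed chunk; the helper name and the choice to also
put the lower old_free start into to_delete are my guesses, not your code:

#include <stdint.h>

/* Merge the free region containing the lowest newly freed chunk.
   Bit i of each bitmap corresponds to a chunk starting at offset 16*i.
   Reports the region as [*start_slot, *end_slot), accumulates swallowed
   old_free entries into *to_delete, and returns new_free with the handled
   bits cleared so the caller can loop while it is nonzero.  */
static uint64_t
merge_lowest_free (uint64_t allocated, uint64_t old_free, uint64_t new_free,
                   uint64_t *to_delete, unsigned *start_slot,
                   unsigned *end_slot)
{
  /* Ones up to and including the lowest 1 of new_free.  */
  uint64_t low_mask = new_free ^ (new_free - 1);

  if ((old_free & low_mask) > (allocated & low_mask))
    {
      /* The chunk just below is already free: start at the highest 1 of
         old_free & low_mask and delete that entry from the structure.  */
      *start_slot = 63 - __builtin_clzll (old_free & low_mask);
      *to_delete |= (uint64_t) 1 << *start_slot;
    }
  else
    /* Otherwise the region starts at the newly freed chunk itself.  */
    *start_slot = __builtin_ctzll (new_free);

  /* The region ends at the next allocated chunk above, or at the end of
     the page when there is none.  */
  uint64_t above = allocated & ~low_mask;
  *end_slot = above ? __builtin_ctzll (above) : 64;

  /* old_free entries inside the region are swallowed by the merge.  */
  uint64_t high_end_mask =
    (*end_slot == 64) ? ~(uint64_t) 0 : (((uint64_t) 1 << *end_slot) - 1);
  *to_delete |= old_free & high_end_mask & ~low_mask;

  /* Mask off the handled new_free bits.  */
  return new_free & ~high_end_mask;
}
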
> 
> struct page_map {
>   int thread;
>   uint64_t map;
>   uint64_t atomic_map;
> };
>
> extern inline struct page_map *get_page_map(void *p)
> {
>   size_t m = (size_t) p;
>   m = 1024 * (m / 1024) - 32;
>   return (struct page_map *) m;
> }
> 
> extern inline size_t base_addr (struct page_map *p)
> {
>    return ((size_t) p) + 32;
> }
> 
> struct page_map *buckets[256/16 + 1]; /* bucket (size + 15) / 16 runs 0..16 */
> 
> extern inline void *inl_alloc(size_t size)
> {
>   if (size < 256)
>    {
>      int bucket = (size + 15) / 16;
>      uint64_t map = buckets[bucket]->map;
>      /* Note that we handle the case of malloc(0) by always calling
>         refill_buckets(0) with a suitably filled bucket[0].  */
>      if (map == 0)
>        return refill_buckets (size);
>      uint64_t new_map = map & (map - 1);
>      size_t address = base_addr (buckets[bucket]) + 16 * __builtin_ctzl (map);
>      buckets[bucket]->map = new_map;
>      return (void *) address;
>    }
>   else
>     return big_alloc (size);
> }
> 
> extern inline void inl_free (void *p)
> {
> #ifndef PAGEMAP_NULL_VALID
>   if (p == NULL) return;
> #endif
>   struct page_map *m = get_page_map (p);
>   if (m->thread == thread_id && m->map != 0)
>     m->map |= (uint64_t) 1 << ((((size_t) p) / 16) % 64);
>   else 
>     {
>       uint64_t n = m->atomic_map;
>       uint64_t nm = n | ((uint64_t) 1 << ((((size_t) p) / 16) % 64));
>       if (n == 0 || !atomic_compare_swap(&(m->atomic_map), n, nm))
>         noninline_free(p);
>     }
> }
> 
> 


-- 
Cheers,
Carlos.

