
Re: The direction of malloc?


On Tue, Dec 10, 2013 at 10:42:54AM +0000, Will Newton wrote:
> On 10 December 2013 05:04, Carlos O'Donell <carlos@redhat.com> wrote:
> > For a long time we have refused patches to malloc because they would
> > change the formatting of the code and make it difficult to ... I can
> > only assume... synchronize with the rapid upstream development of
> > ptmalloc.
> >
> > It is my opinion that glibc should forge ahead on a new direction for
> > malloc.  We should accept any such patches that would:
> >
> > * Make it easier for us to maintain the malloc implementation.
> >   - Removing dead code.
> >   - Simplifying macros.
> >   - Removing features we don't use.
> >
> > * Add new and required features.
> >   - Add support for "have but don't need" kernel pages via vrange.
> >
> > * Fix bugs.
> >
> > We should accept these patches without the restrictions we previously
> > placed on the malloc implementation. We should use GNU formatting to
> > make the code match all of our other styles. In effect we should
> > own the copy of ptmalloc2 we started with and change it into the
> > implementation that we think our users need.
> 
> I think this is a sound approach. As far as I can tell the glibc
> malloc has diverged significantly from ptmalloc2 to the point of
> making changes hard to merge and we shouldn't let this stop us from
> making incremental improvements to what we have. Also the glibc code
> is by definition much more widely tested than any specific version of
> ptmalloc or dlmalloc I think.
> 
> There is ptmalloc3 but I suspect the upheaval of merging that would be
> equivalent to switching to any other 3rd party allocator and there are
> much more advanced allocators available these days.
>
There have been proposals to use jemalloc, the Threading Building Blocks
malloc, and ptmalloc3; I would add tcmalloc modified to return memory to
the system. Others?
 
There is also the possibility that we do not need to adopt a whole
allocator, but can instead take ideas from these and make parts of our
allocator more effective.

> > I even encourage the discussion of providing alternate allocators
> > like jemalloc.
> 
> I have been looking into this and also developing my own allocator
> code. Unfortunately it is still at an early stage and not
> fast/featureful/stable enough yet. jemalloc is a really good allocator
> and provides excellent performance and profiling features, I had
> assumed the licensing would be an issue for integration into glibc
> though?
> 
Carlos, where do we stand with licensing?

> One of the things I really need to get into shape is some kind of
> benchmarking for malloc, including some simple micro-benchmarks and
> some applications that can be used to test allocator performance. I
> would not like to embark on integrating a new allocator without some
> really solid benchmarks in place.
> 
The problem with microbenchmarks is that you risk repeating the errors of
allocator research in the 1960s, where allocators were tuned against
microbenchmarks that did not match reality, and real performance suffered.

When reading papers about allocators you must check whether their
methodology makes sense. Take this one, which measures fragmentation:

ftp://ftp.cs.utexas.edu/pub/garbage/malloc/ismm98.ps

This paper looks good in principle: they collect allocation/free traces
and replay them. Still, there is a flaw: they try to eliminate allocator
overhead by calling malloc (size - overhead).

This makes the allocator behave differently than in the original program
and could theoretically change the result. A proper way would be to add a
counter that tracks the allocator's overhead and subtract that in the
calculations.
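
Here is a rough sketch of what I mean (traced_malloc and total_overhead
are just illustrative names, not part of any existing tool); it assumes
glibc's malloc_usable_size () is available:

#include <malloc.h>
#include <stdlib.h>

static size_t total_overhead;

void *
traced_malloc (size_t size)
{
  void *p = malloc (size);
  if (p != NULL)
    /* malloc_usable_size reports the bytes actually handed back, so
       the difference is the internal fragmentation for this request.
       A trace replayer can subtract total_overhead afterwards instead
       of distorting the request sizes themselves.  Per-chunk header
       costs would need a constant added per live allocation.  */
    total_overhead += malloc_usable_size (p) - size;
  return p;
}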


For benchmarking malloc you need to measure the running time and total
space usage of real programs to get reliable results; microbenchmarks
make sense only when you are sure that your optimization does not change
factors that they cannot measure.

The tricky part about malloc is that performance depends on caches. For
example, suppose your application repeatedly frees and allocates a big
array and does some computation on it:

#include <stdlib.h>

volatile char *x;

void
foo (void)
{
  free ((void *) x);      /* cast drops the volatile qualifier */
  x = calloc (30000, 1);  /* calloc takes a count and a size */
}

Now if the allocator can choose between 10 possible locations, the best
case is to use LIFO order, which keeps that memory hot. The worst case is
FIFO best fit, where each allocation returns an area that is not in the
L1 cache, and to bring it in we must evict the data from the previous
allocation.
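
Here is a rough sketch of why LIFO reuse helps (illustrative code, not
glibc's actual implementation): a push/pop free list hands out the most
recently freed block, which is the one most likely still in L1.

struct freed { struct freed *next; };
static struct freed *top;       /* LIFO free list */

static void
lifo_free (void *p)
{
  /* Assumes the block is large enough to hold a pointer.  */
  struct freed *f = p;
  f->next = top;
  top = f;
}

static void *
lifo_alloc (void)
{
  struct freed *f = top;        /* most recently freed: cache-hot */
  if (f != NULL)
    top = f->next;
  return f;
}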

I am now experimenting with the observation that best fit does not have
to be the best strategy: we first look among recently freed chunks and
take one if it is within 10% of the best fit, and similar heuristics.

Contrary to the ptmalloc documentation, which says we use best-fit FIFO,
this does not cause much harm. Small chunks are placed on a stack (LIFO),
and picking first from the unsorted bin also helps.
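
A rough sketch of the heuristic I am playing with; chunk, recently_freed
and pick_chunk are hypothetical names for illustration, not glibc
internals:

#include <stddef.h>

struct chunk { size_t size; struct chunk *next; };

/* Prefer a recently freed (cache-hot) chunk over the globally best fit
   when its size is within 10% of the best fit's size.  */
static struct chunk *
pick_chunk (struct chunk *recently_freed, struct chunk *best_fit,
            size_t request)
{
  for (struct chunk *c = recently_freed; c != NULL; c = c->next)
    if (c->size >= request
        && c->size <= best_fit->size + best_fit->size / 10)
      return c;                 /* good enough, and likely still in cache */
  return best_fit;
}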

Then when you introduce threads this gets more complicated. You could
also have the problem that the cache layout differs across cores.

For example, suppose you have two threads that allocate and frequently
update 32-byte objects. When a cache line contains objects from both
threads, there will be a lot of unnecessary ping-pong to keep the caches
coherent.
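
To illustrate, assuming 64-byte cache lines (worker and pair are just
illustrative names): two 32-byte objects share one line, and each
thread's writes invalidate the other core's copy even though the threads
share no data.

#include <pthread.h>

struct obj { char payload[32]; };

static struct obj pair[2];      /* adjacent: likely one 64-byte line */

static void *
worker (void *arg)
{
  struct obj *mine = arg;
  for (long i = 0; i < 100000000; i++)
    mine->payload[0]++;         /* each write steals the line back */
  return NULL;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, worker, &pair[0]);
  pthread_create (&t2, NULL, worker, &pair[1]);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  return 0;
}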

This matters less for larger allocations, as the L3 cache is typically
shared across cores.

The current implementation also mostly avoids this by giving threads
separate arenas, which minimizes false sharing.

