
Re: The direction of malloc?


On Wed, Dec 18, 2013 at 01:01:24AM +0100, Torvald Riegel wrote:
> On Tue, 2013-12-17 at 20:08 +0100, Ondřej Bílka wrote:
> > On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> > > > On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > > > Please explain how spinning could improve performance in single-threaded
> > > > applications.
> > > 
> > > You spoke about lockless code, so obviously concurrent code.  My comment
> > > was thus referring to concurrent code.  If you have a single-threaded
> > > program, then you can avoid synchronization, obviously (ignoring
> > > synchronization just for reentrancy...).
> > >
> > And for malloc we could use a switch variable to avoid the lock path, set
> > when pthread_create is called? For reentrancy an ordinary variable suffices.
> 
> Depending on the algorithm, even for reentrancy you might need atomic
> operations (e.g., to keep under control what the compiler does with the
> code, or using a CAS to avoid pending stores).
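
To make that concrete, here is a minimal sketch of the switch-variable
idea (the names are hypothetical, not actual glibc internals; the flag
store would have to happen before pthread_create lets the new thread
run, and both branches below just stand in for the real paths):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Flips once and stays set; the atomics mainly keep the compiler
       from caching or reordering the load.  Relaxed is enough for the
       load because the flag only goes from false to true while the
       process is still single-threaded.  */
    static atomic_bool multithreaded = false;

    /* Would be called from a pthread_create wrapper, before the new
       thread can run.  */
    static void note_thread_created (void)
    {
      atomic_store_explicit (&multithreaded, true, memory_order_release);
    }

    static void *my_alloc (size_t n)
    {
      if (!atomic_load_explicit (&multithreaded, memory_order_relaxed))
        return malloc (n);   /* stand-in for a lock-free single-thread path */
      return malloc (n);     /* stand-in for the locking path */
    }

    int main (void)
    {
      void *p = my_alloc (64);   /* takes the single-thread path */
      note_thread_created ();
      void *q = my_alloc (64);   /* takes the locking path */
      printf ("%p %p\n", p, q);
      free (p);
      free (q);
      return 0;
    }
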
> 
> > > > > 
> > > > > > Second problem is that fastbins are per-arena not per-thread which
> > > > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > > > 
> > > > > Especially on x86, atomic operations that *hit in the cache* have become
> > > > > very fast compared to their costs in the past.  I don't have current
> > > > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > > > remember 10-20.
> > > > 
> > > > A simple benchmark could check the real cost. A problem is that while for
> > > > Core 2 and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is
> > > > still 50 cycles.
> > > 
> > > Sure, this differs per architecture.  But recent Intel CPUs *are*
> > > common, so it's not quite correct to say that the latency (or
> > > throughput, depending on the algorithm) of an atomic RMW/CAS op is
> > > really the primary problem.
> > > 
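
Here is roughly the benchmark I had in mind (an x86-specific sketch
using the TSC; turbo and TSC-versus-core-clock differences make the
absolute number approximate, and it only measures the uncontended,
cache-hit case):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc */

    int main (void)
    {
      enum { ITERS = 10 * 1000 * 1000 };
      atomic_int v = 0;
      uint64_t start = __rdtsc ();
      for (int i = 0; i < ITERS; i++)
        {
          int expected = i;
          /* CAS on a line we own in cache; always succeeds.  */
          atomic_compare_exchange_strong (&v, &expected, i + 1);
        }
      uint64_t end = __rdtsc ();
      printf ("~%.1f TSC cycles per uncontended CAS\n",
              (double) (end - start) / ITERS);
      return 0;
    }
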
> > > > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > > > fastbin path, which gives a 100-cycle penalty (malloc: lock, get item from
> > > > bin, unlock; free: set the bit that fastbins are in use, put item into bin).
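
For reference, the free side amounts to an atomic flag store plus a CAS
push onto a singly-linked bin. A simplified sketch, not the actual
glibc code (which is more involved, and whose malloc-side pop must
additionally cope with the ABA problem):

    #include <stdatomic.h>
    #include <stddef.h>

    struct chunk { struct chunk *next; };

    static _Atomic (struct chunk *) bin;   /* one fastbin */
    static atomic_int have_fastchunks;     /* the "in use" bit */

    static void bin_push (struct chunk *c) /* free fast path */
    {
      atomic_store (&have_fastchunks, 1);                     /* atomic #1 */
      struct chunk *old = atomic_load (&bin);
      do
        c->next = old;                                        /* link in front */
      while (!atomic_compare_exchange_weak (&bin, &old, c));  /* atomic #2 */
    }

    int main (void)
    {
      static struct chunk c1, c2;
      bin_push (&c1);
      bin_push (&c2);   /* bin now holds c2 -> c1 */
      return 0;
    }
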
> > > 
> > > Maybe, but you still need to be careful when drawing conclusions from
> > > that because there are more performance effects in allocation than just
> > > the "cycle penalty" you mention (e.g., locality and other memory-system
> > > effects caused by differences in where data is placed).
> > >
> > It's not just that: pretty much every modern allocator uses some
> > per-thread cache, so it looks like a good idea, and oprofile also shows
> > the lock cmpxchg instruction as one of the likely culprits. When I try
> > the per-thread cache that I posted earlier on a test that does just
> > malloc and free, it nearly doubles performance.
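
The cache boils down to something like the following toy (a single size
class only, no bound on the cache size; my actual patch differs in the
details):

    #include <stddef.h>
    #include <stdlib.h>
    #include <malloc.h>   /* malloc_usable_size, glibc-specific */

    enum { CACHED_SIZE = 64 };   /* toy: one size class */

    struct free_item { struct free_item *next; };

    /* Thread-local bin: no locks and no atomics on the hot path.  */
    static __thread struct free_item *cache;

    static void *cached_malloc (size_t n)
    {
      if (n <= CACHED_SIZE)
        {
          if (cache != NULL)
            {
              struct free_item *it = cache;  /* pop, no synchronization */
              cache = it->next;
              return it;
            }
          return malloc (CACHED_SIZE);
        }
      return malloc (n);   /* large requests: normal arena path */
    }

    static void cached_free (void *p)
    {
      if (p == NULL)
        return;
      /* Any block big enough for the size class goes into the cache.  */
      if (malloc_usable_size (p) >= CACHED_SIZE)
        {
          struct free_item *it = p;          /* push, no synchronization */
          it->next = cache;
          cache = it;
        }
      else
        free (p);
    }

A malloc+free loop over this path touches no shared state at all, which
is where the doubling comes from.
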
> > 
> > There are several factors that come into play: the first is the lack of
> > locking, the second is getting the expects (branch hints) correct, the
> > third is saving a call to int_malloc. Then there are several cycles
> > saved by omitting the test for hooks and malloc_perturb.
> > 
> > A real implementation would be a bit faster still, as dynamic TLS slows
> > this test down a bit.
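
Concretely: in a preloaded or dlopened test module a plain __thread
access can go through __tls_get_addr (the general-dynamic model), while
code living in libc itself gets the cheap initial-exec model, a single
thread-pointer-relative load. For benchmarking, the cheaper model can
be forced with a GCC attribute:

    /* Force the static (initial-exec) TLS access model for the
       thread-local bin from the sketch above.  */
    static __thread struct free_item *cache
        __attribute__ ((tls_model ("initial-exec")));
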
> > 
> > Memory-system effects are not a factor here, as the allocation pattern
> > is identical (a stack in both cases).
> 
> I was referring to memory system effects when the data is actually
> accessed by an application.  It does matter on which page allocated data
> ends up (and where on a page relative to other allocations).

And as it ends up at the same virtual address, the algorithms being
identical, this does not matter here.

>  The speed
> of allocations is just one thing.  For example, just to illustrate, if
> you realloc by copying to a different arena, this can slow down programs
> due to NUMA effects if those programs expect the allocation to likely
> remain on the same page after a realloc; such effects can be much more
> costly than a somewhat slower allocation.

You cannot optimize code for the unlikely case. When memory is allocated
in thread A and reallocated in thread B there are three cases:

1) Memory is passed to thread B, which primarily accesses it.
2) Memory is shared between A and B.
3) Memory is primarily accessed by thread A.

As the effects of cases 1) and 3) are symmetrical, it suffices to estimate
which one is more likely, and case 1 seems the best candidate.

Realloc definitely does move in most cases, as a common usage pattern is
doubling the allocated size, and as we use best fit there is not enough
room to grow in place.
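
This is easy to watch (a quick demo, not a claim about every allocator
or pattern):

    #include <stdio.h>
    #include <stdlib.h>

    int main (void)
    {
      size_t sz = 64;
      void *p = malloc (sz);
      for (int i = 0; i < 10 && p != NULL; i++)
        {
          /* Simulate other live allocations sitting above p, as in a
             real program; deliberately kept alive for the demo.  */
          void *pin = malloc (32);
          (void) pin;
          void *q = realloc (p, sz *= 2);   /* the doubling pattern */
          if (q == NULL)
            break;
          printf ("%6zu bytes: %s\n", sz, q == p ? "in place" : "moved");
          p = q;
        }
      free (p);
      return 0;
    }

What it prints depends on what sits above the block in the heap, but
with other allocations interleaved the doubling steps usually have no
room to grow in place.
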

