This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: The direction of malloc?


On Wed, 2013-12-18 at 11:11 +0100, Ondřej Bílka wrote:
> On Wed, Dec 18, 2013 at 01:01:24AM +0100, Torvald Riegel wrote:
> > On Tue, 2013-12-17 at 20:08 +0100, Ondřej Bílka wrote:
> > > On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> > > > On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > > > > Please explain how spinning could improve performance in single-threaded
> > > > > applications.
> > > > 
> > > > You spoke about lockless code, so obviously concurrent code.  My comment
> > > > was thus referring to concurrent code.  If you have a single-threaded
> > > > program, then you can avoid synchronization, obviously (ignoring
> > > > synchronization just for reentrancy...).
> > > >
> > > And for malloc we could use a switch variable to avoid the lock path,
> > > set when pthread_create is called?  For reentrancy an ordinary variable
> > > suffices.
> > 
> > Depending on the algorithm, even for reentrancy you might need atomic
> > operations (e.g., to keep under control what the compiler does with the
> > code, or using a CAS to avoid pending stores).
> > 
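(For illustration, a minimal sketch of the single-thread shortcut being
discussed; the flag and the two helpers are hypothetical names, not
glibc's actual code:)

  #include <stddef.h>

  extern void *malloc_unlocked (size_t);  /* hypothetical fast path */
  extern void *malloc_locked (size_t);    /* hypothetical locked path */

  /* Set once, e.g. from pthread_create, when the process becomes
     multi-threaded; never reset.  */
  static int multi_threaded;

  void *
  malloc (size_t size)
  {
    if (!multi_threaded)
      /* Single-threaded: plain loads and stores, no lock, no atomics.  */
      return malloc_unlocked (size);
    /* The usual locked (or atomic) arena path.  */
    return malloc_locked (size);
  }

(Per the caveat above, even the read of the flag may need to be an
explicit atomic load so the compiler cannot reorder or duplicate it.)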
> > > > > > 
> > > > > > > Second problem is that fastbins are per-arena, not per-thread, which
> > > > > > > forces us to use atomic operations.  These are expensive (typically
> > > > > > > more than 50 cycles).
> > > > > > 
> > > > > > Especially on x86, atomic operations that *hit in the cache* have become
> > > > > > very fast compared to their costs in the past.  I don't have current
> > > > > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > > > > remember 10-20.
> > > > > 
> > > > > A simple benchmark could check the real cost.  A problem is that while
> > > > > the cost of CAS is around 20 cycles for Core 2 and i7, for Bulldozer
> > > > > it's still 50 cycles.
> > > > 
> > > > Sure, this differs per architecture.  But recent Intel CPUs *are*
> > > > common, so it's not quite correct to say that the latency (or
> > > > throughput, depending on the algorithm) of an atomic RMW/CAS op is
> > > > really the primary problem.
> > > > 
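(Such a check is easy to do; a minimal benchmark sketch, timing a
cache-hot CAS loop and reporting nanoseconds per operation -- multiply
by the clock frequency to get cycles:)

  #include <stdio.h>
  #include <time.h>

  int
  main (void)
  {
    long x = 0;
    long iters = 100000000;
    struct timespec a, b;
    clock_gettime (CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; i++)
      {
        long expected = x;
        /* Compiles to lock cmpxchg on x86; the line always hits in L1
           here, so this measures the best case only.  */
        __atomic_compare_exchange_n (&x, &expected, expected + 1, 0,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
      }
    clock_gettime (CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf ("%.1f ns per CAS\n", ns / iters);
    return 0;
  }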
> > > > > Even with these, a malloc+free pair contains 5 atomic instructions on
> > > > > the fastbin path, which gives a 100-cycle penalty (malloc: lock, get
> > > > > item from bin, unlock; free: set bit that fastbins are in use, put
> > > > > item to bin).
> > > > 
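(For reference, the fastbin push/pop being counted is essentially a
CAS-based LIFO list; a rough sketch of the shape, not glibc's actual
code:)

  #include <stddef.h>

  typedef struct chunk { struct chunk *next; } chunk;

  /* free side: push onto the bin; at least one atomic RMW.  */
  static void
  fastbin_push (chunk **bin, chunk *c)
  {
    chunk *head = __atomic_load_n (bin, __ATOMIC_RELAXED);
    do
      c->next = head;   /* head is refreshed by a failed CAS */
    while (!__atomic_compare_exchange_n (bin, &head, c, 0,
                                         __ATOMIC_RELEASE,
                                         __ATOMIC_RELAXED));
  }

  /* malloc side: pop from the bin; note that dereferencing head->next
     without further protection has the classic ABA problem.  */
  static chunk *
  fastbin_pop (chunk **bin)
  {
    chunk *head = __atomic_load_n (bin, __ATOMIC_ACQUIRE);
    while (head != NULL
           && !__atomic_compare_exchange_n (bin, &head, head->next, 0,
                                            __ATOMIC_ACQUIRE,
                                            __ATOMIC_ACQUIRE))
      ;
    return head;
  }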
> > > > Maybe, but you still need to be careful when drawing conclusions from
> > > > that because there are more performance effects in allocation than just
> > > > the "cycle penalty" you mention (e.g., locality and other memory-system
> > > > effects caused by differences in where data is placed).
> > > >
> > > It's not just that: pretty much every modern allocator uses some
> > > per-thread cache, so it looks like a good idea; oprofile also shows the
> > > lock cmpxchg instruction as one of the likely culprits.  When I try the
> > > per-thread cache that I posted earlier on a test that does just malloc
> > > and free, it nearly doubles performance.
> > > 
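(For concreteness, a rough sketch of the kind of per-thread cache
described above; the bin-mapping helpers are hypothetical placeholders,
and int_malloc stands for the existing locked path:)

  #include <stddef.h>

  struct chunk { struct chunk *next; };

  extern size_t size_to_bin (size_t);           /* hypothetical */
  extern size_t chunk_to_bin (struct chunk *);  /* hypothetical */
  extern void *int_malloc (size_t);             /* locked slow path */

  #define CACHE_BINS 64

  /* One LIFO per size class, private to the thread: the hit path
     needs no lock and no atomic operations.  */
  static __thread struct chunk *cache[CACHE_BINS];

  void *
  malloc (size_t size)
  {
    size_t bin = size_to_bin (size);
    struct chunk *c = cache[bin];
    if (__builtin_expect (c != NULL, 1))  /* one of the "expects" below */
      {
        cache[bin] = c->next;
        return c;
      }
    return int_malloc (size);
  }

  void
  free (void *p)
  {
    struct chunk *c = p;
    size_t bin = chunk_to_bin (c);
    c->next = cache[bin];
    cache[bin] = c;
  }

(The __thread accesses are also where the dynamic-TLS cost mentioned
below comes in.)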
> > > There are several factors that come into play: the first is the lack of
> > > locking, the second is getting the expects (branch hints) correct, the
> > > third is saving a call of int_malloc.  Then there are several cycles
> > > saved by omitting the tests for the hooks and malloc_perturb.
> > > 
> > > A real implementation will be a bit faster, as dynamic TLS slows this
> > > down a bit.
> > > 
> > > Memory-system effects are not a factor here, as the allocation pattern
> > > is identical (stack in both cases).
> > 
> > I was referring to memory system effects when the data is actually
> > accessed by an application.  It does matter on which page allocated data
> > ends up (and where on a page relative to other allocations).
> 
> And as it ends up at the same virtual address, since the algorithms are
> identical, this does not matter here.

You seemed to say that you want to move from concurrent code with
synchronization to nonconcurrent code.  As far as I was able to
interpret what you wrote, it seemed that you wanted to move from
(potentially) concurrent allocation from fastbins(?) to strictly
per-thread allocation bins.  Unless the former is concurrent and uses
synchronization for no reason, there can be situations in which threads
allocate from more areas than before.

> >  The speed
> > of allocations is just one thing.  For example, just to illustrate, if
> > you realloc by copying to a different arena, this can slow down programs
> > due to NUMA effects if those programs expect the allocation to likely
> > remain on the same page after a realloc; such effects can be much more
> > costly than a somewhat slower allocation.
> 
> You cannot optimize code for the unlikely case.

We do NOT know what the unlikely case is, currently.  This is why I
suggested starting with analyzing application workloads and access
patterns, building a model of them (i.e., informal, but at a level of
detail that is sufficient to actually agree on a clear set of
assumptions and not just handwaving), documenting the model, and
discussing it with the rest of the community.

> When memory is allocated
> in thread A and reallocated in thread B, there could be three cases:
> 
> 1) Memory is passed to thread B which primarily access it.
> 2) Memory is shared between A and B.
> 3) Memory is primarily accessed by thread A.
> 
> As the effect of cases 1) and 3) is symmetrical,

Yes, both can happen, and there may always be a trade-off; however you
decide, you might decrease performance in some situations.

> it suffices to estimate which
> one is more likely, and case 1 seems the best candidate.

We do NOT know that.  If you do, please show the evidence.

> Realloc definitely does move the block in most cases, as a common usage
> pattern is doubling the allocated size, and as we use best fit there is
> not enough room.

How do you know it's really a common usage pattern?  And why should it
be not just common but one of the most common usage patterns?  What is
common?  Which applications?  And so on...
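(For reference, the pattern referred to above is the usual
geometric-growth idiom; a sketch with hypothetical input helpers, error
checks omitted:)

  #include <stdlib.h>

  extern int have_input (void);   /* hypothetical input source */
  extern char next_byte (void);   /* hypothetical */

  char *
  read_all (size_t *out_len)
  {
    size_t len = 0, cap = 16;
    char *buf = malloc (cap);
    while (have_input ())
      {
        if (len == cap)
          {
            /* Double on each growth step; under best fit the block
               rarely has room to grow in place, so it usually moves.  */
            cap *= 2;
            buf = realloc (buf, cap);
          }
        buf[len++] = next_byte ();
      }
    *out_len = len;
    return buf;
  }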

The realloc case is just one example, and my point is not to be just
picky about this one issue.  Instead, this is meant to show that, IMO,
you need more preparation and analysis before you can make meaningful
changes to the allocator that the community can be confident about.

For example, you make assumptions without even spelling them out
explicitly when proposing the change.  But we need our assumptions to be
explicit, or they will get lost over time.  Also, it would be better to
get evidence that making a certain assumption is okay for our targeted
set of sample applications, and a test that confirms this assumption for
an application -- but for that, we need to decide which sample
applications represent our target workload.
Those assumptions need to be shared, and agreed on in the community.  If
we don't do that, this is just handwaving and will be forgotten in a few
years; then, somebody else comes along, doesn't see the assumptions you
made and starts with different ones, and removes your code.

IOW, if you want to do allocator research, then please do the research
in a proper, thorough way.  This includes showing the community that
your reasoning and assumptions are thorough; otherwise, the community
can't gain confidence that it makes sense to apply your changes.  The
same holds for arguing with users, especially for something like
allocation, which is largely about making trade-offs.

Unlike one-off research projects, we need to make sure that our
implementation stays good over time.  Thus, we need to regression-test
our assumptions.  If you've read the allocation literature, you'll have
seen different ways to analyze allocation patterns (e.g., allocation
traces, memory-access traces, or stats).  What about starting (or
continuing) at this stage, and building something in this space that's
practical for glibc?
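(As a starting point, allocation traces do not require allocator
changes at all; a minimal LD_PRELOAD interposer sketch, with the usual
caveats that dlsym itself may allocate during bootstrap and that the
recursion guard below is needed because fprintf may allocate too:)

  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>
  #include <stdlib.h>

  static void *(*real_malloc) (size_t);
  static __thread int in_trace;

  void *
  malloc (size_t size)
  {
    if (real_malloc == NULL)
      real_malloc = (void *(*) (size_t)) dlsym (RTLD_NEXT, "malloc");
    void *p = real_malloc (size);
    /* The guard keeps the tracer's own output from recursing back
       into this malloc.  */
    if (!in_trace)
      {
        in_trace = 1;
        fprintf (stderr, "malloc %zu -> %p\n", size, p);
        in_trace = 0;
      }
    return p;
  }

Build with 'gcc -shared -fPIC trace.c -o trace.so -ldl' and run a
workload under 'LD_PRELOAD=./trace.so ./app' to collect a trace.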

