


Re: The direction of malloc?


On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> On Mon, Dec 16, 2013 at 05:59:00PM +0100, Torvald Riegel wrote:
> > On Tue, 2013-12-10 at 22:05 +0100, Ondřej Bílka wrote:
> > > We have fastbins that sorta do this, but with several problems.
> > > 1. They are not really lockless; malloc still needs a lock, and
> > > only freeing will be lockless once bug 15073 gets fixed.
> > 
> > Note that *properly tuned* locks can be as fast as lock-less code in
> > several cases.  When you use a, say, stock PTHREAD_MUTEX_NORMAL lock,
> > this won't be well-suited to short critical sections because it doesn't
> > spin.  PTHREAD_MUTEX_ADAPTIVE_NP spins, but the number of spinning
> > iterations seems to be an arbitrary choice, and isn't very useful in my
> > experiments so far.  Also, there's no back-off whatsoever in the
> > spinning.
> >
> As Will said, acquiring a lock is a hot path in single-threaded
> applications.

You said you wanted lock-less code.  You didn't say you wanted
nonconcurrent code.  Thus there is concurrency, and you need to
synchronize in some way.  If that involves modifications, you typically
need a CAS or some other RMW op, independent of whether you use a lock
or not.  Thus, a lock can need as many atomic RMW/CAS ops as other
concurrent code, or fewer.  That's what I pointed out.
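
To make "properly tuned" concrete, here is a minimal sketch using C11
atomics.  The spin limit and back-off constants are made-up tuning
knobs (not glibc's adaptive-mutex values), and a real mutex would
futex-wait rather than yield; the point is that the uncontended
acquisition is a single atomic RMW, just like a lock-free push or pop
would need:

#include <stdatomic.h>
#include <sched.h>

/* Test-and-test-and-set lock with bounded spinning and exponential
   back-off.  MAX_SPINS and the back-off cap are made-up tuning knobs,
   not values from glibc.  */
#define MAX_SPINS 100

typedef struct { atomic_int locked; } spinlock_t;

static inline void
cpu_relax (void)
{
  __asm__ __volatile__ ("pause");  /* x86 spin hint.  */
}

static void
spinlock_acquire (spinlock_t *l)
{
  /* Uncontended fast path: one atomic RMW, the same as a lock-free
     push/pop would need.  */
  while (atomic_exchange_explicit (&l->locked, 1, memory_order_acquire))
    {
      /* Contended: spin on plain loads (no cache-line ping-pong),
         with exponential back-off, and eventually yield.  A real
         mutex would futex-wait instead of yielding.  */
      int spins = 0, backoff = 1;
      while (atomic_load_explicit (&l->locked, memory_order_relaxed))
        {
          for (int i = 0; i < backoff; i++)
            cpu_relax ();
          if (backoff < 64)
            backoff *= 2;
          if (++spins > MAX_SPINS)
            {
              sched_yield ();
              spins = 0;
            }
        }
    }
}

static void
spinlock_release (spinlock_t *l)
{
  atomic_store_explicit (&l->locked, 0, memory_order_release);
}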

> Please explain how spinning could improve performance in single thread
> applications.

You spoke about lockless code, so obviously concurrent code.  My comment
was thus referring to concurrent code.  If you have a single-threaded
program, then you can avoid synchronization, obviously (ignoring
synchronization just for reentrancy...).

> > 
> > > The second problem is that fastbins are per-arena, not per-thread,
> > > which forces us to use atomic operations.  These are expensive
> > > (typically more than 50 cycles).
> > 
> > Especially on x86, atomic operations that *hit in the cache* have become
> > very fast compared to their costs in the past.  I don't have current
> > numbers, but I believe the 50 cycle number is too high; I vaguely
> > remember 10-20.
> 
> A simple benchmark could check the real cost.  A problem is that while
> for Core 2 and i7 the cost of a CAS is around 20 cycles, for Bulldozer
> it's still 50 cycles.

Sure, this differs per architecture.  But recent Intel CPUs *are*
common, so it's not quite correct to say that the latency (or
throughput, depending on the algorithm) of an atomic RMW/CAS op is
really the primary problem.
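
For reference, a sketch of such a benchmark (cache-hot and uncontended,
x86-only because of __rdtsc; TSC deltas are only indicative given
out-of-order execution and frequency scaling):

#include <stdatomic.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc; x86-only.  */

/* Measures the cache-hot, uncontended cost of a successful CAS by
   chaining them on a single location.  This is the best case, not
   the contended cost.  */
int
main (void)
{
  enum { ITERS = 10 * 1000 * 1000 };
  _Atomic long var = 0;

  unsigned long long start = __rdtsc ();
  for (long i = 0; i < ITERS; i++)
    {
      long expected = i;
      /* Always succeeds: var == i on iteration i.  */
      atomic_compare_exchange_strong (&var, &expected, i + 1);
    }
  unsigned long long end = __rdtsc ();

  printf ("~%.1f reference cycles per CAS\n",
          (double) (end - start) / ITERS);
  return 0;
}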

> Even with these, a malloc+free pair contains 5 atomic instructions on
> the fastbin path, which gives a 100-cycle penalty (malloc: lock, get
> item from bin, unlock; free: set the bit that fastbins are in use, put
> the item back into the bin).
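
For concreteness, the sequence you describe maps roughly onto something
like this; the types, the single bin, and the flag name are
illustrative, not glibc's actual code or data layout:

#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified sketch of the five atomic operations listed above.  */

#define HAVE_FASTCHUNKS 1

typedef struct chunk { struct chunk *next; } chunk_t;

typedef struct
{
  pthread_mutex_t mutex;
  atomic_int flags;
  _Atomic (chunk_t *) fastbin;  /* One bin, for brevity.  */
} arena_t;

static void *
fastbin_malloc (arena_t *ar)
{
  pthread_mutex_lock (&ar->mutex);          /* 1: atomic RMW.  */
  chunk_t *head = atomic_load (&ar->fastbin);
  /* 2: CAS needed even under the lock, because free pushes without
     taking it.  A failed CAS reloads head.  */
  while (head != NULL
         && !atomic_compare_exchange_weak (&ar->fastbin, &head,
                                           head->next))
    ;
  pthread_mutex_unlock (&ar->mutex);        /* 3: atomic RMW/store.  */
  return head;
}

static void
fastbin_free (arena_t *ar, chunk_t *p)
{
  atomic_fetch_or (&ar->flags, HAVE_FASTCHUNKS);  /* 4: mark in use.  */
  chunk_t *head = atomic_load (&ar->fastbin);
  do
    p->next = head;                         /* Head reloaded on failure.  */
  while (!atomic_compare_exchange_weak (&ar->fastbin, &head, p));  /* 5.  */
}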

Maybe, but you still need to be careful when drawing conclusions from
that because there are more performance effects in allocation than just
the "cycle penalty" you mention (e.g., locality and other memory-system
effects caused by differences in where data is placed).

BTW, in general, you can't just add "cycles" like that unless you have a
very simple CPU: on an out-of-order core, the latency of an atomic op
can overlap with surrounding work, so per-instruction costs don't simply
sum up.

