
Re: The direction of malloc?


On Wed, Dec 18, 2013 at 03:35:53PM +0100, Torvald Riegel wrote:
> On Wed, 2013-12-18 at 15:05 +0100, Ondřej Bílka wrote:
> > On Wed, Dec 18, 2013 at 01:11:08PM +0100, Torvald Riegel wrote:
> > > > > > A real implementation will be a bit faster, as dynamic TLS slows
> > > > > > this down a bit.
> > > > > > 
> > > > > > Memory system effects are not a factor here, as the allocation
> > > > > > pattern is identical (stack in both cases).
> > > > > 
> > > > > I was referring to memory system effects when the data is actually
> > > > > accessed by an application.  It does matter on which page allocated data
> > > > > ends up (and where on a page relative to other allocations).
> > > > 
> > > > And as it ends up at the same virtual address, because the algorithms
> > > > are identical, this does not matter here.
> > > 
> > > You seemed to say that you want to move from concurrent code with
> > > synchronization to nonconcurrent code.  As far as I was able to
> > > interpret what you wrote, it seemed that you wanted to move from
> > > (potentially) concurrent allocation from fastbins(?) to strictly
> > > per-thread allocation bins.  Unless the former is concurrent and uses
> > > synchronization for no reason, it should be possible that you can have
> > > situations in which threads allocate from more areas than before.
> > >
> > Not necessarily; you could add a check whether it belongs to the thread
> > and use the standard free if not.  I wrote a bit more about that here,
> > mainly to get comments:
> > 
> > https://www.sourceware.org/ml/libc-alpha/2013-12/msg00280.html
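
Roughly like this; a minimal standalone sketch of just the free path, not
glibc code, with made-up names (sketch_alloc, sketch_free) and a
hypothetical per-chunk owner field:

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical chunk header just for this sketch; real malloc chunks
   look different.  */
struct hdr
{
  struct hdr *next;
  pthread_t owner;
};

static __thread struct hdr *thread_bin;  /* per-thread free list */

static void *
sketch_alloc (size_t n)
{
  struct hdr *h = malloc (sizeof *h + n);
  if (h == NULL)
    return NULL;
  h->owner = pthread_self ();
  return h + 1;
}

static void
sketch_free (void *mem)
{
  if (mem == NULL)
    return;
  struct hdr *h = (struct hdr *) mem - 1;
  if (pthread_equal (h->owner, pthread_self ()))
    {
      /* Our own chunk: no locking needed, keep it in the thread bin.  */
      h->next = thread_bin;
      thread_bin = h;
    }
  else
    /* Allocated by another thread: fall back to the standard free.  */
    free (h);
}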
> > 
> > > > >  The speed
> > > > > of allocations is just one thing.  For example, just to illustrate, if
> > > > > you realloc by copying to a different arena, this can slow down programs
> > > > > due to NUMA effects if those programs expect the allocation to likely
> > > > > remain on the same page after a realloc; such effects can be much more
> > > > > costly than a somewhat slower allocation.
> > > > 
> > > > You cannot optimize code for an unlikely case.
> > > 
> > > We do NOT know what the unlikely case is, currently.  This is why I
> > > suggested starting with analyzing application workloads and access
> > > patterns, building a model of them (i.e., informal but at a level of
> > > detail that is sufficient to actually agree on a clear set of
> > > assumptions and not just handwaving), documenting it, and discussing
> > > it with the rest of the community.
> > > 
> > > > When memory is allocated
> > > > in thread A and reallocated in thread B, there are three cases:
> > > > 
> > > > 1) Memory is passed to thread B which primarily access it.
> > > > 2) Memory is shared between A and B.
> > > > 3) Memory is primarily accessed by thread A.
> > > > 
> > > > As the effects of cases 1) and 3) are symmetrical,
> > > 
> > > Yes, both can happen, and there might always be a trade-off, and however
> > > you decide, you might decrease performance in some situations.
> > > 
> > > > it suffices to estimate which
> > > > one is more likely, and case 1 seems the best candidate.
> > > 
> > > We do NOT know that.  If you do, please show the evidence.
> > > 
> > > > Realloc definitely does move the data in most cases, as a common
> > > > usage pattern is doubling the allocated size, and as we use best fit
> > > > there is not enough room to grow in place.
> > > 
> > > How do you know it's really a common usage pattern?  And, why should it
> > > not just be common but one of the most common usage patterns?  What is
> > > common?  Which applications?  And so on...
> > >
> > It is about the only way to avoid a quadratic slowdown when repeatedly reallocating.
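
(To illustrate the arithmetic: growing a buffer to N bytes in fixed steps
of 8 copies roughly 8 + 16 + 24 + ... + N bytes, about N*N/16 in total,
while doubling the size on each grow copies at most about 2*N bytes.)
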
> 
> You didn't answer my questions.  I asked, among other things, why it should be
> the most common usage pattern.  Your answer is that *if* programs are
> repeatedly allocating, then they likely want to double the allocation
> size.  So, why are repeated reallocations the most common usage pattern?
> And why do we have to optimize for the overhead of the realloc calls
> themselves compared to how other applications might be using allocated
> memory differently (e.g., compared to NUMA access overheads, or whatever
> else...)?
>
There are more ifs; one is how relevant NUMA is now.  Performance-oriented
applications use a custom allocator with appropriate hints, which is the
sanest course, as we can only make the best decision with the information
we get, not magic.

Also, you made an unsubstantiated claim that on realloc the memory is
expected to stay on the same page, which is unlikely given a best-fit
algorithm that leaves little room for growth and the possibility that the
next chunk is occupied by another allocation.

This is a more serious problem than hypothetical NUMA machines, not to
mention the possibility of unnecessary contention caused by realloc, and
the fact that removing this possibility simplifies further improvements.

We should handle real problems first, then worry about hypothetical ones.

> I appreciate that you have ideas for how to improve the allocator.  But
> at the very least, those need to be validated *thoroughly*, and that
> involves looking at all applications using the allocator, not just
> applications where the optimization might help.
> 
> I believe that once you do a thorough evaluation, you will get a lot
> more ideas of how to perhaps improve things, and probably find ideas
> that can have a higher impact on performance.  Therefore, I would
> suggest to start with the analysis first, rather than starting with a
> specific idea and going bottom-up.
> 
> > Ideally this should not be necessary, as we would preallocate twice what
> > was requested in realloc, but we do not do this yet.
> 
> I doubt that this idea is worthwhile to pursue, because you're making a
> very wide assumption here, namely that all uses of realloc would be part
> of frequent reallocation and that the space and address space overheads
> don't matter.
> 
These could be handled by writing a reallocation history at the end of the
allocated space.  It would be guarded by a sentinel, so in the rare
false-positive case the worst that could happen is allocating more than
requested.
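
A minimal standalone sketch of the idea, written here as a realloc wrapper
rather than inside malloc itself; the trailer layout, the magic value and
the grow-twice-then-double policy are made up for illustration:

#include <malloc.h>   /* malloc_usable_size */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical history record kept in the slack past the caller's bytes.  */
struct trailer
{
  uint32_t magic;
  uint32_t grows;
};
#define TRAILER_MAGIC 0x5ea1ed42u

static void
write_history (void *p, size_t bytes, uint32_t grows)
{
  size_t usable = malloc_usable_size (p);
  if (usable >= bytes + sizeof (struct trailer))
    {
      struct trailer t = { TRAILER_MAGIC, grows };
      memcpy ((char *) p + usable - sizeof t, &t, sizeof t);
    }
}

static uint32_t
read_history (void *p)
{
  size_t usable = malloc_usable_size (p);
  struct trailer t;
  if (usable < sizeof t)
    return 0;
  memcpy (&t, (char *) p + usable - sizeof t, sizeof t);
  /* A stray bit pattern equal to TRAILER_MAGIC is the rare false
     positive; the worst outcome is allocating more than requested.  */
  return t.magic == TRAILER_MAGIC ? t.grows : 0;
}

/* Once a block has grown a couple of times in a row, over-allocate so
   later grows can stay in place.  */
static void *
history_realloc (void *old, size_t bytes)
{
  uint32_t grows = old != NULL ? read_history (old) + 1 : 0;
  size_t request = grows >= 2 ? bytes * 2 : bytes;
  void *p = realloc (old, request);
  if (p != NULL)
    write_history (p, bytes, grows);
  return p;
}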

Anyway, unless you take some measures like these, you cannot expect good
performance.

> > This affects gcc, which repeatedly tries to
> > extend a buffer by 8.
> 
> Maybe that's inefficient on GCC's behalf (although I suppose they have
> (or had, at some point) a reason to do it this way), but that doesn't
> mean we need to optimize towards just that in glibc's general purpose
> allocator.  Do you know why GCC made this choice?
>
This can be seen with gdb:

#0  __GI___libc_realloc (oldmem=0x13c7530, bytes=176) at malloc.c:2913
#1  0x0000000000c7be7d in xrealloc ()
#2  0x000000000077ecc4 in register_one_dump_file(opt_pass*) ()
#3  0x000000000077ee06 in ?? ()
#4  0x000000000077ee16 in ?? ()
#5  0x0000000000780088 in init_optimization_passes() ()
#6  0x0000000000810fee in toplev_main(int, char**) ()
#7  0x00007ffff670a995 in __libc_start_main (main=0x509880 <main>, argc=14,
    ubp_av=0x7fffffffdae8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffdad8) at libc-start.c:260
#8  0x0000000000509aef in _start ()

Here it is a subcritical case, as the slowdown is not that big.  It makes
sense to do an optimization in glibc that improves all programs that use a
similar pattern.
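
For reference, a backtrace like the one above can be collected along these
lines (foo.c stands for whatever input file; the exact compiler invocation
does not matter much):

$ gdb --args gcc -O2 -c foo.c
(gdb) break __libc_realloc
(gdb) run
(gdb) backtrace

and then continue/backtrace a few times to see the successive small
reallocations.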
 
> > To test these, use the following program.
> 
> Why don't you build infrastructure to get similar traces for this and
> similar relevant behavior of applications, and ways for us to store and
> analyze this data with a high level of automation?  That is, collect the
> allocation patterns, and build analyses that feed into a model of how
> applications use the allocator?  Based on that, we can draw conclusions
> and evaluate changes to the allocator; and it would also allow us to
> test our assumptions in the future.

Yes, I will follow up with a dry run that does this.
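
One low-tech way to collect such traces today is glibc's own mtrace;
something like the following, with the analysis of the resulting log left
to the dry run:

#include <mcheck.h>
#include <stdlib.h>

int
main (void)
{
  /* Logs every malloc/realloc/free call to the file named by the
     MALLOC_TRACE environment variable.  */
  mtrace ();

  char *p = malloc (64);
  p = realloc (p, 128);
  free (p);
  return 0;
}

Run with MALLOC_TRACE=alloc.log set; the raw log records each call with
caller address and size, which is what a pattern analysis would consume
(the mtrace script itself only reports leaks).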

