This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: malloc: performance improvements and bugfixes


On Wed, 2016-01-27 at 15:50 -0800, Jörn Engel wrote:
> On Wed, Jan 27, 2016 at 10:37:44PM +0100, Torvald Riegel wrote:
> > 
> > Second, you can't expect this community to look at code dumps that don't
> > follow the community's rules for contributions, for obvious reasons.
> 
> That is ok.  I happen to have made improvements to malloc.  Those
> improvements might help others as well.  If they do, they are now
> publically available.  If you don't want them, that is fine.  And it
> really doesn't matter why you don't want them.
> 
> I also don't expect you to improve libc malloc for my purposes.  I have
> a perfectly fine fork and can use that forever.  There are several other
> fine allocators available as well.
> 
> If someone wants to improve libc malloc, I can help.  Whether anyone
> working on libc malloc actually wants my help is doubtful - we don't
> seem to see eye-to-eye on very many things.  But I can help.

Thanks for the offer.  I don't think anyone in the glibc community would
turn down help; we do appreciate contributions, including those from
outside the project.  Nonetheless, the help we can use needs to work for
us, both in terms of processes (e.g., the legal rules we have for code
contributions) and in how we make technical decisions.  Regarding the
latter, we've been working hard on improving the code base and making
technical decisions more transparent (e.g., better documentation, a
clearer consensus/decision process, and microbenchmarks so we can track
performance).  This may mean we move a little more slowly, but it's
necessary to ensure that future glibc maintainers still know how
everything is supposed to work and why it was built that way.

So, please don't be put off if we try to be thorough when making
technical decisions such as changing something in malloc.  It's not that
we're arguing against change; we just want to make changes that are
backed by clear reasons and in which we are confident.

> > Third, without discussing workload patterns, it's not clear whether your
> > contributions actually improve the situation because we don't know what
> > "the situation" is precisely (as it depends on your use cases).  (Some
> > changes may be generally good, but I suspect that most affect some
> > trade-off (e.g., see the discussion with Szabolcs).
> 
> You don't have to improve my situation.  I would be happy if you improve
> the situation for any open-source program that currently used jemalloc
> or tcmalloc.  Target mariadb, if you like.

I didn't mean to single out your program.  The same would apply if we
had to improve a hypothetical "mariadb situation": we'd still have to
understand why they chose a different allocator, and whether any change
is overall a good thing.  That would require understanding and
discussing workload patterns, too.

> > > The kernel's auto-NUMA falls apart as soon as either the memory or the
> > > thread moves to the wrong node.  That happens all the time.  And the
> > > kernel will only give you NUMA-local memory once, at allocation time.
> > > If the application holds on to that memory for days or years, such
> > > one-time decisions don't matter.
> > 
> > My recollection is that newer auto-NUMA (or a daemon -- I don't remember
> > the names precisely) will indeed move pages to the nodes they are
> > accessed from.  Either way, if the page or thread moves to a place
> > further away in the memory hierarchy, there's little the allocator can
> > do; what it can do is try to make that less likely, for example by
> > getting the initial allocation right (ie, allocate in a page that's
> > local at allocation time) -- but this again depends on the allocation
> > and usage patterns in the program to some extent.  Hence my question
> > about these in your program.
> 
> Interesting.  Moving pages around shouldn't cause too much latency and
> can fix long-term problems.  But if you move pages back and forth too
> often, there are almost certainly better solutions to the problem.  Even
> doing nothing is likely better.

Sure, there are trade-offs, but it's another piece of the puzzle we need
to consider.  It also won't help with short-lived allocations that see
plenty of remote accesses.

> > > That is why my code calls getcpu() to see which NUMA node we are on
> > > right now and then returns memory from a NUMA-local arena.  The kernel
> > > might migrate the thread between getcpu() and memory access, but at
> > > least you get it right 99% of the time instead of 50% (for two nodes).
> > > 
> > > So if you want a model for this, how about:
> > > - multiple threads,
> > > - no thread affinities,
> > > - memory moves between threads,
> > > - application runs long enough to make kernel's decision irrelevant.
> > 
> > That's less than the level of detail we need.  It's obvious we have to
> > consider multiple threads, but how many exactly?
> > Not selecting thread affinities is a practical choice, though if your
> > program owns the machine as you indicated earlier, pinning threads to
> > CPUs isn't an unusual thing to do.
> > I'm not sure what you mean by "memory moves between threads".
> 
> I mean that threads pass memory allocated by malloc between each other.
> The memory is freed on a different thread than it was allocated on.

I would distinguish two things here: (1) whether the memory is released
by a thread on a different node, and (2) whether most of the accesses to
the allocation come from a node other than the one where the allocation
happened.  Regarding NUMA effects, I'd be more concerned about (2) than
about (1); synchronization overheads can be higher in case (1), but it
involves fewer memory accesses.
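For readers following the thread, here is a rough sketch of the
getcpu()-based arena selection described in the quoted text above.  The
names (MAX_NODES, struct arena, arena_for_current_node) are made up for
illustration; this is neither the posted patch nor glibc's arena code:

#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_NODES 8

/* Hypothetical per-node arenas; a real allocator would keep per-node
   free lists, locks, and statistics in here.  */
static struct arena
{
  int node;
} arenas[MAX_NODES];

/* Return the arena for the NUMA node the calling thread runs on right
   now.  The kernel may migrate the thread immediately afterwards, so
   this is only a best-effort hint, as discussed above.  */
static struct arena *
arena_for_current_node (void)
{
  unsigned int cpu = 0, node = 0;

  if (syscall (SYS_getcpu, &cpu, &node, NULL) != 0)
    node = 0;                   /* Fall back to node 0 on error.  */
  return &arenas[node % MAX_NODES];
}

int
main (void)
{
  struct arena *a = arena_for_current_node ();
  printf ("allocating from arena %td\n", a - arenas);
  return 0;
}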

> > You also include a fork torture test.  Is fork really called frequently
> > in your big application?
> 
> Not by a long shot.  But I was concerned about introducing a bug
> somewhere in the fork handling, so I created a stresstest.

Thanks for the clarification.  I thought all of them were meant to be
performance microbenchmarks.
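For what it's worth, a stress test of that kind might look roughly like
the sketch below (this is not the test from the posted patch): a few
threads hammer malloc/free while the main thread repeatedly forks
children that allocate and exit, which is what exercises malloc's fork
handlers.  Compile with -pthread.

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker threads keep the allocator busy so fork() is likely to hit a
   moment when malloc's internal locks are held.  */
static void *
churn (void *arg)
{
  size_t n = 0;

  (void) arg;
  for (;;)
    {
      void *p = malloc (1 + (n++ % 4096));
      free (p);
    }
  return NULL;
}

int
main (void)
{
  pthread_t th[4];

  for (int i = 0; i < 4; i++)
    pthread_create (&th[i], NULL, churn, NULL);

  /* Fork repeatedly while the workers run.  If the fork handlers are
     broken, the child typically deadlocks on an allocator lock when it
     calls malloc below.  */
  for (int i = 0; i < 1000; i++)
    {
      pid_t pid = fork ();
      if (pid < 0)
        break;
      if (pid == 0)
        {
          void *p = malloc (64);
          free (p);
          _exit (0);
        }
      waitpid (pid, NULL, 0);
    }
  return 0;
}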

> > > So if you genuinely want to help, maybe the best thing would be to
> > > extract the testcases from all these projects, create a new "malloctest"
> > > or whatever and make it easy to evaluate and compare allocators with
> > > your test.  Pass/fail would be great, benchmarks would be even better.
> > > Then shame the guilty parties into improvements.
> > 
> > Something similar is what we would suggest to users: If they believe
> > glibc has a performance problem, and they want to help fix it, they
> > should describe the problem and workload in sufficient detail, and add
> > performance tests, microbenchmarks, or similar that allow glibc folks
> > to (1) classify the performance problem, (2) have a measurement that is
> > considered to be a good indicator of the problem.  If we then can agree
> > that the performance problem is likely relevant for real-world use
> > cases, we have an actionable task for improving this (because it's clear
> > what the goal is, whether it can be measured, and it can be
> > regression-tested in the future).  Note that we'll also have to
> > trade off against other uses if general-purpose code is affected, given
> > that we have to serve all users.
> 
> For whatever reasons, people don't seem to do that.  But maybe you can
> find out why the mariadb people moved to jemalloc.
> https://mariadb.org/mariadb-5-5-33-now-available/
> 
> Or you could ask this person.
> https://www.percona.com/blog/2013/03/08/mysql-performance-impact-of-memory-allocators-part-2/

Thanks for the pointers.
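To illustrate the kind of small, self-contained measurement suggested in
the quoted paragraph (a standalone sketch, not glibc's benchtests
harness): even timing a simple malloc/free loop over a fixed working set
gives a number that can be attached to a report and argued about.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS  1000000L
#define NSLOTS 256

int
main (void)
{
  void *slots[NSLOTS] = { NULL };
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);

  /* Mixed malloc/free over a small working set; a real benchmark
     would model the reporter's actual size distribution, lifetimes,
     and thread count.  */
  for (long i = 0; i < ITERS; i++)
    {
      size_t idx = (size_t) i % NSLOTS;
      free (slots[idx]);
      slots[idx] = malloc (16 + (size_t) (i % 512));
    }

  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns per malloc/free pair\n", ns / ITERS);

  for (size_t i = 0; i < NSLOTS; i++)
    free (slots[i]);
  return 0;
}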

> Or you can grep every package that your distribution carries for one of
> the alternative allocators and ask the package maintainers.  They should
> be more willing to discuss the finer details, as their applications are
> all open source and the question of keeping details secret never
> matters.
> 
> I do not know how much of our implementation details we are willing to
> give out.  But being conservative and giving out as little as possible
> seems like a good first approach.  That's why I try to direct you
> elsewhere.  I hope you understand that.

I can understand that.  I would not expect an allocation strategy to be
sensitive information (e.g., if it is indeed a deficiency of the
allocator that many programs hit), but that is certainly your decision
to make.  If you find information that you are comfortable releasing,
please get back to us.

