This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
- From: Mel Gorman <mgorman at suse dot de>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: Rik van Riel <riel at redhat dot com>, KOSAKI Motohiro <kosaki dot motohiro at gmail dot com>, Konstantin Serebryany <kcc at google dot com>, Minchan Kim <minchan dot kim at gmail dot com>, libc-alpha at sourceware dot org
- Date: Tue, 10 Feb 2015 20:24:48 +0000
- Subject: Re: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
- References: <20150209140608 dot GD2395 at suse dot de> <54D91E06 dot 7060603 at redhat dot com> <20150209224947 dot GA21275 at suse dot de> <54DA25D0 dot 3050501 at redhat dot com> <20150210165414 dot GF2395 at suse dot de> <54DA5599 dot 2070804 at redhat dot com>
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote:
> >> Thus, instead of adding more complex heuristics to glibc's malloc I think
> >
> > Is it really that complex? It's a basic heuristic comparing two counters
> > that affects the alloc and free slow paths. The checks are negligible in
> > comparison to the necessary work if glibc calls into the kernel.
>
> It is complex. It's a time-based integral value that is going to
> complicate debugging, reproducers, and users will need to be educated
> as to how it works. We will likely also want a tunable for it such that
> if it negatively impacts workloads we can disable it. The tunable, though,
> adds runtime overhead, and that's all added complexity.
>
Ok, I cannot argue with that logic.
> Please keep in mind that glibc is a general purpose allocator which aims
> to be good at the average case. Unfortunately to be honest we lack any data
> with which to measure "average" workloads, which is why glibc as a community
> is looking at whole system benchmarks to gather workload data and integrate
> it directly into our microbenchmark (see benchtests/) or a future whole system
> benchmark framework.
>
Understood. For what it's worth, the kernel memory management has had
numerous similar discussions in the past.
> So while the technical complexity of the fix is low, that's not the only
> factor I measure when looking at a fix.
>
> A high quality contribution for changes to malloc should IMO include a
> microbenchmark or two to flesh out the matching workload against which the
> change was tested (see benchtests/README).
>
Ok. In this case, a benchmark exists but I would be hesitant to include it
in benchtests without better evidence that a number of workloads actually
suffer this problem. mariadb hits this during database initialisation but
it's negligible in comparison to the overall cost of that operation.
> >> we should be testing the addition of MADV_FREE in glibc's malloc and then
> >> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
> >> the outcome.
> >>
> >
> > They are orthogonal to each other and both can exist side by side. MADV_FREE
> > is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
> > cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
> > be merged in the current merge window. The patches are sitting in Andrew
> > Morton's mmotm tree but he has not sent a merge request yet.
>
> They are orthogonal in a technical sense, but they solve the same problem:
> Make glibc's malloc faster for workload X.
>
> My opinion is that MADV_FREE is a technically superior solution to rate
> limiting madvise() calls from malloc. Rate limiting or dynamic heuristics
> have been difficult to tune and support over the last decade, and I am
> very hesitant to accept more without serious discussion, justification,
> and data.
>
Understood.
> >> In the meantime we can talk about mitigating the problems in glibc's
> >> allocator for systems with old kernels, but it should not be the primary
> >> solution. In glibc we would conditionalize the changes against the first
> >> kernel version that included MADV_FREE, and when the minimum supported
> >> kernel version is higher than that we would remove the code in question.
> >>
> >> My suggested next steps are:
> >>
> >> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
> >> MADV_FREE, see how that performs, and inform upstream kernel.
> >>
> >
> > The upstream kernel developers in this case won't care what the performance
> > is (the changelog is already set), although Minchan would care if there were
> > any bug reports associated with its usage. MADV_FREE is likely to be supported
> > either way and it's up to the glibc folk whether they want to support it.
>
> We do want to support it. I also believe that upstream does care.
>
If glibc hits bug reports when MADV_FREE is used, they will *definitely*
care.
> >> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
> >> measure. Before we go any further here, please increase M_TRIM_THRESHOLD
> >> in ebizzy and see if that makes a difference? It should make a difference
> >> by increasing the threshold at which we trim back to the OS, both sbrk,
> >> and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
> >> memory pressure. Changing the default though is not a trivial thing, since
> >> it could lead to immediate OOM for existing applications that run close to
> >> the limit of RAM. Discussion and analysis will be required.
> >>
> >
> > Altering trim threshold does not appear to help but that's not really
> > surprising considering that it's only applied to the main arena and not the
> > per-thread heaps. At least that is how I'm interpreting this check
>
> I agree. It looks like a bug to me. The arenas should from first principles
> all behave relatively the same, regardless of the underlying storage e.g.
> brk vs. mmap.
>
Cool. By coincidence it happens to solve the problem for this
particular workload, but the mileage will vary considerably. At least it
can be tuned around without disabling per-thread heaps entirely.
> > You may have noticed that part of the patch takes mp_.mmap_threshold into
> > account in the shrinking decision but this was the wrong choice. I should
> > have at least used the trim threshold there. I had considered just using
> > trim threshold but worried that the glibc developers would not like it
> > as it required manual tuning by the user.
>
> Please do not hesitate to ask an opinion on the list, like you are doing
> right now with your RFC. Just ask if there is a preference, choice, or
> rationale for using or not using trim_threshold.
>
> I see algorithm development going through various phases:
>
> (a) Implement algorithm, add tunables.
>
> (b) Evaluate algorithm against workloads and tweak tunables.
>
> (c) Develop dynamic tuning and test.
>
> (d) Replace tunables with dynamic tuning.
>
> With glibc's malloc we've been at (b) for years. I hope that with some
> of the tunables unification work we're doing that we can make (b),
> (c) and (d) easier for the core developers.
>
Thanks for the clarification.
> > If the trim threshold were to be used as a workaround then I think we'd need
> > at least this patch. Anyone using the ebizzy benchmark without knowing this will
> > still report a regression between distros with newer glibcs but at least
> > a google search might find this thread.
>
> The trimming threshold is another solution.
>
> Limiting the trimming should reduce the number of MADV_DONTNEED calls,
> and make all of the arenas behave more uniformly.
>
> > diff --git a/malloc/arena.c b/malloc/arena.c
> > index 886defb074a2..a78d4835a825 100644
> > --- a/malloc/arena.c
> > +++ b/malloc/arena.c
> > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
> > }
> > top_size = chunksize (top_chunk);
> > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> > - if (extra < (long) pagesz)
> > + if (extra < (long) mp_.trim_threshold)
> > return 0;
> >
> > /* Try to shrink. */
> >
>
> This patch looks correct to me. I think it's a bug that non-main arenas
> don't use the trim threshold. All arenas should behave the same way,
> and thus make them easier to debug and use.
>
> How does the patch impact your testing?
>
ebizzy gets fixed by coincidence. Another test case that has a mix of
mallocs and frees shows a reduction in the number of calls to madvise by
default, and it's possible to eliminate them entirely.
> I might try this out in Fedora Rawhide for a few months.
>
> Would you post another patch with just this change and your test
> results?
>
No problem. It'll be posted within the next few minutes. The changelog
may be considered excessive but feel free to trim it (ba dum tisch, I'm
here all week).
--
Mel Gorman
SUSE Labs