This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Add malloc micro benchmark


DJ Delorie wrote:
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > Since DJ didn't seem keen on increasing the tcache size despite it
> > showing major gains across a wide range of benchmarks,
>
> It's not that I'm not keen on increasing the size, it's that there are
> drawbacks to doing so and I don't want to base such a change on a guess
> (even a good guess).  If you have benchmarks, let's collect them and add
> them to the trace corpus.  I can send you my corpus.  (We don't have a
> good solution for centrally storing such a corpus, yet) Let's run all
> the tests against all the options and make an informed decision, that's
> all.  If it shows gains for synthetic benchmarks, but makes qemu slower,
> we need to know that.

Yes, I'd be interested in the traces. I presume they are ISA-independent and
can simply be replayed?

> Also, as Carlos noted, there are some downstream uses where a larger
> cache may be detrimental.  Sometimes there are no universally "better"
> defaults, and we provide tunables for those cases.

It depends. I've seen cases where returning pages to the OS too quickly
causes a huge performance loss. I think in many of these cases we can
be far smarter and use adaptive algorithms. If, say, 50% of your memory
ends up in the tcache and you can't allocate a new block, it seems a good
idea to consolidate first. If it's less than 1%, why worry about it?

So in the short term there may be simple ways to tune tcache, e.g. allowing a
larger number of small blocks (a trivial change), or limiting the total bytes
held in the tcache (a limit that could be raised dynamically as more memory
is allocated).
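The first of these can already be experimented with via the existing malloc tunables (tcache_count and tcache_max are real glibc tunables; the workload name below is just a placeholder):

```shell
# Cache up to 32 entries per tcache bin instead of the default 7;
# ./your_workload stands in for the program under test.
GLIBC_TUNABLES=glibc.malloc.tcache_count=32 ./your_workload

# Or cap the largest request size the tcache will hold, here 2 KiB:
GLIBC_TUNABLES=glibc.malloc.tcache_max=2048 ./your_workload
```

A total-bytes cap, by contrast, would need a new tunable.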

Longer term we need to make arenas per-thread - see below.

> Again, tcache is intended to help the multi-threaded case.  Your patches
> help the single-threaded case.  If you recall, I ran your patch against
> my corpus of multi-threaded tests, and saw no regressions, which is
> good.

Arenas are already mostly per-thread. My observation was that the gains
from tcache come from bypassing completely uncontended locks.
If an arena could be marked as owned by a thread, the fast single-threaded
paths could be used all of the time (you'd have to handle frees from other
threads, of course, but those could go into a separate bin for consolidation).
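A minimal sketch of that idea, assuming an owner field on the arena and a lock-free list for cross-thread frees (all names are illustrative, not glibc's implementation):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct chunk
{
  struct chunk *next;
};

struct arena
{
  pthread_t owner;                  /* thread that owns the fast path  */
  struct chunk *free_list;          /* owner-only list, no locking     */
  _Atomic (struct chunk *) foreign; /* frees arriving from other threads */
};

/* Free CH into AR: the owning thread uses the unlocked list, while
   other threads push onto the atomic "foreign" list.  */
static void
arena_free (struct arena *ar, struct chunk *ch)
{
  if (pthread_equal (pthread_self (), ar->owner))
    {
      ch->next = ar->free_list;
      ar->free_list = ch;
    }
  else
    {
      struct chunk *old = atomic_load (&ar->foreign);
      do
        ch->next = old;
      while (!atomic_compare_exchange_weak (&ar->foreign, &old, ch));
    }
}

/* The owner drains the foreign bin, e.g. when consolidating.  */
static struct chunk *
arena_take_foreign (struct arena *ar)
{
  return atomic_exchange (&ar->foreign, NULL);
}
```

The owner's path touches no shared state, so it stays as fast as the current tcache hit; only cross-thread frees pay for atomics.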

> So our paranoia here is twofold...
>
> 1. Make sure that when someone says "some benchmarks" we have those
>    benchmarks available to us, either as a microbenchmark in glibc or as
>    a trace we can simulate and benchmark.  No more random benchmarks! :-)

Agreed, it's quite feasible to create more traces and more microbenchmarks.

> 2. When we say a patch "is faster", let's run all our benchmarks and
>    make sure that we don't mean "on some benchmarks."  The whole point
>    of the trace/sim stuff is to make sure key downstream users aren't
>    left out of the optimization work, and end up with worse performance.

Well, you can't expect gains on all benchmarks, nor have a "never regress
anything ever" rule. Minor changes in the alignment of a heap block or in the
allocation of pages from the OS can have a large performance impact that is
hard to control. The smallest possible RSS isn't always better either. The goal
should be to improve average performance across a wide range of applications.

> We probably should add "on all major architectures" too but that assumes
> we have machines on which we can run the benchmarks.

Szabolcs or I would be happy to run the traces on AArch64.

Wilco
    
