This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] benchtests: Add malloc microbenchmark


On 9 June 2014 21:33, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Mon, Jun 09, 2014 at 10:07:53PM +0530, Siddhesh Poyarekar wrote:
>> On Mon, Jun 09, 2014 at 04:14:35PM +0100, Will Newton wrote:
>> > > A maximum of 32K only tests arena allocation performance.  This is
>> > > fine for now since malloc+mmap performance is as interesting.  What is
>> <snip>
>> >
>> > There are at least two axes we are interested in - how performance
>> > scales with the number of threads and how performance scales with
>> > the allocation size. For thread performance (which this benchmark
>> > is about) the larger allocations are not so interesting - typically
>> > their locking overhead is in the kernel rather than userland, and
>> > in terms of real world application performance it's just not as
>> > likely to be a bottleneck as small allocations. We have to be
>> > pragmatic in which choices we make, as the full matrix of threads
>> > versus allocation sizes would be pretty huge.
>>
>> Heh, I noticed my typo now - I meant to say that malloc+mmap
>> performance is *not* as interesting :)
>>
> The problem is that this benchmark does not measure multithreaded
> performance well. Just spawning many threads does not say much; my
> guess is that lock contention will quickly converge to a state where
> each core runs one thread with its own arena. It also does not
> measure the hard case where memory allocated in one thread is freed
> in another.
>
> I looked at the multithreaded benchmark and it has additional flaws:
>
> Big variance: the running time varies by around 10% across
> iterations, depending on how the kernel schedules the threads.
> Starting the threads and measuring time only after joining them
> measures the slowest thread, so towards the end some cores sit idle.

Thanks for the suggestion, I will look into this.
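
One approach might be for each thread to time itself with
clock_gettime and report per-thread figures, instead of timing a
single join of all the threads. A rough sketch follows; the thread
count, iteration count and allocation size are arbitrary placeholders
rather than the benchmark's actual parameters:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4
#define ITERS 100000

static double elapsed[NTHREADS];

static void *
worker (void *arg)
{
  size_t idx = (size_t) arg;
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < ITERS; i++)
    {
      void *p = malloc (16);
      free (p);
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  elapsed[idx] = (end.tv_sec - start.tv_sec)
                 + (end.tv_nsec - start.tv_nsec) / 1e9;
  return NULL;
}

int
main (void)
{
  pthread_t threads[NTHREADS];

  for (size_t i = 0; i < NTHREADS; i++)
    pthread_create (&threads[i], NULL, worker, (void *) i);
  for (size_t i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);

  /* Each thread reports its own time, so a slow thread cannot hide
     behind faster ones and the variance is visible directly.  */
  for (size_t i = 0; i < NTHREADS; i++)
    printf ("thread %zu: %g ns per malloc/free pair\n", i,
            elapsed[i] / ITERS * 1e9);
  return 0;
}

Reporting the per-thread numbers also exposes the scheduling variance
directly, rather than folding it into whichever thread finished last.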

> Bad units: when I run the benchmark with one thread, the mean is:
> "mean": 91.605,
> However, when we run 32 threads it looks as if malloc got around
> three times faster:
>  "mean": 28.5883,

What is wrong with that? I assume you have a multi-core system; would
you not expect more threads to give higher aggregate throughput?
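
That said, if the reported mean is total wall time divided by total
calls across all threads (an assumption on my part), then 91.6
dropping to 28.6 means aggregate throughput roughly tripled (91.6 /
28.6, about 3.2x) with 32 threads; it does not mean an individual
malloc call got three times faster. Reporting per-thread time per
call alongside aggregate calls per second would keep the two from
being conflated.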

>> > So I guess I should probably also write a benchmark for allocation
>> > size for glibc as well...
>>
>> Yes, it would be a separate benchmark and probably would need some
>> specific allocation patterns rather than random sizes.  Of course
>> choosing allocation patterns is not going to be easy.
>>
> No, that was the benchmark I posted, which measured exactly what
> happens at given sizes.
>
>>
>> > > I don't know how useful max_rss would be since we're only doing a
>> > > malloc and never really writing anything to the allocated memory.
>> > > Smaller sizes will probably result in actual page allocation since
>> > > we write to the chunk headers, but probably not so for larger
>> > > sizes.
>> >
>> > Yes, it is slightly problematic. What you probably want to do is
>> > zero all the memory and measure RSS at that point, but it would
>> > slow down the benchmark and spend a lot of time in memset instead.
>> > At the moment it tells you how many pages are taken up by
>> > book-keeping but not how many of those pages your application
>> > would touch anyway.
>>
>> Oh I didn't mean to imply that we zero pages and try to get a more
>> accurate RSS value.  My point was that we could probably just do away
>> with it completely because it doesn't really tell us much - I can't
>> see how pages taken up by book-keeping would be useful.
>>
>> However if you do want to show resource usage, then address space
>> usage (VSZ) might show scary numbers due to the per-thread arenas, but
>> they would be much more representative.  Also, it might be useful to
>> see how address space usage scales with threads, especially for
>> 32-bit.
>>
> Still, this would be worse than useless, as it would differ wildly
> from real behaviour (for example, allocations made in quick
> succession are typically also deallocated in quick succession), and
> that could lead us to implement something that actually increases
> memory usage. That happened in the 70s, so let us not repeat the
> mistake.
>
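
For what it's worth, here is a minimal sketch of the touch-then-measure
idea discussed above, using getrusage. On Linux, ru_maxrss is reported
in kilobytes; the block count and size below are arbitrary
placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

#define NBLOCKS 1024
#define BLOCK_SIZE 4096

int
main (void)
{
  void *blocks[NBLOCKS];
  struct rusage usage;

  for (int i = 0; i < NBLOCKS; i++)
    {
      blocks[i] = malloc (BLOCK_SIZE);
      if (blocks[i] == NULL)
        abort ();
      /* Touch every page so the RSS figure reflects memory the
         application would actually use, not just chunk headers.  */
      memset (blocks[i], 0, BLOCK_SIZE);
    }

  /* ru_maxrss is the peak resident set size of the process.  */
  getrusage (RUSAGE_SELF, &usage);
  printf ("max RSS: %ld KiB\n", usage.ru_maxrss);

  for (int i = 0; i < NBLOCKS; i++)
    free (blocks[i]);
  return 0;
}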
>> > No, I haven't looked into that; so far I have been treating malloc
>> > as a black box and I'm hoping not to tailor the benchmark too far
>> > to one implementation or another.
>>
>> I agree that the benchmark should not be tailored to the current
>> implementation, but then this behaviour would essentially be another
>> set of inputs.  Simply increasing the maximum size from 32K to about
>> 128K (that's the initial threshold for mmap anyway) might result in
>> that behaviour being triggered more frequently.
>>
> For malloc, benchmarks need to satisfy some conditions to be
> meaningful. When you compare different implementations, each may
> place allocations in memory differently. That could cause additional
> cache misses that dominate real performance, yet the benchmark does
> not measure them. Treating malloc as a black box rather defeats the
> purpose.
>
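
To make that concrete, a rough sketch of a loop that keeps a working
set of live allocations and writes to each new block, so an
implementation that scatters blocks across memory pays its cache
misses inside the measured region. Timing code is omitted, and the
slot count and size range are illustrative assumptions only:

#include <stdlib.h>
#include <string.h>

#define NSLOTS 256
#define ITERS 100000

int
main (void)
{
  void *live[NSLOTS] = { 0 };

  /* Free one live block and allocate a replacement each iteration,
     then touch it, so placement-induced cache misses are part of
     what a surrounding timer would measure.  */
  for (int i = 0; i < ITERS; i++)
    {
      int slot = rand () % NSLOTS;
      size_t size = 16 + rand () % 1024;

      free (live[slot]);
      live[slot] = malloc (size);
      if (live[slot] != NULL)
        memset (live[slot], 0xa5, size);
    }

  for (int i = 0; i < NSLOTS; i++)
    free (live[i]);
  return 0;
}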



-- 
Will Newton
Toolchain Working Group, Linaro

