This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.


On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> On 30 August 2013 20:26, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 08/30/2013 02:48 PM, Will Newton wrote:
> >> On 30 August 2013 18:14, Carlos O'Donell <carlos@redhat.com> wrote:
> >>>> Ping?
> >>>
> >>> How did you test the performance?
> >>>
> >>> glibc has a performance microbenchmark, did you use that?
> >>
> >> No, I used the cortex-strings package developed by Linaro for
> >> benchmarking various string functions against one another[1].
> >>
> >> I haven't checked the glibc benchmarks but I'll look into that. It's
> >> quite a specific case that shows the problem so it may not be obvious
> >> which one is better however.
> >
> > If it's not obvious how is someone supposed to review this patch? :-)
> 
> With difficulty. ;-)
> 
> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
> 
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.
>
Well, for measuring performance, about the only way that everybody will
agree on is to compile the implementations as old.so and new.so and then run

LD_PRELOAD=old.so time cmd
LD_PRELOAD=new.so time cmd

in a loop until you can show a statistically significant difference
(provided that the commands you use are representative enough).

For any other method somebody will argue the opposite, because you
forgot to take some factor into account.

Even when you change the LD_PRELOAD=old.so implementation so that it
accurately measures the time spent in the function, that need not be enough.

You could have an implementation that is 5 cycles faster on that
benchmark but slower in reality, because

1) Its code is 1000 bytes bigger than the alternative. The gains in the
function itself will be eaten by instruction cache misses outside the function.
Or
2) The function aggressively prefetches data (say, a loop that prefetches
lines 512 bytes past the current buffer position). This makes the benchmark
numbers better, but the cache gets littered with data past the end of the
buffer and real performance suffers (see the sketch after this list).
Or
3) For malloc, saving metadata in the same cache line as the start of the
allocated memory can make a benchmark look bad due to cache misses. But it
will improve real performance, because the user will write there and the
metadata write serves as a prefetch.
or
...
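
As an illustration of point 2, a sketch of the problem shape -- this is not
the glibc or cortex-strings code, just a toy copy loop. It prefetches 512
bytes ahead of the read position, so for the last 512 bytes it keeps pulling
in lines past the end of the source buffer; a benchmark that only times the
copy itself never pays for the data those lines evicted.

#include <stddef.h>
#include <stdint.h>

void *copy_with_far_prefetch(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    for (size_t i = 0; i < n; i++) {
        if ((i & 63) == 0)                      /* once per cache line */
            __builtin_prefetch(s + i + 512, 0, 0);
        d[i] = s[i];
    }
    return dst;
}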


> > e.g.
> > http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
> >
> > I would not call `multi' exhaustive, and while the glibc performance benchmark
> > tests are not exhaustive either, the glibc tests have received review from the
> > glibc community and are our preferred way of demonstrating performance gains
> > when posting performance patches.
> 
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.
> 
Like the following? http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_rand/result.html

On real memcpy workloads it is still a bit hard to see what is going on:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_gcc/result.html


> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.
> 
How will you fix measuring in a tight loop, with the same arguments, and only 32 times?
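
The pattern I mean is roughly the following (a sketch, not the actual
benchtests source): the same destination, source and size on every
iteration, and only 32 timed calls, so after the first call everything is
hot in cache and the branch predictors are trained on a single argument
pattern.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERS 32

int main(void)
{
    static char src[4096], dst[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, sizeof src);           /* same arguments every time */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per call (dst[0]=%d)\n", ns / ITERS, dst[0]);
    return 0;
}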

