This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: memcpy performance regressions 2.19 -> 2.24(5)


Hi Carlos,

a) The number of runs depended on the time taken: the number of
iterations for each size was chosen so that the total ran for at
least 500ms.  For many of the smaller sizes this means 10-100 million
iterations; for the largest size, 64MB, it was ~60.  10 runs were
launched separately, and the difference between the maximum and the
minimum average was never more than 6% for any size; all of the
regressions are larger than this difference (usually much larger).
The times on the spreadsheet are from a randomly chosen run - it would
be possible to use a median or average, but given the large size of
the effect, it didn't seem necessary.

b) The machines were idle (background processes only) apart from the
test being run.  Turbo Boost was disabled.  The benchmark is
single-threaded.  I did not explicitly pin the process, but given that
the machine was otherwise idle it would be surprising if it were
migrated.  I can add pinning to see if the results change.

c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy
Bridge), and E5-2689 (Sandy Bridge); I don't have motherboard or
memory info.  The kernel on the benchmark machines is 3.11.10.

d) Only bench-memcpy-large would expose the problem at the largest
sizes, and 2.19 did not have bench-memcpy-large.  The current
benchmarks will not reveal the regressions on Ivy Bridge and Haswell
in the intermediate size range because they correspond only to the
readwritecache case on the spreadsheet.  That is, they loop over the
same src and dst buffers in the timing loop.

nocache means that both the src and dst buffers stride through memory
such that nothing will be cached.
readcache means that the src buffer is fixed, but the dst buffer
strides through memory.
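
The three access patterns can be sketched as follows (the names come
from the spreadsheet; the pool layout and stride scheme here are
illustrative assumptions, not the actual benchmark source):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the three access patterns (assumed layout):
 *   READWRITECACHE: same src and dst every iteration, so both stay
 *                   hot in cache (what bench-memcpy measures).
 *   READCACHE:      src fixed, dst strides through a pool sized
 *                   larger than the LLC, so stores keep missing.
 *   NOCACHE:        both src and dst stride, so neither loads nor
 *                   stores hit cache. */
enum pattern { READWRITECACHE, READCACHE, NOCACHE };

static void run_pattern(enum pattern p, char *pool, size_t pool_bytes,
                        size_t size, long iters)
{
    size_t nslots = pool_bytes / (2 * size);  /* slots per half-pool */
    size_t half = pool_bytes / 2;
    for (long i = 0; i < iters; i++) {
        size_t slot = (size_t)i % nslots;
        const char *src = (p == NOCACHE)
                        ? pool + slot * size          /* striding src */
                        : pool;                       /* fixed src */
        char *dst = (p == READWRITECACHE)
                  ? pool + pool_bytes - size          /* fixed dst */
                  : pool + half + slot * size;        /* striding dst */
        memcpy(dst, src, size);
    }
}
```

For the striding cases the pool would need to exceed the last-level
cache for the "nothing cached" property to hold; a cache-sized pool is
used in testing here only to check the addressing.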

To see the difference at the largest sizes with bench-memcpy-large,
you can run it twice: once as-is, and once with
__x86_shared_non_temporal_threshold forced to LONG_MAX so that the
non-temporal path is never taken.
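
For readers unfamiliar with what that threshold gates: above it, glibc
switches to streaming (non-temporal) stores, which bypass the cache.
A minimal standalone illustration of such a copy, using SSE2
intrinsics (this is not glibc's implementation, just a sketch of the
technique):

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>
#include <string.h>

/* Sketch of a non-temporal copy (not glibc's code): streaming stores
 * bypass the cache, so the destination is not left cache-resident --
 * a win when the working set exceeds the LLC, a loss when a later
 * reader would otherwise have hit in cache.  dst must be 16-byte
 * aligned for _mm_stream_si128. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    size_t chunks = n / 16;
    for (size_t i = 0; i < chunks; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_stream_si128((__m128i *)(d + 16 * i), v);
    }
    _mm_sfence();  /* make the streaming stores globally visible */
    memcpy(d + 16 * chunks, s + 16 * chunks, n % 16);  /* tail bytes */
}
```

Forcing the threshold to LONG_MAX simply means this style of copy is
never selected, so all sizes go through the ordinary cached path.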

e) Yes, I can do this.  It needs to go through approval before being
shared publicly, which will take a few days.

Thanks,
Erich

On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>> I had a couple of questions:
>>
>> 1) Are the large regressions at large sizes for IvyBridge and
>> SandyBridge expected?  Is avoiding non-temporal stores a reasonable
>> solution?
>
> No large regressions are expected.
>
>> 2) Is it possible to fix the IvyBridge regressions by using model
>> information to force a specific implementation?  I'm not sure how
>> other cpus (AMD) would be affected if the selection logic was modified
>> based on feature flags.
>
> A different memcpy can be used for any detectable difference in hardware.
> What you can't do is select a different memcpy for a different range of
> inputs. You have to make the choice upfront with only the knowledge of
> the hardware as your input. Though today we could augment that choice
> with a glibc tunable set by the shell starting the process.
>
> I have questions of my own:
>
> (a) How statistically relevant were your results?
> - What are your confidence intervals?
> - What is your standard deviation?
> - How many runs did you average?
>
> (b) Was your machine hardware stable?
> - See:
> https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
> - What methodology did you use to carry out your tests? Like CPU pinning.
>
> (c) Exactly what hardware did you use?
> - You mention IvyBridge and SandyBridge, but what exact hardware did
>   you use for the tests, and what exact kernel version?
>
> (d) If you run glibc's own microbenchmarks do you see the same
>     performance problems? e.g. make bench, and look at the detailed
>     bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>
> https://sourceware.org/glibc/wiki/Testing/Builds
>
> (e) Are you willing to publish your microbenchmark sources for others
>     to confirm the results?
>
> --
> Cheers,
> Carlos.

