This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: memcpy performance regressions 2.19 -> 2.24(5)


On Fri, May 5, 2017 at 5:57 PM, Erich Elsen <eriche@google.com> wrote:
> Hi Carlos,
>
> a/b) The number of runs depended on the time taken; the number of
> iterations was chosen so that each size took at least 500ms across
> all iterations.  For many of the smaller sizes this means 10-100
> million iterations; for the largest size, 64MB, it was ~60.  10 runs
> were launched separately, and the difference between the maximum and
> the minimum average was never more than 6% for any size; all of the
> regressions are larger than this difference (usually much larger).
> The times on the spreadsheet are from a randomly chosen run - it
> would be possible to use a median or an average, but given the large
> size of the effect, it didn't seem necessary.
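>
> A minimal sketch of the kind of calibration loop used (illustrative
> only; the helper names and buffer handling here are assumptions, not
> the actual harness):
>
> #include <stdint.h>
> #include <string.h>
> #include <time.h>
>
> static uint64_t now_ns (void)
> {
>   struct timespec ts;
>   clock_gettime (CLOCK_MONOTONIC, &ts);
>   return (uint64_t) ts.tv_sec * 1000000000ull + ts.tv_nsec;
> }
>
> /* Double the iteration count until one size runs for >= 500 ms, then
>    report the average time per memcpy call in nanoseconds.  */
> static double bench_size (char *dst, const char *src, size_t size)
> {
>   uint64_t iters = 1, elapsed;
>   for (;;)
>     {
>       uint64_t start = now_ns ();
>       for (uint64_t i = 0; i < iters; i++)
>         memcpy (dst, src, size);
>       elapsed = now_ns () - start;
>       if (elapsed >= 500000000ull)
>         break;
>       iters *= 2;
>     }
>   return (double) elapsed / (double) iters;
> }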
>
> b) The machines were idle (background processes only) except for the
> test being run.  Boost was disabled.  The benchmark is
> single-threaded.  I did not explicitly pin the process, but given
> that the machine was otherwise idle, it would be surprising if it
> were migrated.  I can add this to see if the results change.
>
> c) The specific processors were E5-2699 (Haswell), E5-2696 (Ivy),
> E5-2689 (Sandy); I don't have motherboard or memory info.  The kernel
> on the benchmark machines is 3.11.10.
>
> d)  Only bench-memcpy-large would expose the problem at the largest
> sizes.  2.19 did not have bench-memcpy-large.  The current benchmarks
> will not reveal the regressions on Ivy and Haswell in the intermediate
> size range because they only correspond to the readwritecache case on
> the spreadsheet.  That is, they loop over the same src and dst buffers
> in the timing loop.
>
> nocache means that both the src and dst buffers go through memory with
> strides such that nothing will be cached.
> readcache means that the src buffer is fixed, but the dst buffer
> strides through memory.
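>
> As a rough sketch of what "nocache" means in code (the pool size and
> names are assumptions, not the actual benchmark): both offsets advance
> through a pool much larger than the last-level cache, so neither
> buffer stays resident; for "readcache", the src offset would stay
> fixed at 0.
>
> #include <stddef.h>
> #include <string.h>
>
> #define POOL_BYTES (1ul << 30)   /* 1 GiB pool, far beyond L3.  */
>
> static void copy_nocache (char *dst_pool, const char *src_pool,
>                           size_t size, unsigned long iters)
> {
>   size_t off = 0;
>   for (unsigned long i = 0; i < iters; i++)
>     {
>       memcpy (dst_pool + off, src_pool + off, size);
>       off += size;
>       if (off + size > POOL_BYTES)
>         off = 0;                 /* Wrap around the pool.  */
>     }
> }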
>
> To see the difference at the largest sizes with bench-memcpy-large,
> you can run it twice: once as-is, and once forcing
> __x86_shared_non_temporal_threshold to LONG_MAX so that the
> non-temporal path is never taken.

The purpose of using non-temporal stores is to avoid cache pollution
so that the cache remains available to other threads.  We can improve
the heuristic for the non-temporal threshold, but we can't give all of
the cache to a single thread by default.
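
As an illustration (not glibc's implementation), a copy loop built on
non-temporal stores looks roughly like this; it assumes a 16-byte
aligned destination and a size that is a multiple of 16:

#include <emmintrin.h>
#include <stddef.h>

static void copy_nontemporal (void *dst, const void *src, size_t size)
{
  __m128i *d = (__m128i *) dst;
  const __m128i *s = (const __m128i *) src;
  /* Streaming stores bypass the cache, so the copied data does not
     evict other threads' working sets from the shared cache.  */
  for (size_t i = 0; i < size / 16; i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();   /* Make the streaming stores globally visible.  */
}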

As for Haswell, there are some cases where the SSSE3 memcpy in
glibc 2.19 is faster than the new AVX memcpy, but the new AVX
memcpy is faster than the SSSE3 memcpy in the majority of cases.  The
new AVX memcpy in glibc 2.24 replaces the old AVX memcpy in glibc
2.23, so there is no regression from 2.23 to 2.24.

I also checked my glibc performance data.  For data > 32K,
__memcpy_avx_unaligned is slower than __memcpy_avx_unaligned_erms.
We have

/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
   memcpy micro benchmark in glibc shows that 2KB is the approximate
   value above which REP MOVSB becomes faster than SSE2 optimization
   on processors with Enhanced REP MOVSB.  Since larger register size
   can move more data with a single load and store, the threshold is
   higher with larger register size.  */
#ifndef REP_MOVSB_THRESHOLD
# define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
#endif
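
For sizes above that threshold, the ERMS variant essentially boils
down to a single REP MOVSB; a minimal sketch (illustrative only, not
the actual glibc code, x86-64 inline asm) is:

#include <stddef.h>

static void copy_rep_movsb (void *dst, const void *src, size_t size)
{
  /* "rep movsb" copies RCX bytes from [RSI] to [RDI]; with Enhanced
     REP MOVSB the microcode moves data in large chunks, so it is
     competitive with vector loops once the setup cost is amortized.  */
  asm volatile ("rep movsb"
                : "+D" (dst), "+S" (src), "+c" (size)
                :
                : "memory");
}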

We can change it if there is an improvement in the glibc benchmarks.


H.J.

> e) Yes, I can do this.  It needs to go through approval before I can
> share it publicly, which will take a few days.
>
> Thanks,
> Erich
>
> On Fri, May 5, 2017 at 11:09 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 05/05/2017 01:09 PM, Erich Elsen wrote:
>>> I had a couple of questions:
>>>
>>> 1) Are the large regressions at large sizes for IvyBridge and
>>> SandyBridge expected?  Is avoiding non-temporal stores a reasonable
>>> solution?
>>
>> No large regressions are expected.
>>
>>> 2) Is it possible to fix the IvyBridge regressions by using model
>>> information to force a specific implementation?  I'm not sure how
>>> other cpus (AMD) would be affected if the selection logic was modified
>>> based on feature flags.
>>
>> A different memcpy can be used for any detectable difference in hardware.
>> What you can't do is select a different memcpy for a different range of
>> inputs. You have to make the choice upfront with only the knowledge of
>> the hardware as your input. Though today we could augment that choice
>> with a glibc tunable set by the shell starting the process.
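>>
>> A minimal sketch of that kind of upfront, hardware-only selection,
>> using a GNU IFUNC resolver (illustrative only; the variant names and
>> feature test are made up, this is not glibc's actual resolver):
>>
>> #include <stddef.h>
>> #include <string.h>
>>
>> /* Stand-ins for the real SSSE3/AVX implementations.  */
>> static void *memcpy_variant_sse (void *d, const void *s, size_t n)
>> { return memcpy (d, s, n); }
>>
>> static void *memcpy_variant_avx (void *d, const void *s, size_t n)
>> { return memcpy (d, s, n); }
>>
>> /* The resolver runs once at load time; only CPU features are known
>>    here, never the size of any particular call.  */
>> static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
>> {
>>   __builtin_cpu_init ();
>>   return __builtin_cpu_supports ("avx") ? memcpy_variant_avx
>>                                         : memcpy_variant_sse;
>> }
>>
>> void *my_memcpy (void *, const void *, size_t)
>>   __attribute__ ((ifunc ("resolve_memcpy")));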
>>
>> I have questions of my own:
>>
>> (a) How statistically relevant were your results?
>> - What are your confidence intervals?
>> - What is your standard deviation?
>> - How many runs did you average?
>>
>> (b) Was your machine hardware stable?
>> - See:
>> https://developers.redhat.com/blog/2016/03/11/practical-micro-benchmarking-with-ltrace-and-sched/
>> - What methodology did you use to carry out your tests? Like CPU pinning.
>>
>> (c) Exactly what hardware did you use?
>> - You mention IvyBridge and SandyBridge, but what exact hardware did
>>   you use for the tests, and what exact kernel version?
>>
>> (d) If you run glibc's own microbenchmarks do you see the same
>>     performance problems? e.g. make bench, and look at the detailed
>>     bench-memcpy, bench-memcpy-large, and bench-memcpy-random results.
>>
>> https://sourceware.org/glibc/wiki/Testing/Builds
>>
>> (e) Are you willing to publish your microbenchmark sources for others
>>     to confirm the results?
>>
>> --
>> Cheers,
>> Carlos.



-- 
H.J.

