This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH v1.2] Improve unaligned memcpy and memmove.

From: Ondřej Bílka <neleai at seznam dot cz>
To: Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>
Date: Mon, 21 Oct 2013 20:09:19 +0200
Subject: Re: [PATCH v1.2] Improve unaligned memcpy and memmove.
Authentication-results: sourceware.org; auth=none
References: <20130819085220 dot GB19541 at domone> <20130829153829 dot GA6105 at domone dot kolej dot mff dot cuni dot cz> <20131003220926 dot GA12203 at domone dot podge> <CAHjhQ93gDTLC9jh56PPXPf0DndUBxVd371Xpw1+vPM9HVnHHfw at mail dot gmail dot com> <20131004125248 dot GA23055 at domone dot podge> <CAHjhQ904LgYwKXjqPyTZp4SDoc6t7Q+cFhmhsLgXydFQ3vbHpg at mail dot gmail dot com> <20131004132942 dot GA23955 at domone dot podge> <CAHjhQ93knCMGRQCNtXc+PVmKx7NS7e1EBB9_=RRaTyb3FE2msQ at mail dot gmail dot com>

On Fri, Oct 04, 2013 at 05:46:51PM +0400, Liubov Dmitrieva wrote:
> I am surprised that rep is faster on Atom because Atom is known for slow reps...
> We should recheck it.
>
This does not surprise me much. The alternatives have complex control
flow, which works well on out-of-order machines. Rep is an exception
because its control flow is very simple. A second factor is that rep is
much more icache-friendly than the other implementations, which Atom's
small caches make more noticeable.

A third factor could be the data cache: when reading a lot of aligned
data from main memory, rep is the fastest alternative on most
processors; see the block mode of
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/memcpy_profile_loop/results_rand_aligned_nocache/result.html

For rechecking I wrote an independent tool. It uses LD_PRELOAD to load
a given implementation and measures the total time spent. It calculates
relative performance with a 95% confidence interval.
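
As a rough illustration of what "relative performance with a 95%
confidence interval" can mean, here is a minimal sketch. It is not the
tool's actual code: the function name, the use of a normal approximation
(z = 1.96) and the delta-method error formula are all assumptions made
for the example.

```python
import math
import statistics


def relative_performance(candidate, baseline, z=1.96):
    """Return (ratio, low, high) for mean(candidate) / mean(baseline).

    Hypothetical helper: inputs are lists of total run times, one entry
    per run. z=1.96 assumes an approximately normal sampling distribution.
    """
    mean_c = statistics.fmean(candidate)
    mean_b = statistics.fmean(baseline)
    # Standard error of each mean.
    se_c = statistics.stdev(candidate) / math.sqrt(len(candidate))
    se_b = statistics.stdev(baseline) / math.sqrt(len(baseline))
    ratio = mean_c / mean_b
    # Delta-method approximation for a ratio of two independent means.
    se_ratio = ratio * math.sqrt((se_c / mean_c) ** 2 + (se_b / mean_b) ** 2)
    return ratio, ratio - z * se_ratio, ratio + z * se_ratio


if __name__ == "__main__":
    # Made-up timings in seconds, only to show the call.
    ratio, low, high = relative_performance(
        [10.02, 10.05, 9.98, 10.01], [10.10, 10.07, 10.12, 10.09]
    )
    print(f"ratio {ratio:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

If the interval contains 1.0, the runs collected so far do not
distinguish the two implementations.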

This should account for all factors, but it has the disadvantage of
being slow. The difference between the best and worst memcpy
implementations is less than 1%, so you need to run it for a day before
the variance becomes small enough.
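
To give a feel for why a sub-1% difference takes so long to resolve,
here is a back-of-the-envelope sketch. The formula is the textbook
sample-size estimate for comparing two independent means under a normal
approximation; the function name and the 5% run-to-run variation used in
the example are assumptions, not measurements.

```python
import math


def runs_needed(cv, rel_diff, z=1.96):
    """Estimate runs per implementation so that the 95% confidence
    interval of the difference of two means is narrower than rel_diff.

    cv       -- coefficient of variation of a single run (stdev / mean)
    rel_diff -- relative difference to resolve, e.g. 0.01 for 1%
    Hypothetical helper; assumes independent, roughly normal run times.
    """
    return math.ceil(2 * (z * cv / rel_diff) ** 2)


if __name__ == "__main__":
    # Assumed 5% run-to-run noise, differences of 1% and 0.5%.
    print(runs_needed(0.05, 0.01))
    print(runs_needed(0.05, 0.005))
```

The required number of runs grows with the square of the noise-to-
difference ratio, so halving the difference to detect roughly quadruples
the run time.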

I ran this on various processors; the results, together with the
checker, are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_consistency.tar.bz2

The results would need more time; the biggest problem with them is
frequency switching. When the variance and mean suddenly jump by a large
amount, it was probably caused by the process being rescheduled to an
idle core.

Could you try running the consistency benchmark in screen? It is done by

./benchmark | tee result/atom

You can see the accumulated results by running the following script:

./rep

> You probably should join that memcpy patches into one to simplify
> review and to make clear what version for which processor will be
> finally used.
> 
I will post that when I have time.
