This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] ARM: NEON detected memcpy.


On Tue, Apr 09, 2013 at 04:00:03PM +0100, Richard Earnshaw wrote:
> On 09/04/13 13:58, Carlos O'Donell wrote:
> >On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
> >>On 03/04/13 16:08, Joseph S. Myers wrote:
> >>>I was previously told by people at ARM that NEON memcpy wasn't a good idea
> >>>in practice because of raised power consumption, context switch costs etc.
> >>>from using NEON in processes that otherwise didn't use it, even if it
> >>>appeared superficially beneficial in benchmarks.
> >>
> >>What really matters is system power increase vs performance gain and
> >>what you might be able to save if you finish sooner.  If a 10%
> >>improvement to memcpy performance comes at a 12% increase in CPU
> >>power, then that might seem like a net loss.  But if the CPU is only
> >>50% of the system power, then the increase in system power is just
> >>half of that (i.e. 6%), but the performance improvement will still
> >>be 10%.  Note that these percentages are just examples to make the
> >>figures easier here; I've no idea what the real numbers are, and
> >>they will be highly dependent on the other components in the
> >>system: a back-lit
> >>display, in particular, will use a significant amount of power.
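
To make that arithmetic concrete, here is a small sketch using the same
illustrative figures as above (10% speedup, 12% CPU power increase, CPU
at 50% of system power); these are example numbers, not measurements.
It also shows the "finish sooner" effect on energy (power times time):

#include <stdio.h>

int main(void)
{
    /* Example figures from the discussion above -- not measurements. */
    double speedup      = 1.10; /* memcpy finishes 10% sooner          */
    double cpu_power_up = 1.12; /* CPU power rises 12% during the copy */
    double cpu_share    = 0.50; /* CPU is 50% of total system power    */

    /* Only the CPU's share of the increase shows up at system level. */
    double sys_power = 1.0 + cpu_share * (cpu_power_up - 1.0);  /* 1.06 */

    /* Energy for the copy = power * time, and time shrinks with the
       speedup, so finishing sooner can still save energy overall.    */
    double energy = sys_power / speedup;                        /* ~0.96 */

    printf("system power %+.0f%%, time %+.0f%%, energy %+.0f%%\n",
           (sys_power - 1.0) * 100.0,
           (1.0 / speedup - 1.0) * 100.0,
           (energy - 1.0) * 100.0);
    return 0;
}

With these example figures, system power goes up 6% but the copy takes
9% less time, so the energy spent on it drops by roughly 4%.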
> >>
> >>It's also necessary to think about how the Neon unit in the processor
> >>is managed.  Is it power gated or simply clock gated?  Power-gated
> >>regions are likely to have long power-up times (relative to normal
> >>CPU operations), but clock-gated regions are typically
> >>instantaneously available.
> >>
> >>Finally, you need to consider whether the unit is likely to be
> >>already in use.  With the increasing trend to using the hard-float
> >>ABI, VFP (and Neon) are generally much more widely used in code now
> >>than they were, so the other potential cost of using Neon (lazy
> >>context switching) is also likely to be a non-issue, compared with a
> >>system where the unit is almost never touched.
> >
> >My expectation here is that downstream integrators run the
> >glibc microbenchmarks, or their own benchmarks, measure power,
> >and engage the community to discuss alternate runtime tunings
> >for their systems.
> >
> >The project lacks any generalized whole-system benchmarking,
> >but my opinion is that  microbenchmarks are the best "first step"
> >towards achieving measurable performance goals (since whole-system
> >benchmarking is much more complicated).
> >
> >At present the only policy we have as a community is that faster
> >is always better.
>
I am rewriting my whole-system benchmarks to be more generic. 
Still, measuring performance would be time-consuming; a benchmark needs
at least an hour to get enough data.

Also, I cannot replicate the exact measurement conditions; they depend
on what I am doing with the computer, which varies.

There is also a problem of representativeness. I know what the
conditions look like for popular programs (gcc, firefox); most other
programs show very similar characteristics, but I do not know anything
about the tail.

To get more direct feedback I also run a record/replay benchmark; see
my previous mail.
> 
> You still have to be careful how you measure 'faster'.  Repeatedly
> running the same fragment of code under the same boundary conditions
> will only ever give you the 'warm caches' number (I, D and branch
> target), but if the code is called cold (or with different boundary
> conditions in the case of the Branch target cache) most of the time
> in real life, that's unlikely to be very meaningful.
> 
> R.
> 
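
To illustrate the "warm caches" point above, here is a rough sketch of
the effect (my own toy example, not one of the glibc microbenchmarks;
the buffer sizes and iteration counts are arbitrary).  Hammering one
small buffer only ever measures the hot-cache case, while walking a
working set much larger than the caches is closer to how memcpy is
often called in real programs:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { COPY = 4096, ITERS = 200000, SLOTS = 4096 };  /* arbitrary */

    /* Warm case: the same 4 KiB buffers every iteration, so the data
       caches and branch predictors stay hot.                         */
    char *src = malloc(COPY), *dst = malloc(COPY);
    memset(src, 1, COPY);
    double t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, COPY);
    double warm = now_sec() - t0;

    /* Colder case: walk a 16 MiB source and destination, far larger
       than typical caches, so most copies start from memory.         */
    char *bigsrc = malloc((size_t)SLOTS * COPY);
    char *bigdst = malloc((size_t)SLOTS * COPY);
    memset(bigsrc, 1, (size_t)SLOTS * COPY);
    t0 = now_sec();
    for (int i = 0; i < ITERS; i++) {
        size_t off = (size_t)(i % SLOTS) * COPY;
        memcpy(bigdst + off, bigsrc + off, COPY);
    }
    double cold = now_sec() - t0;

    printf("warm: %.0f MB/s   colder: %.0f MB/s\n",
           ITERS * (double)COPY / warm / 1e6,
           ITERS * (double)COPY / cold / 1e6);
    return 0;
}

The gap between the two numbers is exactly the difference being pointed
at: which one matters depends on how the code is called in real life.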

