This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [PATCH] ARM: NEON detected memcpy.
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Richard Earnshaw <rearnsha at arm dot com>
- Cc: "Joseph S. Myers" <joseph at codesourcery dot com>, "Shih-Yuan Lee (FourDollars)" <sylee at canonical dot com>, "patches at eglibc dot org" <patches at eglibc dot org>, "libc-ports at sourceware dot org" <libc-ports at sourceware dot org>, "rex dot tsai at canonical dot com" <rex dot tsai at canonical dot com>, "jesse dot sung at canonical dot com" <jesse dot sung at canonical dot com>, "yc dot cheng at canonical dot com" <yc dot cheng at canonical dot com>, Shih-Yuan Lee <fourdollars at gmail dot com>
- Date: Tue, 09 Apr 2013 08:58:31 -0400
- Subject: Re: [PATCH] ARM: NEON detected memcpy.
- References: <CAAT15mNnqeb6tuVdV6b4uJf-qFDH1acxevyW6f-gH+SkguENmg at mail dot gmail dot com> <Pine dot LNX dot 4 dot 64 dot 1304031505020 dot 580 at digraph dot polyomino dot org dot uk> <5163D9B8 dot 7030008 at arm dot com>
On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
> On 03/04/13 16:08, Joseph S. Myers wrote:
>> I was previously told by people at ARM that NEON memcpy wasn't a good idea
>> in practice because of raised power consumption, context switch costs etc.
>> from using NEON in processes that otherwise didn't use it, even if it
>> appeared superficially beneficial in benchmarks.
>
> What really matters is system power increase vs performance gain and
> what you might be able to save if you finish sooner. If a 10%
> improvement to memcpy performance comes at a 12% increase in CPU
> power, then that might seem like a net loss. But if the CPU is only
> 50% of the system power, then the increase in system power
> is just half of that (ie 6%), but the performance improvement will
> still be 10%. Note that 50% is just an example to make the figures
> easier here, I've no idea what the real numbers are, and they will be
> highly dependent on the other components in the system: a back-lit
> display, in particular, will use a significant amount of power.
>
> It's also necessary to think about how the Neon unit in the processor
> is managed. Is it power gated or simply clock gated? Power gated
> regions are likely to have long power-up times (relative to normal
> CPU operations), but clock-gated regions are typically
> instantaneously available.
>
> Finally, you need to consider whether the unit is likely to be
> already in use. With the increasing trend to using the hard-float
> ABI, VFP (and Neon) are generally much more widely used in code now
> than they were, so the other potential cost of using Neon (lazy
> context switching) is also likely to be less of an issue than if the
> unit is almost never touched.
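
As a rough illustration of the arithmetic in the first quoted paragraph,
the sketch below reuses its example figures (10% faster memcpy, 12% more
CPU power, CPU at 50% of system power). The function names are
hypothetical, and two things are assumptions added here rather than
claims from the thread: that energy is simply power multiplied by time,
and that a 10% improvement means the copy finishes in 1/1.10 of the
original time. None of the numbers are measurements.

```python
def system_power_increase(cpu_power_increase, cpu_share_of_system):
    """Fractional rise in total system power when only the CPU draws more."""
    return cpu_power_increase * cpu_share_of_system


def relative_energy(power_increase, speedup):
    """Energy for a fixed amount of work, relative to a baseline of 1.0.

    Assumes energy = power * time, and that a speedup of s shortens the
    run time to 1 / (1 + s) of the baseline.
    """
    return (1.0 + power_increase) / (1.0 + speedup)


# Example figures from the quoted message; illustrative only.
cpu_increase = 0.12   # 12% more CPU power with the NEON memcpy
speedup = 0.10        # 10% faster memcpy
cpu_share = 0.50      # CPU assumed to be half of total system power

sys_increase = system_power_increase(cpu_increase, cpu_share)
print(f"system power increase: {sys_increase:.1%}")
print(f"energy, CPU only:      {relative_energy(cpu_increase, speedup):.3f}")
print(f"energy, whole system:  {relative_energy(sys_increase, speedup):.3f}")
```

Under these assumptions the CPU-only view shows slightly more energy per
copy (a ratio above 1.0), while the whole-system view shows slightly less
(a ratio below 1.0), which is the point about finishing sooner.
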
My expectation here is that downstream integrators run the
glibc microbenchmarks, or their own benchmarks, measure power,
and engage the community to discuss alternate runtime tunings
for their systems.
The project lacks any generalized whole-system benchmarking,
but my opinion is that microbenchmarks are the best "first step"
towards achieving measurable performance goals (since whole-system
benchmarking is much more complicated).
At present the only policy we have as a community is that faster
is always better.
Cheers,
Carlos.