This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.


On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
> 
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
> 
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc?  On x64 a penalty for
using them is smaller than alternatives(increased branch
misprediction...)

>  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
> 
You would need to improve gcc detection of alignments first. Now gcc
misses most of opportunities, even in following code gcc issues
retundant alignment checks:

#include <stdint.h>
char *foo(long *x){
 if (((uintptr_t)x)%16)
  return x+4;
 else {
  __builtin_memset(x,0,512);
  return x;
 }
}

If gcc guys fix that then we do not have to ask them anything. We could
just change headers to recognize aligned case like

#define strchr(x,c) ({ char *__x=x;\
  if (__builtin_constant_p(((uintptr_t)__x)%16) && !((uintptr_t)__x)%16)\
    strchr_aligned(__x,c);\
  else\
    strchr(__x,c);\
})

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
> 
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published".  I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These cost are relatively difficult to describe, take strlen on main
memory as example.
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see hardware prefetcher in action. A time goes linearly with
size until 512 bytes and remains constant until 4096 bytes(switch to
block view) where it starts increasing at slower rate.
 
For core2 shape is similar except that plateau starts at 256 bytes and
ends at 1024 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different, phenomII performance is line, and for fx10
there is even area where time decreases with size. 
http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html 
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]