This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [Patch, AArch64] Optimized strcpy


On Thu, Dec 18, 2014 at 12:45:12PM +0000, Richard Earnshaw wrote:
> On 18/12/14 01:44, Adhemerval Zanella wrote:
> > On 17-12-2014 23:05, Ondřej Bílka wrote:
> >> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> >>> This patch contains an optimized implementation of strcpy for AArch64
> >>> systems.  Benchmarking shows that it is approximately 20-25% faster than
> >>> the generic implementation across the board.
> >>>
> >> I had a quick look at the patch; I found two microoptimizations below
> >> and a probable performance problem.
> >>
> >> Handling sizes 1-8 is definitely not a slow path, it's the hot path. My
> >> profiler shows that 88.36% of calls use less than 16 bytes, and the 1-8
> >> byte range is more likely than 9-16 bytes, so you should optimize that
> >> case well.
> >>
> >> See the number-of-calls graph for the strcpy function at
> >>
> >> http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html
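
As an illustration of what optimizing the short case could mean, here is a
C-level sketch (the actual patch is AArch64 assembly, not reproduced here;
the function name and the cutoff of 8 bytes are assumptions made for this
example, not taken from the patch):

#include <stddef.h>

/* Hypothetical sketch: copy the first up-to-8 bytes (including the NUL)
   with straight-line byte copies, since most calls are expected to end
   there, and only fall back to a generic loop for longer strings.  */
char *
sketch_strcpy (char *dst, const char *src)
{
  size_t i;

  /* Hot path: the short strings that dominate the call distribution.  */
  for (i = 0; i < 8; i++)
    if ((dst[i] = src[i]) == '\0')
      return dst;

  /* Cold path: a real implementation would switch to aligned,
     word-at-a-time copying here.  */
  for (; (dst[i] = src[i]) != '\0'; i++)
    continue;
  return dst;
}
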
> >>
> >> My main three strcpy users are mutt, firefox, and bash/sh. As the first
> >> two are interactive it's hard to benchmark them directly, so you should
> >> check how the change affects bash.
> >>
> >> You could measure bash running time directly, as it uses strcpy
> >> relatively often. I measured the following bash script, and if you
> >> LD_PRELOAD a byte-by-byte loop [A] it decreases performance by 10%.
> >>
> >> http://kam.mff.cuni.cz/~ondra/bashtest.sh
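
For reference, a byte-by-byte loop of the kind meant by [A] can be interposed
with a few lines of C; a minimal sketch of how such a test could be
reproduced (the file name is a placeholder, and this is not necessarily the
exact preload behind the 10% figure):

/* slow_strcpy.c: naive byte-at-a-time strcpy to LD_PRELOAD over the
   system one, roughly:

     gcc -O2 -fno-builtin -fPIC -shared -o slow_strcpy.so slow_strcpy.c
     LD_PRELOAD=$PWD/slow_strcpy.so bash bashtest.sh
*/
char *
strcpy (char *dst, const char *src)
{
  char *ret = dst;
  while ((*dst++ = *src++) != '\0')
    continue;
  return ret;
}
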
> > 
> > I gave this benchmark a try and, at least on powerpc64, I could not see
> > *any* real strcpy usage when running bash (version 4.3.11).  Profilers
> > also do not show strcpy as a hotspot.  Also, and I don't know if it is
> > intentional, the script invokes 'gnuplot2'.
> > 
> > I also see that your strcpy characterization comes directly from desktop
> > workloads, which is a limited usage pattern.  On powerpc64, for example,
> > 'firefox' and 'mutt' are far from the default workloads running on it.
> > 
> > So the question is not whether '1-8 *is* the hot path' or whether your
> > profiler shows that most calls are less than 16 bytes, but what kind of
> > workloads the aarch64 implementation is trying to address here.  It would
> > be good if the patch proposal described better what kind of testing was
> > done, e.g. whether only the GLIBC benchtests were run or whether it is
> > trying to optimize for real-world usage.
> > 
> 
> I'm not targeting a specific workload; the code is intended to be
> generic.  On that basis I made the following assumptions:
> 
> String lengths would likely follow a negative exponential distribution
> over a large set of workloads.  Individual workloads would undoubtedly
> not follow such a distribution.
> 
> Sequential calls to the string routine would not have a predictable
> alignment or length of the source string, so branch predictors would be
> unlikely to be able to accurately predict the outcome of most branches.
> 
Which is true, but for a slightly different reason: the branch predictor's
capacity is limited, and when there is a considerable delay between calls,
even a branch that could be perfectly predicted will have been evicted.

In that situation knowing about static prediction helps; most sensible archs
assume that a branch with no predictor entry will not be taken.
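
In C the usual way to exploit this is GCC's __builtin_expect, which biases
basic-block layout so the expected outcome becomes the fall-through path.
A minimal, self-contained sketch (the macro names and the example function
are made up for illustration; nothing here is taken from the patch):

#include <stdio.h>

/* __builtin_expect (expr, expected) returns expr but tells the compiler
   which value to expect, so the likely successor block is placed on the
   fall-through path, i.e. the direction a cold, statically predicted
   branch is assumed to take on most archs.  */
#define LIKELY(x)    __builtin_expect ((x) != 0, 1)
#define UNLIKELY(x)  __builtin_expect ((x) != 0, 0)

/* Hypothetical example: the error test almost never fires, so its
   handling code is kept off the straight-line path.  */
static long
checked_div (long a, long b)
{
  if (UNLIKELY (b == 0))
    {
      fprintf (stderr, "division by zero\n");
      return 0;
    }
  return a / b;
}

int
main (void)
{
  printf ("%ld\n", checked_div (10, 3));
  return 0;
}
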

