This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Split mantissa calculation loop and add branchprediction to mp multiplication


On Thu, 2013-01-03 at 09:08 +0530, Siddhesh Poyarekar wrote:
> On Wed, Jan 02, 2013 at 02:20:13PM -0600, Steven Munroe wrote:
> > I do not understand what you are doing here. If the intent is to replace
> > the X[], Y[], Z[] doubles with ints, you will get overflows in Z[]. If
> > you are changing X[], Y[], Z[] to uint64_t, then you avoid the
> > overflows, but (Z[k] + CUTTER) - CUTTER has no effect and you have not
> > saved any space. Also, u is still a double, so you are adding some
> > expensive int->float->int conversions to the inner loop.
> 
> I don't just convert the mantissa to int and leave everything else as
> is.  I had posted the patch to do that earlier, which has not been
> commented upon yet, and that's the one you should be looking at; this
> patch has a different purpose:
> 
> http://sourceware.org/ml/libc-alpha/2012-12/msg00354.html
> 
> None of the problems you're claiming will exist because:
> 
> (1) The product is computed and stored in a 64-bit integer
> 
> (2) u does not exist since it is replaced by a much simpler operation,
>     which results in that snippet looking like this:
> 
>     int64_t tmp = Z[k];
>     for (i=i1,j=i2-1; i<i2; i++,j--)
>       tmp += (int64_t) X[i]*Y[j];
> 
>     Z[k]  = (int) (tmp % (1 << 24));
>     Z[--k] = (int) (tmp / (1 << 24));
> 
This is very bad for POWER. PowerPC has (multiple) independent
fixed-point and floating-point pipelines. This allows super-scalar,
out-of-order execution, UNTIL you force a transfer (through memory)
between the FPRs and GPRs. PowerPC has lots of registers (32+32+32), and
we expect the compiler to keep lots of data in those registers, so we
don't optimize the hardware for a dependent load after a store; we
optimize for memory bandwidth.

Your proposed code forces an (unnecessary) double->long conversion and
an FPR-to-GPR transfer into the inner loop, disabling any super-scalar
parallel execution. It also prevents loop unrolling and does not allow
GCC to make good use of all those registers we provide in the
architecture.

So your code is optimized for (register-poor, in-order execution) x86 at
the expense of PowerPC.

