This is the mail archive of the gsl-discuss@sources.redhat.com mailing list for the GSL project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [Fwd: Re: random variate from power exponential distribution: continue]


On Wed, Oct 20, 2004 at 07:13:12AM -0400, Robert G. Brown was heard to remark:
...
> with trancendental function calls or not (these can be implemented in
> hardware or software with significantly varying speeds).

FYI, to hype the powerpc a bit:  about a decade ago, I helped get 
three new instructions into the powerpc instruction set: fres, 
frsqrte and fsel.  

The first provides an estimate for the reciprocal of a number, the
second provides an estimate for the square root. Each instruction
completes in one clock cycle.  Normally, float-point division and 
also square roots take lots of cpu cycles, as they require iteration 
of an algorithm till it converges.  The above instructions allow the
algorithm loops to be unrolled, and interleaved with other instructions, 
avoiding pipeline stalls.  You can double or triple  performance
if your inner loop has lots of divides or square roots (and assuming
your compiler actually generates these insn's for division and sqaure
roots.)

Another serious cpu-cycle killer is the simple if-test, since
the cpu must figure out which branch is taken, then fetch the next
instruction after deciding the outcome of the test.  It can take
many, many cycles to do an if test, and then it can take many more
cycles to get the next instruction from memory (if its not in cache
already)  (a cache miss can takes *hundreds* of cpu cycles, where the 
cpu sits *idle* while it waits for memory.  Hundreds!!  That's a
factor of a hundred slow-down if your algorithm bounces all over 
system memory).  Modern processors try to cope with this through
branch prediction, speculative execution and hardware threading.

But there is a special case:

double a,b,c,d;
if (a>0.0) {d = b; } else {d = c; }

which looks like a branch, but need not be. On the powerpc, this
is a single instruction (the fsel instruction) that executes in 
*one* clock cycle.  It allows blazingly-fast execution of the 
bresenham algorithm or in fact any algorithm which makes floating 
point value assignments based on the value of another float.
(e.g. min/max) It requires no branches, no branch prediction, 
no speculative execution, no cache misses, its really amazing 
what it can do for an algo that uses it.

--linas


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]