This is the mail archive of the
gsl-discuss@sources.redhat.com
mailing list for the GSL project.
Re: [Fwd: Re: random variate from power exponential distribution: continue]
On Wed, Oct 20, 2004 at 07:13:12AM -0400, Robert G. Brown was heard to remark:
...
> with trancendental function calls or not (these can be implemented in
> hardware or software with significantly varying speeds).
FYI, to hype the powerpc a bit: about a decade ago, I helped get
three new instructions into the powerpc instruction set: fres,
frsqrte and fsel.
The first provides an estimate for the reciprocal of a number, the
second provides an estimate for the square root. Each instruction
completes in one clock cycle. Normally, float-point division and
also square roots take lots of cpu cycles, as they require iteration
of an algorithm till it converges. The above instructions allow the
algorithm loops to be unrolled, and interleaved with other instructions,
avoiding pipeline stalls. You can double or triple performance
if your inner loop has lots of divides or square roots (and assuming
your compiler actually generates these insn's for division and sqaure
roots.)
Another serious cpu-cycle killer is the simple if-test, since
the cpu must figure out which branch is taken, then fetch the next
instruction after deciding the outcome of the test. It can take
many, many cycles to do an if test, and then it can take many more
cycles to get the next instruction from memory (if its not in cache
already) (a cache miss can takes *hundreds* of cpu cycles, where the
cpu sits *idle* while it waits for memory. Hundreds!! That's a
factor of a hundred slow-down if your algorithm bounces all over
system memory). Modern processors try to cope with this through
branch prediction, speculative execution and hardware threading.
But there is a special case:
double a,b,c,d;
if (a>0.0) {d = b; } else {d = c; }
which looks like a branch, but need not be. On the powerpc, this
is a single instruction (the fsel instruction) that executes in
*one* clock cycle. It allows blazingly-fast execution of the
bresenham algorithm or in fact any algorithm which makes floating
point value assignments based on the value of another float.
(e.g. min/max) It requires no branches, no branch prediction,
no speculative execution, no cache misses, its really amazing
what it can do for an algo that uses it.
--linas