This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



[RFC] Add inline isinf, isnan, ...


On Wed, May 27, 2015 at 03:57:10PM +0100, Wilco Dijkstra wrote:
> Ondřej Bílka wrote:
> > I raised this issue before but didn't write a patch, so I should do it now.
> > I would be silent about glibc as it shares same flaw as gcc.
> > 
> > The main problem is that these functions try to be branchless, which causes
> > a performance regression for most applications versus branched code.
> 
> Being branchless is one issue indeed, but the main issue is that they are never
> inlined on any target, as GLIBC headers explicitly disable inlining by GCC.
> 
> > A problem is that predicted branch is free while conditional store
> > always cost cycle. So you need to have unpredictable branch to get
> > performance gain. When branch is 95% predicted then branchless code
> > wouldn't pay for itself if it adds one cycle versus branched and
> > misprediction costs 20 cycles.
> > 
> > And NaN is quite exceptional value so branches will almost always be
> > predicted. Otherwise user has other problems, like that if 5% of his
> > data are NaN's then result will likely be garbage.
> > 
> > Then you have the problem that with modern gcc you likely won't save a branch.
> > Most of these functions are surrounded by an if. From gcc-4.9 on it will
> > optimize out that branch as it's predicated, and it results in simpler
> > code.
> > 
> > More evidence about that is that I took assembly of benchmark below and
> > changed conditional move to jump which improves performance back by 10%
> > 
> > For showing that I wrote simple example of branched isinf that is around
> > 10% faster than builtin.
> 
> Note the GCC built-ins are actually incorrect and should not be used until
> they are fixed to use integer arithmetic. The GLIBC versions are never
> inlined on any target, and adding generic inline implementations gives a 4-6
> times speedup. Isnan, isnormal, isfinite, issignaling are equally trivial,
> needing ~3-4 instructions. An optimized fpclassify implementation seems
> small enough to be fully inlineable (useful given it is used in lots of
> complex math functions), however it could be partially inlined like:
> 
> __glibc_likely(isnormal(x)) ? FP_NORMAL : __fpclassify(x)
> 
> Just checking, are you planning to post patches for these?
> 
Could you take care of the actual patch? I have other projects right now, so
I wouldn't track it effectively.

What remains is mainly technical: which header to use and how to check that
the architecture uses IEEE 754. It would also need adjustments for floats
and 32-bit architectures.

So I will just post what I think the inline implementations should be. One
could try to improve running time and size.

As for isinf, I couldn't improve it without adding code size on the hot path.
The problem with the alternatives is that when the user writes

if (isinf(x))

they cannot be reduced the way this version can, where the first comparison
alone handles that case.

So I would still use the following. On 32-bit targets I would first check
whether the lower half is zero, then do the same trick with the upper half.
The big-endian adaptation is mechanical and a bit boring.

extern __always_inline 
int
isinf (double dx)
{
  union {
    double d;
    uint64_t l;
  } u;
  u.d = dx;
  return 2 * u.l == 0xffe0000000000000 ? (u.l == 0x7ff0000000000000 ? 1 : -1) : 0;
}
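
The 32-bit variant described above could look roughly like the sketch below.
It is not part of the posted code: the name isinf_32bit is hypothetical, and
it assumes IEEE 754 binary64 with the same byte order for doubles and 64-bit
integers. The halves are extracted by shifting, so the sketch itself does not
depend on endianness.

```c
#include <stdint.h>

/* Hypothetical helper, not a glibc function.  Returns 1 for +inf,
   -1 for -inf and 0 otherwise, touching only 32-bit quantities after
   the two halves have been split out.  */
static inline int
isinf_32bit (double dx)
{
  union {
    double d;
    uint64_t l;
  } u;
  u.d = dx;
  uint32_t lo = (uint32_t) u.l;          /* Low 32 mantissa bits.  */
  uint32_t hi = (uint32_t) (u.l >> 32);  /* Sign, exponent, top 20 mantissa bits.  */

  /* An infinity has an all-zero low half, so most inputs exit here.  */
  if (lo != 0)
    return 0;

  /* Doubling drops the sign bit; an infinity leaves exactly 0xffe00000.  */
  return (uint32_t) (2 * hi) == 0xffe00000u
         ? (hi == 0x7ff00000u ? 1 : -1)
         : 0;
}
```

Whether checking the low half first is a win depends on the input mix, since
values with a zero low word still need the second comparison.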

If I cared only about performance and not size, I would have tried the
following, which makes the additional assumption that powers of two are also
rare inputs.

int
isinf (double dx)
{
  union u {
    double d;
    uint64_t l;
  };
  union u u;
  u.d = dx;

  if (__builtin_expect ((u.l << 12) != 0, 1))
    return 0;

  return (u.l == 0x7ff0000000000000) ? 1 :
         (u.l == 0xfff0000000000000 ? -1 : 0);
}



As a baseline check I would use finite/isfinite. The problem with the other
functions is code size: on x86-64, creating a big constant takes a 10-byte
instruction. The following doesn't need any constants and could be used as
a baseline.

But first, a macro to extract the biased exponent from the 64-bit
representation of a double:

#define EXPONENT(x) (((x) << 1) >> 53)

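extern __always_inline
int
finite (double d)
{
  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;  /* Load the argument so u.l holds its bit pattern.  */
  /* Exponent 2047 means infinity or NaN; anything else is finite.  */
  return __builtin_expect (EXPONENT (u.l) != 2047, 1);
}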
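
For the float adjustment mentioned earlier, the same exponent test carries
over with different shift counts. This is only a sketch under the assumption
of IEEE 754 binary32; EXPONENT_F and finite_f are hypothetical names chosen
for illustration.

```c
#include <stdint.h>

/* Biased exponent of a binary32 bit pattern: shift out the sign bit,
   then keep the top 8 bits.  Hypothetical macro name.  */
#define EXPONENT_F(x) ((uint32_t) ((x) << 1) >> 24)

/* Hypothetical float counterpart of the finite sketch above.  */
static inline int
finite_f (float f)
{
  union {
    float f;
    uint32_t i;
  } u;
  u.f = f;
  /* Exponent 255 means infinity or NaN for binary32.  */
  return __builtin_expect (EXPONENT_F (u.i) != 255, 1);
}
```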

For isnan you should first check !finite however you implement it.

extern __always_inline
int
isnan (double d)
{
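  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;  /* Load the argument so u.l holds its bit pattern.  */
  /* NaN: exponent all ones and a nonzero mantissa.  */
  return (EXPONENT (u.l) == 2047) && (u.l << 12) != 0;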
}

For isnormal you check the exponent range 1..2046 by exploiting unsigned
comparison:

extern __always_inline
int
isnormal (double d)
{
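  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;  /* Load the argument so u.l holds its bit pattern.  */
  /* Exponent 0 wraps to a huge unsigned value, so one compare covers 1..2046.  */
  return (EXPONENT (u.l) - 1 < 2046);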
}
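
To spell out why one unsigned comparison covers both ends of the range, here
is the test isolated on an already extracted biased exponent. The helper name
exponent_is_normal is hypothetical and exists only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: takes a biased exponent in 0..2047 and reports
   whether it belongs to a normal number.  */
static inline int
exponent_is_normal (uint64_t exponent)
{
  /* Subtracting 1 maps 1..2046 onto 0..2045, which is below 2046.
     Exponent 0 wraps around to UINT64_MAX and 2047 becomes 2046,
     so both fail the same comparison.  */
  return exponent - 1 < 2046;
}
```

Zero and subnormals (exponent 0) and infinity and NaN (exponent 2047) are
rejected by the same compare, with no second branch.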

An issignaling would be similar, but Wikipedia isn't too clear about the
format, so I'll leave it for now.
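
For reference, a possible shape for it is sketched below. It assumes the
convention recommended by IEEE 754-2008 and used on x86 and ARM, where the top
mantissa bit (bit 51) set means a quiet NaN. Some targets, such as legacy MIPS
and PA-RISC, use the opposite meaning, so this would need a per-architecture
switch. The names QUIET_BIT, issignaling_bits and issignaling_sketch are
hypothetical.

```c
#include <stdint.h>

/* Assumption: bit 51 set marks a quiet NaN (x86, ARM and similar).  */
#define QUIET_BIT 0x0008000000000000ULL

/* Hypothetical helper working directly on the 64-bit pattern.  */
static inline int
issignaling_bits (uint64_t l)
{
  uint64_t exponent = (l << 1) >> 53;  /* Biased exponent, sign dropped.  */
  uint64_t mantissa = l << 12;         /* Nonzero iff any mantissa bit is set.  */
  /* A signaling NaN is a NaN whose quiet bit is clear.  */
  return exponent == 2047 && mantissa != 0 && (l & QUIET_BIT) == 0;
}

/* Hypothetical wrapper taking a double, in the style of the other sketches.  */
static inline int
issignaling_sketch (double d)
{
  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;
  return issignaling_bits (u.l);
}
```

Note that on some targets merely passing a signaling NaN through a floating
point register can quiet it, which is one more reason to treat this as a
sketch.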

And to answer about fpclassify: I would also inline it. The reason is that
users will likely use it within the same function, and gcc could optimize
unneeded branches away, so size wouldn't be that much of a problem. The trick
would be to use the correct order.

extern __always_inline
int
fpclassify (double d)
{
  if (isnormal (d))
    return FP_NORMAL;
  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;
  uint64_t mantissa = u.l << 12;
  uint64_t exponent = EXPONENT(u.l);
  /* isnormal failed, so exponent is either 0 or 2047 here.  */
  if (mantissa && exponent)
    return FP_NAN;
  if (mantissa && !exponent)
    return FP_SUBNORMAL;
  if (!mantissa && exponent)
    return FP_INFINITE;
  /* Only !mantissa && !exponent remains.  */
  return FP_ZERO;
}
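
For comparison, the partially inlined form suggested in the quoted text
(normal check inline, everything else out of line) might be sketched as
follows. The names classify_slow and classify_sketch are hypothetical
stand-ins, not glibc symbols, and the noinline attribute assumes GCC or a
compatible compiler.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical out-of-line slow path.  Precondition: the biased
   exponent of BITS is 0 or 2047, i.e. the value is not normal.  */
static __attribute__ ((noinline)) int
classify_slow (uint64_t bits)
{
  uint64_t mantissa = bits << 12;
  uint64_t exponent = (bits << 1) >> 53;
  if (exponent)
    return mantissa ? FP_NAN : FP_INFINITE;
  return mantissa ? FP_SUBNORMAL : FP_ZERO;
}

/* Hypothetical inline fast path: one range check for normal numbers,
   then a call for the rare cases.  */
static inline int
classify_sketch (double d)
{
  union {
    double d;
    uint64_t l;
  } u;
  u.d = d;
  uint64_t exponent = (u.l << 1) >> 53;
  return __builtin_expect (exponent - 1 < 2046, 1)
         ? FP_NORMAL
         : classify_slow (u.l);
}
```

This keeps the inlined part to a few instructions, at the cost of a call
whenever the input is zero, subnormal, infinite or NaN.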

