This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


RE: [PATCH] Add math-inline benchmark

From: "Wilco Dijkstra" <wdijkstr at arm dot com>
To: 'Ondřej Bílka' <neleai at seznam dot cz>
Cc: "GNU C Library" <libc-alpha at sourceware dot org>
Date: Fri, 10 Jul 2015 17:09:16 +0100
Subject: RE: [PATCH] Add math-inline benchmark
Authentication-results: sourceware.org; auth=none
References: <001c01d0a912$42357710$c6a06530$ at com> <20150622083657 dot GA3684 at domone> <000701d0b7fb$0f27b840$2d7728c0$ at com> <20150709124454 dot GA29625 at domone>

> Ondřej Bílka wrote:
> On Mon, Jul 06, 2015 at 03:50:11PM +0100, Wilco Dijkstra wrote:
> >
> >
> > > Ondřej Bílka wrote:
> > > But once latency is hidden by using the argument first, even isnan and
> > > isnormal suddenly become regressions.
> > >
> > >     for (i = 0; i < n; i++){ res += 3*sin(p[i] * 2.0);    \
> > >       if (func (p[i] * 2.0)) res += 5;}                   \
> > >
> > >
> > > __fpclassify_test2_t:   92929.4 37256.8
> > > __fpclassify_test1_t:   94020.1 35512.1
> > >       __fpclassify_t:   17321.2 13325.1
> > >         fpclassify_t:   8021.29 4376.89
> > >    __isnormal_inl2_t:   93896.9 38941.8
> > >     __isnormal_inl_t:   98069.2 46140.4
> > >           isnormal_t:   94775.6 36941.8
> > >       __finite_inl_t:   84059.9 38304
> > >           __finite_t:   96052.4 45998.2
> > >           isfinite_t:   93371.5 36659.1
> > >        __isinf_inl_t:   92532.9 36050.1
> > >            __isinf_t:   95929.4 46585.2
> > >              isinf_t:   93290.1 36445.6
> > >        __isnan_inl_t:   82760.7 37452.2
> > >            __isnan_t:   98064.6 45338.8
> > >              isnan_t:   93386.7 37786.4
> >
> > Can you try this with:
> >
> >     for (i = 0; i < n; i++)                               \
> >       { double tmp = p[i] * 2.0;    \
> >       if (sin(tmp) < 1.0) res++; if (func (tmp)) res += 5;}                   \
> >
> That doesn't change the outcome:
> 
> __fpclassify_test2_t: 	99721	51051.6
> __fpclassify_test1_t: 	85015.2	43607.4
>       __fpclassify_t: 	13997.3	10475.1
>         fpclassify_t: 	13502.5	10253.6
>    __isnormal_inl2_t: 	76479.4	41531.7
>     __isnormal_inl_t: 	76526.9	41560.8
>           isnormal_t: 	76458.6	41547.7
>       __finite_inl_t: 	71108.6	33271.3
>           __finite_t: 	73031	37452.3
>           isfinite_t: 	73024.9	37447
>        __isinf_inl_t: 	68599.2	32792.9
>            __isinf_t: 	74851	40108.8
>              isinf_t: 	74871.9	40109.9
>        __isnan_inl_t: 	71100.8	33659.6
>            __isnan_t: 	72914	37592.4
>              isnan_t: 	72909.4	37635.8

That doesn't look correct. It looks like this didn't use the built-ins at all;
did you forget to apply the patch?

Anyway, I received a new machine, so GLIBC now finally builds for x64. Since
there appear to be large variations from run to run, I repeat the same tests
4 times by copying the FOR_EACH_IMPL loop. The first 1 or 2 runs are bad; the
last 2 converge to usable results. So I suspect frequency scaling is an issue
here.
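
A minimal sketch of the repeat-and-discard idea described above, assuming a
standalone harness rather than the benchtests framework: the same loop is
timed several times and only the runs after the warm-up are averaged. The
helper names (time_one_pass, bench_after_warmup) are hypothetical, and clock()
is used only because it is portable; it is coarser than what a real benchmark
would use.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <time.h>

/* Time one pass over p[0..n-1], in seconds of CPU time.  *hits receives
   the accumulated result so the loop has an observable effect.  */
static double time_one_pass(const double *p, size_t n, int *hits)
{
    volatile int res = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        if (isnan(p[i] * 2.0))
            res += 5;
    clock_t t1 = clock();
    *hits = res;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: run `runs` passes, ignore the first `warmup`
   passes (e.g. while the CPU frequency ramps up), and return the mean
   time of the remaining passes.  */
double bench_after_warmup(const double *p, size_t n, int runs, int warmup,
                          int *hits)
{
    assert(warmup >= 0 && runs > warmup);
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        double secs = time_one_pass(p, n, hits);
        if (r >= warmup)
            total += secs;
    }
    return total / (runs - warmup);
}
```

Calling bench_after_warmup(p, n, 4, 2, &hits) corresponds to the "4 runs, keep
the last 2" setup mentioned above.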

Without the sin(tmp) part I get:

   remainder_test2_t:   40786.9 192862
   remainder_test1_t:   43008.2 196311
__fpclassify_test2_t:   2856.56 3020.12
__fpclassify_test1_t:   3043.53 3135.89
      __fpclassify_t:   12500.6 10152.5
        fpclassify_t:   2972.54 3047.65
   __isnormal_inl2_t:   4619.55 14491.1
    __isnormal_inl_t:   12896.3 10306.7
          isnormal_t:   4254.42 3667.87
      __finite_inl_t:   3979.58 3991.6
          __finite_t:   7039.61 7039.37
          isfinite_t:   2992.65 2969.25
       __isinf_inl_t:   2852.1  3239.23
           __isinf_t:   8991.81 8813.44
             isinf_t:   3241.75 3241.54
       __isnan_inl_t:   4003.51 3977.73
           __isnan_t:   7054.54 7054.5
             isnan_t:   2819.66 2801.94

And with the sin() addition:

   remainder_test2_t:   105093  214635
   remainder_test1_t:   106635  218012
__fpclassify_test2_t:   64290.9 32116.6
__fpclassify_test1_t:   64365.1 32310.2
      __fpclassify_t:   72006.1 41607
        fpclassify_t:   64190.3 33450.1
   __isnormal_inl2_t:   65959.1 33672
    __isnormal_inl_t:   71875.7 41727.3
          isnormal_t:   65676.1 32826.1
      __finite_inl_t:   69600.6 35293.3
          __finite_t:   67653.8 38627.2
          isfinite_t:   64435.9 34904.9
       __isinf_inl_t:   68556.6 33176
           __isinf_t:   69066.4 39562.7
             isinf_t:   64755.5 34244.6
       __isnan_inl_t:   69577.3 34776.2
           __isnan_t:   67538.8 38321.3
             isnan_t:   63963   33276.6

The remainder test is basically math/w_remainder.c adapted to use __isinf_inl
and __isnan_inl (test1) or the isinf/isnan built-ins (test2).
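
As a rough illustration of the two remainder variants, the sketch below shows
only the argument check that guards the error path, written once with
bit-pattern inlines (test1 style) and once with the GCC built-ins (test2
style). The guard condition is an assumption modelled on a wrapper of this
kind, not a copy of math/w_remainder.c, and sketch_isnan_inl and
sketch_isinf_inl are hypothetical stand-ins for __isnan_inl and __isinf_inl,
whose real definitions are in the patch.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-pattern checks for IEEE 754 binary64 doubles.  */
static inline int sketch_isnan_inl(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    /* NaN: exponent all ones and a non-zero mantissa.  */
    return (u & UINT64_C(0x7fffffffffffffff)) > UINT64_C(0x7ff0000000000000);
}

static inline int sketch_isinf_inl(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    /* Infinity: exponent all ones and a zero mantissa, either sign.  */
    return (u & UINT64_C(0x7fffffffffffffff)) == UINT64_C(0x7ff0000000000000);
}

/* test1 style: guard built from the inline checks.  Returns 1 when the
   arguments would take the error path (assumed condition).  */
int remainder_guard_inl(double x, double y)
{
    return (y == 0.0 && !sketch_isnan_inl(x))
           || (sketch_isinf_inl(x) && !sketch_isnan_inl(y));
}

/* test2 style: the same guard built from the GCC/Clang built-ins.  */
int remainder_guard_builtin(double x, double y)
{
    return (y == 0.0 && !__builtin_isnan(x))
           || (__builtin_isinf(x) && !__builtin_isnan(y));
}

/* Both variants are expected to classify every input identically.  */
void check_guards_agree(double x, double y)
{
    assert(remainder_guard_inl(x, y) == remainder_guard_builtin(x, y));
}
```

Only the classification differs between the two variants, which is why the
remainder timings are a useful check of isinf/isnan cost in surrounding code.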

From this it seems that __isinf_inl is slightly better than the builtin, but
it does not show up as a regression when combined with sin or in the remainder
test.

So I don't see any potential regression here on x64. In fact, it looks like
inlining using the built-ins gives quite good speedups across the board.
Besides inlining in applications that use GLIBC, it also inlines a lot of
call sites within GLIBC that weren't previously inlined.
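
To make "inlining using the built-ins" concrete, here is a small sketch of how
classification macros can forward to the GCC/Clang built-ins so that no
function call is emitted. The sketch_ names and count_nonfinite are
hypothetical; this shows the general technique and is not the contents of the
patch or of math.h.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Type-generic built-ins available in GCC 4.4+ and Clang.  */
#define sketch_isnan(x)      __builtin_isnan (x)
#define sketch_isinf(x)      __builtin_isinf_sign (x)
#define sketch_isfinite(x)   __builtin_isfinite (x)
#define sketch_isnormal(x)   __builtin_isnormal (x)
#define sketch_fpclassify(x) \
  __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, \
                        FP_SUBNORMAL, FP_ZERO, x)

/* Example call site: count the NaN and infinite entries in p[0..n-1].
   With the macros above the check compiles to inline code.  */
int count_nonfinite(const double *p, int n)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++)
        if (!sketch_isfinite(p[i]))
            count++;
    return count;
}
```

Note that __builtin_isinf_sign also reports the sign of an infinity (-1 or +1),
which a plain nonzero/zero isinf does not guarantee.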

> > Basically GCC does the array read and multiply twice just like you told it
> > to (remember this is not using -ffast-math). Also avoid adding unnecessary
> > FP operations and conversions (which may interact badly with timing the
> > code we're trying to test).
> >
> And how do you know that most users don't use fp conversions in their
> code just before isinf? These interactions make benchtests worthless as
> in practice a different variant would be faster than one that you
> measure.

You always get such interactions; it's unavoidable. That's why I added some
actual math code that uses isinf/isnan to see how it performs in real life.

> > For me the fixed version still shows the expected answer: the built-ins are
> > either faster or as fast as the inlines. So I don't think there is any
> > regression here (remember also that previously there were no inlines at all
> > except for a few inside GLIBC, so the real speedup is much larger).
> 
> That's ARM only. So it looks like we need platform-specific headers and testing.

Well, I just confirmed that the same gains apply to x64.

Wilco

Follow-Ups:
- Re: [PATCH] Add math-inline benchmark
  - From: Ondřej Bílka

References:
- RE: [PATCH] Add math-inline benchmark
  - From: Wilco Dijkstra
- Re: [PATCH] Add math-inline benchmark
  - From: Ondřej Bílka
