This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 29-06-2015 18:18, OndÅej BÃlka wrote:
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub but also wants to be able to unit test code on existing
> >> P7 machines. Which results in something like this:
> >>
> >> static inline vui32_t
> >> vec_addcuq (vui32_t a, vui32_t b)
> >> {
> >>         vui32_t t;
> >>
> >>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
> >>                 {
> >>                 
> >>                         __asm__(
> >>                             "vaddcuq %0,%1,%2;"
> >>                             : "=v" (t)
> >>                             : "v" (a),
> >>                               "v" (b)
> >>                             : );
> >>                 }
> >>                 else
> >>                         vui32_t c, c2, co;
> >>                         vui32_t z= {0,0,0,0};
> >>                         __asm__(
> >>                             "vaddcuw %3,%4,%5;\n"
> >>                             "\tvadduwm %0,%4,%5;\n"
> >>                             "\tvsldoi %1,%3,%6,4;\n"
> >>                             "\tvaddcuw %2,%0,%1;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             "\tvor %3,%3,%2;\n"
> >>                             "\tvsldoi %1,%2,%6,4;\n"
> >>                             "\tvaddcuw %2,%0,%1;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             "\tvor %3,%3,%2;\n"
> >>                             "\tvsldoi %1,%2,%6,4;\n"
> >>                             "\tvadduwm %0,%0,%1;\n"
> >>                             : "=&v" (t), /* 0 */
> >>                               "=&v" (c), /* 1 */
> >>                               "=&v" (c2), /* 2 */
> >>                               "=&v" (co) /* 3 */
> >>                             : "v" (a), /* 4 */
> >>                               "v" (b), /* 5 */
> >>                               "v" (z)  /* 6 */
> >>                             : );
> >>                         t = co;
> >>                 }
> >>         return (t);
> >> }
> >>
> >> So it is clear to me that executing 14+ instruction to decide if I can
> >> optimize to use new single instruction optimization is not a good deal.
> >>
> > No, this is prime example that average programmers shouldn't use hwcap
> > as that results in moronic code like this.
> > 
> > When you poorly reinvent wheel you will get terrible performance like
> > fallback here. Gcc already has 128 ints so tell average programmers to
> > use them instead and don't touch features that they don't understand.
> 
> Again your patronizing tone only shows your lack of knowledge in this
> subject: the above code aims to use ISA 2.07

Sorry, but could you explain how did you come to conclusion to use ISA
2.07? Only check done is

> >>                 if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))

And vsx is part of ISA 2.06

> *vector* instructions to
> multiply 128-bits integer in vector *registers*.

Your sentence has three problems.
1. From start of mail.
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub
2. Function is named vec_addcuq so accoring to name it does...? I guess
division.

3. Power isa describes these instructions pretty clearly:

Vector Add and Write Carry-Out Unsigned
Word                                        VX-form
vaddcuw      VRT,VRA,VRB
    4        VRT      VRA      VRB          384
0         6        11       16       21                31
do i=0 to 127 by 32
   aop      EXTZ((VRA)i:i+31)
   bop      EXTZ((VRB)i:i+31)
   VRTi:i+31    Chop( ( aop +int bop ) >>ui 32,1)
For each vector element i from 0 to 3, do the following.
    Unsigned-integer word element i in VRA is added
    to unsigned-integer word element i in VRB. The
    carry out of the 32-bit sum is zero-extended to 32
    bits and placed into word element i of VRT.
Special Registers Altered:
     None

If you still believe that it somehow does multiplication just try this
and see that result is all zeroes.

   __vector uint32_t x={3,2,0,3},y={0,0,0,0};
   y = vec_addcuq(x,x);
   printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);

Again your patronizing tone only shows your lack of knowledge of powerpc
assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/


I did mistake that I read to bit fast and seen only add instead of
instruction to get carry. Still thats with gpr two additions with carry,
then add zero with carry to set desired bit. 

> It has nothing to do
> with uint128_t support on GCC and only recently GCC added support to 
> such builtins [1]. And although there is plan to add support to use 
> vector instruction for uint128_t, right now they are done in GRP register
> in powerpc.
> 
Customer just wants to do 128 additions. If a fastest way
is with GPR registers then he should use gpr registers.

My claim was that this leads to slow code on power7. Fallback above
takes 14 cycles on power8 and 128bit addition is similarly slow.

Yes you could craft expressions that exploit vectors by doing ands/ors
with 128bit constants but if you mostly need to sum integers and use 128
bits to prevent overflows then gpr is correct choice due to transfer
cost.

> Also, it is up to developers to select the best way to use the CPU
> features.  Although I am not very found of providing the hwcap in TCB
> (my suggestion was to use local __thread in libgcc instead), the idea
> here is to provide *tools*.
>
If you want to provide tools then you should try to make best tool
possible instead of being satisfied with tool that poorly fits job and
is dangerous to use.

I am telling all time that there are better alternatives where this
doesn't matter.

One example would be write gcc pass that runs after early inlining to
find all functions containing __builtin_cpu_supports, cloning them to
replace it by constant and adding ifunc to automatically select variant.

You would also need to keep list of existing processor features to
remove nonexisting combinations. That easiest way to avoid combinatorial
explosion.



 
> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
> 
> > 
> > As gcc compiles addition into pair of addc, adde instructions a
> > performance gain is minimal while code is harder to maintain. Due to
> > pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> > on following example on power8.
> > 
> > 
> > int main()
> > {
> >   unsigned long i;
> >   __int128 u = 0;
> > //long u = 0;
> >   for (i = 0; i < 1000000000; i++)
> >     u += i * i;
> >   return u >> 35;
> > }
> > 
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out 
> > 
> > real	0m0.957s
> > user	0m0.956s
> > sys	0m0.001s
> > 
> > [neleai@gcc2-power8 ~]$ vim uu.c 
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out 
> > 
> > real	0m1.040s
> > user	0m1.039s
> > sys	0m0.001s
> 
> This is due the code is not using any vector instruction, which is the aim of the
> code snippet Steven has posted.

Wait do you want to have fast code or just show off your elite skills
with vector registers?

A vector 128bit addition is on power7 lot slower than 128bit addition in
gpr. This is valid use case when I produce 64bit integers and want to
compute their sum in 128bit variable. You could construct lot of use
cases where gpr wins, for example summing an array(possibly with applied
arithmetic expression).

Unless you show real world examples how could you prove that vector
registers are better choice?



>  Also, it really depends in which mode the CPU is
> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
> threads in an non-dynamic way the difference is bigger:
> 
> [[fedora@glibc-ppc64le ~]$ time ./test
> 
> real    0m1.730s
> user    0m1.726s
> sys     0m0.003s
> [fedora@glibc-ppc64le ~]$ time ./test-long
> 
> real    0m1.593s
> user    0m1.591s
> sys     0m0.002s
> 
Difference? What difference? Only ratio matters to remove things like
different frequency of processors and that thread sharing slows you down
by constant. When I do math difference between these two ratios is 0.06%

1.593/1.730 = 0.9208092485549133
0.957/1.040 = 0.9201923076923076



> > 
> > 
> >  
> >> One instruction (plus the __builtin_cpu_supports which should be and
> >> immediate,  branch conditional) is a better deal. Inlining so the
> >> compiler can do common sub-expression about larger blocks is an even
> >> better deal.
> >>
> > That doesn't change fact that its mistake. A code above was bad as it
> > added check for single instruction that takes a cycle. When difference
> > between implementations is few cycles then each cycle matter (otherwise
> > you should just stick to generic one). Then a hwcap check itself causes
> > slowdown that matters and you should use ifunc to eliminate.
> > 
> > Or hope that its moved out of loop, but when its loop with 100
> > iterations a __builtin_cpu_supports time becomes imaterial.
> > 
> > 
> >> I just do not understand why there is so much resistance to this simple
> >> platform ABI specific request.
> > 
> > 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]