This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 30 Jun 2015 05:14:09 +0200
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663> <20150629211831 dot GA23965 at domone> <5591BD23 dot 6090501 at linaro dot org>
On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>
>
> > On 29-06-2015 18:18, Ondřej Bílka wrote:
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub but also wants to be able to unit test code on existing
> >> P7 machines. Which results in something like this:
> >>
> >> static inline vui32_t
> >> vec_addcuq (vui32_t a, vui32_t b)
> >> {
> >> vui32_t t;
> >>
> >> if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
> >> {
> >>
> >> __asm__(
> >> "vaddcuq %0,%1,%2;"
> >> : "=v" (t)
> >> : "v" (a),
> >> "v" (b)
> >> : );
> >> }
> >> else
> >> {
> >> vui32_t c, c2, co;
> >> vui32_t z= {0,0,0,0};
> >> __asm__(
> >> "vaddcuw %3,%4,%5;\n"
> >> "\tvadduwm %0,%4,%5;\n"
> >> "\tvsldoi %1,%3,%6,4;\n"
> >> "\tvaddcuw %2,%0,%1;\n"
> >> "\tvadduwm %0,%0,%1;\n"
> >> "\tvor %3,%3,%2;\n"
> >> "\tvsldoi %1,%2,%6,4;\n"
> >> "\tvaddcuw %2,%0,%1;\n"
> >> "\tvadduwm %0,%0,%1;\n"
> >> "\tvor %3,%3,%2;\n"
> >> "\tvsldoi %1,%2,%6,4;\n"
> >> "\tvadduwm %0,%0,%1;\n"
> >> : "=&v" (t), /* 0 */
> >> "=&v" (c), /* 1 */
> >> "=&v" (c2), /* 2 */
> >> "=&v" (co) /* 3 */
> >> : "v" (a), /* 4 */
> >> "v" (b), /* 5 */
> >> "v" (z) /* 6 */
> >> : );
> >> t = co;
> >> }
> >> return (t);
> >> }
> >>
> >> So it is clear to me that executing 14+ instructions to decide if I can
> >> optimize to use a new single-instruction optimization is not a good deal.
> >>
> > No, this is prime example that average programmers shouldn't use hwcap
> > as that results in moronic code like this.
> >
> > When you poorly reinvent wheel you will get terrible performance like
> > fallback here. Gcc already has 128 ints so tell average programmers to
> > use them instead and don't touch features that they don't understand.
>
> Again your patronizing tone only shows your lack of knowledge in this
> subject: the above code aims to use ISA 2.07
Sorry, but could you explain how you came to the conclusion that it uses ISA
2.07? The only check done is
> >> if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
And VSX is part of ISA 2.06.
> *vector* instructions to
> multiply 128-bits integer in vector *registers*.
Your sentence has three problems.
1. From the start of the mail:
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub
2. The function is named vec_addcuq, so according to the name it does...? I
guess division.
3. The Power ISA describes these instructions pretty clearly:
Vector Add and Write Carry-Out Unsigned Word    VX-form

    vaddcuw VRT,VRA,VRB
    (primary opcode 4, fields VRT | VRA | VRB | XO 384, at bits 0 | 6 | 11 | 16 | 21-31)

    do i=0 to 127 by 32
        aop <- EXTZ((VRA)[i:i+31])
        bop <- EXTZ((VRB)[i:i+31])
        (VRT)[i:i+31] <- Chop((aop +int bop) >>ui 32, 1)
For each vector element i from 0 to 3, do the following.
Unsigned-integer word element i in VRA is added
to unsigned-integer word element i in VRB. The
carry out of the 32-bit sum is zero-extended to 32
bits and placed into word element i of VRT.
Special Registers Altered:
None
If you still believe that it somehow does multiplication, just try this
and see that the result is all zeroes:

__vector uint32_t x = {3,2,0,3}, y = {0,0,0,0};
y = vec_addcuq(x, x);
printf("%u %u %u %u\n", y[0], y[1], y[2], y[3]);
Again your patronizing tone only shows your lack of knowledge of powerpc
assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/
I made a mistake in that I read a bit too fast and saw only an add instead of
the instruction to get the carry. Still, with GPRs that's two additions with
carry, then an add-zero-with-carry to set the desired bit.
> It has nothing to do
> with uint128_t support on GCC and only recently GCC added support to
> such builtins [1]. And although there is plan to add support to use
> vector instructions for uint128_t, right now they are done in GPR registers
> in powerpc.
>
The customer just wants to do 128-bit additions. If the fastest way
is with GPR registers, then he should use GPR registers.
My claim was that this leads to slow code on POWER7. The fallback above
takes 14 cycles on POWER8, and a 128-bit GPR addition is similarly slow.
Yes, you could craft expressions that exploit vectors by doing ands/ors
with 128-bit constants, but if you mostly need to sum integers and use 128
bits to prevent overflows, then GPRs are the correct choice due to the
transfer cost.
> Also, it is up to developers to select the best way to use the CPU
> features. Although I am not very found of providing the hwcap in TCB
> (my suggestion was to use local __thread in libgcc instead), the idea
> here is to provide *tools*.
>
If you want to provide tools, then you should try to make the best tool
possible instead of being satisfied with a tool that poorly fits the job and
is dangerous to use.
I keep saying that there are better alternatives where this
doesn't matter.
One example would be to write a gcc pass that runs after early inlining to
find all functions containing __builtin_cpu_supports, clone them to
replace the builtin with a constant, and add an ifunc to automatically select
the variant. You would also need to keep a list of existing processor feature
combinations to rule out nonexistent ones; that's the easiest way to avoid
combinatorial explosion.
> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>
> >
> > As gcc compiles addition into pair of addc, adde instructions a
> > performance gain is minimal while code is harder to maintain. Due to
> > pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> > on following example on power8.
> >
> >
> > int main()
> > {
> > unsigned long i;
> > __int128 u = 0;
> > //long u = 0;
> > for (i = 0; i < 1000000000; i++)
> > u += i * i;
> > return u >> 35;
> > }
> >
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out
> >
> > real 0m0.957s
> > user 0m0.956s
> > sys 0m0.001s
> >
> > [neleai@gcc2-power8 ~]$ vim uu.c
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out
> >
> > real 0m1.040s
> > user 0m1.039s
> > sys 0m0.001s
>
> This is due the code is not using any vector instruction, which is the aim of the
> code snippet Steven has posted.
Wait, do you want fast code, or just to show off your elite skills
with vector registers?
A 128-bit vector addition on POWER7 is a lot slower than a 128-bit addition
in GPRs. This is a valid use case: I produce 64-bit integers and want to
compute their sum in a 128-bit variable. You could construct a lot of use
cases where GPRs win, for example summing an array (possibly with an
arithmetic expression applied).
Unless you show real-world examples, how could you prove that vector
registers are the better choice?
> Also, it really depends in which mode the CPU is
> set, on a POWER split-core mode, where the CPU dispatch groups are shared among
> threads in an non-dynamic way the difference is bigger:
>
> [[fedora@glibc-ppc64le ~]$ time ./test
>
> real 0m1.730s
> user 0m1.726s
> sys 0m0.003s
> [fedora@glibc-ppc64le ~]$ time ./test-long
>
> real 0m1.593s
> user 0m1.591s
> sys 0m0.002s
>
Difference? What difference? Only the ratio matters, to factor out things like
different processor frequencies and the constant slowdown from sharing between
threads. When I do the math, the difference between these two ratios is 0.06%:
1.593/1.730 = 0.9208092485549133
0.957/1.040 = 0.9201923076923076
> >
> >
> >
> >> One instruction (plus the __builtin_cpu_supports, which should be an
> >> immediate, branch conditional) is a better deal. Inlining so the
> >> compiler can do common sub-expression about larger blocks is an even
> >> better deal.
> >>
> > That doesn't change the fact that it's a mistake. The code above was bad as
> > it added a check for a single instruction that takes a cycle. When the
> > difference between implementations is a few cycles, then each cycle matters
> > (otherwise you should just stick to the generic one). Then the hwcap check
> > itself causes a slowdown that matters, and you should use an ifunc to
> > eliminate it.
> >
> > Or hope that it's moved out of the loop; but when it's a loop with 100
> > iterations, the __builtin_cpu_supports time becomes immaterial.
> >
> >
> >> I just do not understand why there is so much resistance to this simple
> >> platform ABI specific request.
> >
> >