This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: munroesj at linux dot vnet dot ibm dot com
- Cc: Richard Henderson <rth at twiddle dot net>, Szabolcs Nagy <szabolcs dot nagy at arm dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Mon, 29 Jun 2015 23:18:31 +0200
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663>
On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> Lets look at a real customer example. The customer wants to use the P8
> 128-bit add/sub but also wants to be able to unit test code on existing
> P7 machines. Which results in something like this:
>
> static inline vui32_t
> vec_addcuq (vui32_t a, vui32_t b)
> {
> vui32_t t;
>
> if (__builtin_cpu_supports ("PPC_FEATURE2_HAS_VSX"))
> {
>
> __asm__(
> "vaddcuq %0,%1,%2;"
> : "=v" (t)
> : "v" (a),
> "v" (b)
> : );
> }
> else
> {
> vui32_t c, c2, co;
> vui32_t z= {0,0,0,0};
> __asm__(
> "vaddcuw %3,%4,%5;\n"
> "\tvadduwm %0,%4,%5;\n"
> "\tvsldoi %1,%3,%6,4;\n"
> "\tvaddcuw %2,%0,%1;\n"
> "\tvadduwm %0,%0,%1;\n"
> "\tvor %3,%3,%2;\n"
> "\tvsldoi %1,%2,%6,4;\n"
> "\tvaddcuw %2,%0,%1;\n"
> "\tvadduwm %0,%0,%1;\n"
> "\tvor %3,%3,%2;\n"
> "\tvsldoi %1,%2,%6,4;\n"
> "\tvadduwm %0,%0,%1;\n"
> : "=&v" (t), /* 0 */
> "=&v" (c), /* 1 */
> "=&v" (c2), /* 2 */
> "=&v" (co) /* 3 */
> : "v" (a), /* 4 */
> "v" (b), /* 5 */
> "v" (z) /* 6 */
> : );
> t = co;
> }
> return (t);
> }
>
> So it is clear to me that executing 14+ instruction to decide if I can
> optimize to use new single instruction optimization is not a good deal.
>
No, this is a prime example of why average programmers shouldn't use
hwcap, as it results in moronic code like this.
When you poorly reinvent the wheel you get terrible performance, like
the fallback here. GCC already has 128-bit integers, so tell average
programmers to use them instead of touching features they don't
understand.
As gcc compiles the addition into a pair of addc/adde instructions, the
performance gain is minimal while the code is harder to maintain. Due to
pipelining, a 128-bit addition is only ~0.2 cycles slower than a 64-bit
one in the following example on power8:
int main()
{
unsigned long i;
__int128 u = 0;
//long u = 0;
for (i = 0; i < 1000000000; i++)
u += i * i;
return u >> 35;
}
[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out
real 0m0.957s
user 0m0.956s
sys 0m0.001s
[neleai@gcc2-power8 ~]$ vim uu.c
[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out
real 0m1.040s
user 0m1.039s
sys 0m0.001s
> One instruction (plus the __builtin_cpu_supports which should be and
> immediate, branch conditional) is a better deal. Inlining so the
> compiler can do common sub-expression about larger blocks is an even
> better deal.
>
That doesn't change the fact that it's a mistake. The code above was bad
because it added a check around a single instruction that takes a cycle.
When the difference between implementations is a few cycles, each cycle
matters (otherwise you should just stick to the generic version). Then
the slowdown caused by the hwcap check itself matters, and you should
use an ifunc to eliminate it.
Or hope that the check is moved out of the loop; but when it's a loop
with 100 iterations the __builtin_cpu_supports time becomes immaterial.
> I just do not understand why there is so much resistance to this simple
> platform ABI specific request.