This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB



On 30-06-2015 18:15, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 30-06-2015 00:14, Ondřej Bílka wrote:
>>> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>  > 
>>> If you still believe that it somehow does multiplication just try this
>>> and see that result is all zeroes.
>>>
>>>    __vector uint32_t x={3,2,0,3},y={0,0,0,0};
>>>    y = vec_addcuq(x,x);
>>>    printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
>>>
>>> Again your patronizing tone only shows your lack of knowledge of powerpc
>>> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/
>>
>> Seriously, you need to start admitting your lack of knowledge in PowerISA
>> (I am meant addition instead of multiplication, my mistake).  And repeating
>> myself to prove a point only makes you childish, I am not competing with
>> you.
>>
> It sounds exactly as silly as your critique, which was based on a lie. Now
> you are saying: oops, my mistake. But I was right. The way to see whether
> one is right or wrong is to present evidence. So what is yours?

I really do not want to go further down this path, so I will just drop it.


> 
>>>
>>>
>>> I did make the mistake of reading a bit too fast and seeing only an add
>>> instead of the instruction to get the carry. Still, with GPRs that is
>>> two additions with carry, then adding zero with carry to set the
>>> desired bit.
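The carry chain described above can be sketched in portable C; this is the pattern gcc lowers to the addc/adde pair on powerpc. The struct and names here are illustrative, not taken from the patch:

```c
#include <stdint.h>

/* A 128-bit value split across two GPR-sized words, as gcc keeps it. */
struct u128 { uint64_t lo, hi; };

static struct u128 add_u128(struct u128 a, struct u128 b)
{
    struct u128 r;
    r.lo = a.lo + b.lo;              /* addc: add low words, record carry */
    uint64_t carry = (r.lo < a.lo);  /* carry out of the low-word add */
    r.hi = a.hi + b.hi + carry;      /* adde: add high words plus carry-in */
    return r;
}
```
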
>>>
>>>> It has nothing to do
>>>> with uint128_t support in GCC; only recently did GCC add support for
>>>> such builtins [1]. And although there is a plan to add support for
>>>> using vector instructions for uint128_t, right now it is done in GPR
>>>> registers on powerpc.
>>>>
>>> The customer just wants to do 128-bit additions. If the fastest way
>>> is with GPR registers then he should use GPR registers.
>>>
>>> My claim was that this leads to slow code on power7. The fallback above
>>> takes 14 cycles on power8, and a 128-bit addition is similarly slow.
>>>
>>> Yes, you could craft expressions that exploit vectors by doing ands/ors
>>> with 128-bit constants, but if you mostly need to sum integers and use
>>> 128 bits to prevent overflows then GPR is the correct choice due to the
>>> transfer cost.
>>
>> Again this is something, as Steve has pointed out, you only assume without
>> knowing the subject in depth: it is operating on *vector* registers and
>> thus it would be more costly to move to GPRs and back than to just do it
>> in VSX registers.  And as Steve has pointed out, the idea is to *validate*
>> on POWER7.
> 
> If that is really the case then using hwcap for that makes absolutely no
> sense. Just surround these builtins with #ifdef TESTING and you will
> compile a power7 binary. When you release the production version you will
> optimize it for power8. The difference from just using the correct -mcpu
> could dominate the speedups that you try to get with these builtins.
> Slowing down a production application for validation support makes no
> sense.

That is a valid point, but, as Steve has pointed out, the idea is exactly
to avoid multiple builds.
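A minimal sketch of the single-macro idea being debated, assuming a hypothetical TESTING define and function name (neither is from the actual patch):

```c
/* Validation builds define TESTING and take the plain __int128 path,
   which runs on POWER7; a production build would take the vector path.
   Both branches compute the same sum here so the sketch stays portable. */
static unsigned long long sum_squares(unsigned long n)
{
#ifdef TESTING
    unsigned __int128 u = 0;   /* lowered to a GPR addc/adde pair on powerpc */
    for (unsigned long i = 0; i < n; i++)
        u += (unsigned __int128) i * i;
    return (unsigned long long) u;
#else
    /* In real code this branch would use the POWER8 vector builtins
       (e.g. vec_adduqm); the same arithmetic stands in here so the
       example compiles anywhere. */
    unsigned __int128 u = 0;
    for (unsigned long i = 0; i < n; i++)
        u += (unsigned __int128) i * i;
    return (unsigned long long) u;
#endif
}
```

Built with -DTESTING the binary never touches vector instructions, so it can be validated on POWER7 with no runtime hwcap check at all, at the cost of needing two builds.
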

> 
> 
> Also, you did not answer my question; it works both ways.
> From the fact that his example uses vector registers it does not follow
> that the application should use vector registers. If the user does
> something like in my example, the cost of the gpr -> vector conversion
> will harm performance and he should keep these in GPRs.

And again you make assumptions about things you do not know: what if the
program is written with vectors in mind and they want to process the data
as uint128_t where needed?  You do not know the program's constraints
either, so assuming that it would be better to use GPRs may not hold true.

> 
>>>
>>>> Also, it is up to developers to select the best way to use the CPU
>>>> features.  Although I am not very fond of providing the hwcap in the TCB
>>>> (my suggestion was to use a local __thread in libgcc instead), the idea
>>>> here is to provide *tools*.
>>>>
>>> If you want to provide tools then you should try to make the best tool
>>> possible instead of being satisfied with a tool that poorly fits the
>>> job and is dangerous to use.
>>>
>>> I have been saying all along that there are better alternatives where
>>> this does not matter.
>>>
>>> One example would be to write a gcc pass that runs after early inlining
>>> to find all functions containing __builtin_cpu_supports, clone them to
>>> replace it with a constant, and add an ifunc to automatically select
>>> the variant.
>>
>> Using internal PLT calls for such a mechanism is really not the way to
>> handle performance on powerpc.
>>
> No, you are wrong again. I wrote to introduce the ifunc after inlining.
> You do inlining to eliminate call overhead, so after inlining the effect
> of adding a PLT call is minimal; otherwise gcc should have inlined it to
> improve performance in the first place.

That only holds if you have the function definition available for
inlining, which might not be true here: the code could be in a shared
library.
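What the proposed pass would generate can be imitated by hand with a resolver that picks a clone once; cpu_has_vector_quad() below is a made-up stand-in for __builtin_cpu_supports() testing a HWCAP2 bit, and the implementation names are invented for the sketch:

```c
/* Hypothetical feature test; on powerpc this would check a HWCAP2 bit,
   which is what __builtin_cpu_supports() boils down to. */
static int cpu_has_vector_quad(void) { return 0; }   /* pretend POWER7 */

static const char *sum_impl_vsx(void) { return "vsx"; }
static const char *sum_impl_gpr(void) { return "gpr"; }

/* Resolver: runs once to select an implementation.  With gcc's
   __attribute__((ifunc("resolve_sum_impl"))) the dynamic linker would
   call this at relocation time, so the feature bit is never re-tested
   per call and no TCB load is needed on the hot path. */
static const char *(*resolve_sum_impl(void))(void)
{
    return cpu_has_vector_quad() ? sum_impl_vsx : sum_impl_gpr;
}
```

The trade-off debated above is exactly this: the ifunc route pays one indirect (PLT-style) call per invocation, while the TCB route pays a load and branch inside the function body.
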

> 
> Also, why are you so sure that the code is in the main binary and not in
> a shared library?
> 
>>>
>>> You would also need to keep a list of existing processor features to
>>> remove nonexistent combinations. That is the easiest way to avoid a
>>> combinatorial explosion.
>>>
>>>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>>>>
>>>>>
>>>>> As gcc compiles the addition into a pair of addc, adde instructions,
>>>>> the performance gain is minimal while the code is harder to maintain.
>>>>> Due to pipelining, a 128-bit addition is just ~0.2 cycles slower than
>>>>> a 64-bit one in the following example on power8.
>>>>>
>>>>>
>>>>> int main()
>>>>> {
>>>>>   unsigned long i;
>>>>>   __int128 u = 0;
>>>>> //long u = 0;
>>>>>   for (i = 0; i < 1000000000; i++)
>>>>>     u += i * i;
>>>>>   return u >> 35;
>>>>> }
>>>>>
>>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>>>
>>>>> real	0m0.957s
>>>>> user	0m0.956s
>>>>> sys	0m0.001s
>>>>>
>>>>> [neleai@gcc2-power8 ~]$ vim uu.c 
>>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3
>>>>> [neleai@gcc2-power8 ~]$ time ./a.out 
>>>>>
>>>>> real	0m1.040s
>>>>> user	0m1.039s
>>>>> sys	0m0.001s
>>>>
>>>> This is because the code is not using any vector instructions, which
>>>> is the aim of the code snippet Steve has posted.
>>>
>>> Wait, do you want to have fast code or just to show off your elite
>>> skills with vector registers?
>>
>> What does it have to do with vectors? I am just saying that in split-core
>> mode the CPU group dispatches are statically allocated across the eight
>> threads and thus the pipeline gains are lower.  And indeed it was not the
>> case for the example (I rushed without doing the math, my mistake again).
>>
> And are you telling me that contested threads would be a problem the
> majority of the time? Do you have statistics on how often that happens?
> 
> Then I would be more worried about the vector implementation than the GPR
> one. It goes both ways. A slowdown in GPR code is relatively unlikely for
> simple economic reasons: as additions, shifts and so on are frequent
> instructions, one of the best performance/silicon tradeoffs is to add
> more execution units for them until a slowdown becomes unlikely. On the
> other hand, for rarely used instructions that does not make sense, so I
> would not be much surprised if, when all threads did 128-bit vector
> additions, it got slow as they contest the single execution unit that
> can do that.

Seriously, split-core is not really about contested threads, but rather a
way to configure the core specially for KVM mode.  But we digress here,
since the idea is not to analyse whether Steve's code snippet is faster or
better, but rather whether hwcap access through the TCB is the better way
to implement such a compiler builtin.

> 
>>>
>>> A vector 128-bit addition is a lot slower on power7 than a 128-bit
>>> addition in GPRs. This is a valid use case when I produce 64-bit
>>> integers and want to compute their sum in a 128-bit variable. You could
>>> construct lots of use cases where GPRs win, for example summing an
>>> array (possibly with an arithmetic expression applied).
>>>
>>> Unless you show real-world examples, how could you prove that vector
>>> registers are the better choice?
>>
>> Who said they are better? As Steve has pointed out, *you* assume it; the
>> idea afaik is only to be able to *validate* the code on a POWER7 machine.
>>
>> Anyway, I will conclude again because I am not in the mood to get back
>> to this subject (you can be the big boy and have the final word).
>> I tend to see that the TCB is not the way to accomplish it, but not for
>> performance reasons.  My main issue is tying the compiler code-generation
>> ABI to the runtime in a way that should be avoided (for instance by
>> implementing it in libgcc).  And your performance analysis mostly does
>> not hold true for powerpc.
>>
> You could repeat it but could you prove it?

Again, I do not want to go down this path ...

