This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: HWCAP is method to determine cpu features, not selection mechanism.
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>, Steven Munroe <munroesj at linux dot vnet dot ibm dot com>
- Cc: Szabolcs Nagy <szabolcs dot nagy at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Wed, 10 Jun 2015 19:09:11 -0300
- Subject: Re: HWCAP is method to determine cpu features, not selection mechanism.
- References: <20150609154223 dot GA20028 at domone> <1433865684 dot 21101 dot 20 dot camel at sjmunroe-ThinkPad-W500> <20150610125047 dot GA10861 at domone> <55783D2A dot 8050703 at linaro dot org> <557846D9 dot 3060909 at arm dot com> <55784802 dot 8070605 at linaro dot org> <20150610150944 dot GA11504 at domone> <5578567C dot 5020504 at linaro dot org> <20150610155354 dot GA12820 at domone> <1433962707 dot 25475 dot 92 dot camel at sjmunroe-ThinkPad-W500> <20150610205604 dot GB11504 at domone>
On 10-06-2015 17:56, Ondřej Bílka wrote:
> On Wed, Jun 10, 2015 at 01:58:27PM -0500, Steven Munroe wrote:
>> On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote:
>>> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 10-06-2015 12:09, Ondřej Bílka wrote:
>>>>> On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
>>>>>>
>>>>>>
>>>>>> On 10-06-2015 11:16, Szabolcs Nagy wrote:
>>>>>>> On 10/06/15 14:35, Adhemerval Zanella wrote:
>>>>>>>> I agree that adding an API to modify the current hwcap is not a good
>>>>>>>> approach. However, the costs you are assuming here are *very* x86-biased:
>>>>>>>> on x86 you need only one instruction (movl <variable>(%rip), %<destination>)
>>>>>>>> to load an external variable defined in a shared library, whereas on
>>>>>>>> powerpc it is more costly:
>>>>>>>
>>>>>>> Debian Code Search found 4 references to __builtin_cpu_supports;
>>>>>>> all seem to avoid using it repeatedly.
>>>>>>>
>>>>>>> multiversioning dispatch only happens at startup (for a small
>>>>>>> number of functions according to existing practice).
>>>>>>>
>>>>>>> so why is hwcap expected to be used in hot loops?
>>>>>>>
>>>>>>
>>> [snip]
>>>> And my understanding is that the goal is to optimize hwcap access to provide
>>>> a 'better' way to enable '__builtin_cpu_supports' for powerpc. IFUNC is another
>>>> way to provide function selection, but that does not change the fact that
>>>> accessing hwcap through TLS is *faster* than the current options. It is up to
>>>> the developer to decide whether to use IFUNC or __builtin_cpu_supports. Whether
>>>> or not they use it in hot loops, it is up to them to profile and switch to
>>>> another approach if needed.
>>>>
>>>> You can say the same about the current x86 __builtin_cpu_supports support: you
>>>> should not use it in loops, you should use ifunc, whatever.
>>>
>>> Sorry, but no again. We are talking here about the difference between variable
>>> access and TCB access. You forgot to count the total cost. That includes
>>> the per-thread overhead of initializing hwcap, increased per-thread memory
>>> usage, maintenance burden, and increased cache misses.
>>> If you access hwcap only rarely, as you should, then per-thread copies
>>> would introduce cache misses that are more costly than the GOT overhead. In the
>>> GOT case these misses could be avoided, because all threads together would
>>> access the single shared copy more often, keeping it in cache.
>>>
>> Actually Adhemerval does have the knowledge, background, and experience
>> to understand this difference and accurately assess the trade-offs.
>>
> While he may have the background, he didn't cover the drawbacks. So I needed
> to point them out to start a cost-benefit discussion instead of looking
> at them through rose-tinted glasses.
>
What I did was point out that your earlier analysis of instruction latency
was x86-biased and did not hold for the powerpc TOC cost model. I was *not*
advocating anything more, nor saying that hwcap in the TCB is the best
approach.
I do see the points you raised as valid, but IMHO this kind of discussion
will go on without end, mainly because it is based on assumptions and
trade-offs.
Now, my opinion is that powerpc should implement __builtin_cpu_supports
similarly to x86, by adding it to libgcc and using initial-exec TLS variables.
It will create 2 dynamic relocations (R_PPC64_TPREL16_HI and R_PPC64_TPREL16_LO),
but the access will require only 2 arithmetic instructions and 1 load. It will
decouple the implementation from GLIBC and will not require any more TCB fields.