This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: HWCAP is method to determine cpu features, not selection mechanism.
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: libc-alpha at sourceware dot org
- Date: Wed, 10 Jun 2015 10:35:38 -0300
- Subject: Re: HWCAP is method to determine cpu features, not selection mechanism.
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <20150609154223 dot GA20028 at domone> <1433865684 dot 21101 dot 20 dot camel at sjmunroe-ThinkPad-W500> <20150610125047 dot GA10861 at domone>
On 10-06-2015 09:50, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
>> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
>>> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
>>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>>
>>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>>> without going through the overhead of reading them from the auxiliary
>>>>>> vector.
>>>>
>>>>> i assume this is for multi-versioning.
>>>>
>>>> The intent is for the compiler to implement the equivalent of
>>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
>>>> is RISC so we use the HWCAP. The trick is to access the HWCAP[2]
>>>> efficiently, as getauxval and scanning the auxv are too slow for inline
>>>> optimizations.
>>>>
>>>>> i dont see how the compiler can generate code to access the
>>>>> hwcap bits currently (without making assumptions about libc
>>>>> interfaces).
>>>>>
>>>> These offsets will become a durable part of the PowerPC 64-bit ELF V2 ABI.
>>>>
>>>> The TCB offsets are already fixed and can not change from release to
>>>> release.
>>>>
>>> I don't have a problem with this, but why do you add TLS? How can different
>>> threads have different values when the kernel could move them between cores?
>>>
>>> So instead we just add the following two variables to the libc API. These would
>>> be initialized by the linker, as we will probably use them internally.
>>>
>>> extern int __hwcap, __hwcap2;
>>>
>> The Power ABIs address the TCB off a dedicated GPR (R2 or R13). This
>> guarantees a one-instruction load from the TCB.
>>
>> A static variable would require an indirect load via the TOC/GOT
>> (which can be megabytes for a large program/library). I really, really
>> want to avoid that.
>>
>> The point is to make fast decisions about which code to execute.
>> STT_GNU_IFUNC is just too complicated for most application programmers
>> to use.
>>
>> Now if the GLIBC community wants to provide a durable API for static
>> access to the HWCAP, I have no problem with that, but it does not solve
>> this problem.
>>
> That's completely false and outright dangerous advice.
>
> First, if ifuncs are too complicated for them to use, they shouldn't
> touch hwcap in the first place. Ifuncs are relatively easy to read if you
> take optimizing for a specific cpu seriously and are aware of the precautions
> you could take.
>
> If you let other programmers touch hwcap you would get a disaster. You
> need to compile each variant separately with the appropriate gcc flags.
> Otherwise, if you just make the decision inline, the compiler is free to insert
> newer instructions into generic code. That could lead to unexpected
> crashes caused just by compiling with a different gcc than the original
> programmer used.
>
> So you need to have a different file for each enabled capability and
> compile these separately. (Or use assembly, but most programmers don't
> qualify.) Or you could try to add pragmas to tell gcc which part of the file
> should be optimized with which optimizations, but that's even worse than
> ifunc.
>
> So you read the hwcap register and need to call a function. That indirection
> already costs you more than the GOT access you tried to save.
I agree that adding an API to modify the current hwcap is not a good
approach. However, the costs you are assuming here are *very* x86 biased,
where you need only one instruction (movl <variable>(%rip), %<destination>)
to load an external variable defined in a shared library, whereas for
powerpc it is more costly:

    extern int foo;

    int bar ()
    {
      return foo;
    }

            .type   bar, @function
    bar:
    0:      addis 2,12,.TOC.-0b@ha
            addi 2,2,.TOC.-0b@l
            .localentry     bar,.-bar
            addis 9,2,.LC0@toc@ha   # gpr load fusion, type long
            ld 9,.LC0@toc@l(9)
            lwa 3,0(9)
            blr

So you need two arithmetic instructions to materialize the TOC, plus
an addis+ld to load the variable's address from the TOC, and then another
load to read the external variable itself (there is an optimization when
the call is local, in which case you do not need to materialize the TOC).
That is *exactly* the cost Steven is trying to avoid.
>
> Also, even if you could handle the previous problems with assembly functions,
> you lose more cycles than you save, as you couldn't compile the file with
> -march=native. The best solution I found would be for distributions to package
> on the gentoo model: have a variant of each package for each cpu that the
> package manager would fetch based on your cpu, and a script on startup that
> checks whether the cpu changed and, if so, relinks all packages to generic
> versions.
>
> That would allow programmers to use #ifdef _HAS_SSE4 for code that's easier
> to maintain.
>
The relink strategy seems reasonable, but the package provider would
still have to build all the pre-compiled objects for each CPU variant.
This is what powerpc distros have usually done for some time: CPU-variant
libc/libm/etc. builds that are selected at runtime using hwcap. And the
ifunc idea is exactly to avoid such per-CPU DSO variants.
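
As a rough illustration of runtime selection inside a single DSO, here is a
minimal sketch that uses a plain function pointer as a stand-in for
STT_GNU_IFUNC (a real ifunc is resolved by the dynamic linker at relocation
time, which this does not reproduce). HYPOTHETICAL_FEATURE_BIT, count_generic,
count_tuned, select_count and count_bytes are names made up for the example,
and in practice the tuned variant would live in its own file built with
CPU-specific flags:

```c
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (Linux, glibc 2.16 or later) */

/* Illustrative mask only; not a real PPC_FEATURE_* constant. */
#define HYPOTHETICAL_FEATURE_BIT (1UL << 0)

typedef size_t (*count_fn)(const unsigned char *, size_t, unsigned char);

/* Baseline variant: safe on every CPU. */
static size_t count_generic(const unsigned char *p, size_t n, unsigned char c)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (p[i] == c);
    return hits;
}

/* Stand-in for a variant compiled separately with CPU-specific flags.
   Here it is the same scalar loop, so both variants return equal results. */
static size_t count_tuned(const unsigned char *p, size_t n, unsigned char c)
{
    size_t hits = 0;
    while (n--)
        hits += (*p++ == c);
    return hits;
}

/* Pick a variant from the hwcap bits, as an ifunc resolver would. */
count_fn select_count(unsigned long hwcap)
{
    return (hwcap & HYPOTHETICAL_FEATURE_BIT) ? count_tuned : count_generic;
}

static count_fn count_impl;   /* chosen on first call; not thread-safe */

size_t count_bytes(const unsigned char *p, size_t n, unsigned char c)
{
    if (count_impl == NULL)
        count_impl = select_count(getauxval(AT_HWCAP));
    return count_impl(p, n, c);
}
```

Callers only ever see count_bytes(); the selection is made once per process,
so a single build can carry both variants.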
> Finally, while Florian's solution works, your argument is suspect. First, it
> costs tls, so it needs to be frequently used. That makes the address always
> be in L1 cache, which makes the GOT size irrelevant. And if you have problems
> with hwcap not being in cache, duplicating it ten times if you have ten
> threads would make the situation worse, not better.
Again, you are being x86 biased: the idea is a tradeoff between the hwcap
storage cost for each thread and its access speed using TLS. Steve is arguing
that he would rather pay the per-thread storage to get the lower latency.
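
To make the tradeoff concrete, here is a minimal C sketch of the two storage
choices, assuming Linux with glibc's getauxval(). It is not the proposed
patch, which keeps the bits at a fixed TCB offset filled in by the loader;
a compiler-level __thread variable only approximates that. All variable and
function names here are hypothetical:

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (Linux, glibc 2.16 or later) */

/* Hypothetical names; none of these are glibc symbols. */
static unsigned long global_hwcap;         /* one copy per process */
static int global_hwcap_ready;
static __thread unsigned long tls_hwcap;   /* one copy per thread  */
static __thread int tls_hwcap_ready;

/* Process-wide copy: no per-thread storage, but on powerpc the address
   is normally reached through the TOC. The lazy fill is unsynchronized;
   a real implementation would initialize it at startup. */
unsigned long hwcap_global(void)
{
    if (!global_hwcap_ready) {
        global_hwcap = getauxval(AT_HWCAP);
        global_hwcap_ready = 1;
    }
    return global_hwcap;
}

/* Per-thread copy: costs one word per thread, addressed relative to the
   thread pointer. Each thread fills its own copy on first use. */
unsigned long hwcap_per_thread(void)
{
    if (!tls_hwcap_ready) {
        tls_hwcap = getauxval(AT_HWCAP);
        tls_hwcap_ready = 1;
    }
    return tls_hwcap;
}

/* Returns 1 when every bit in mask is set in this thread's hwcap copy. */
int thread_has_features(unsigned long mask)
{
    return (hwcap_per_thread() & mask) == mask;
}
```

Both copies hold the same value, since the hwcap bits are per-process; the
only difference is how the load is addressed and how much memory it costs.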