This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: HWCAP is method to determine cpu features, not selection mechanism.



On 10-06-2015 09:50, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
>> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
>>> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
>>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>>
>>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>>> without going through the overhead of reading them from the auxiliary
>>>>>> vector.
>>>>
>>>>> i assume this is for multi-versioning.
>>>>
>>>> The intent is for the compiler to implement the equivalent of
>>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction; POWER
>>>> is RISC, so we use the HWCAP. The trick is to access the HWCAP[2]
>>>> efficiently, as getauxval and scanning the auxv are too slow for inline
>>>> optimizations.
>>>>
>>>>> i dont see how the compiler can generate code to access the
>>>>> hwcap bits currently (without making assumptions about libc
>>>>> interfaces).
>>>>>
>>>> These offsets will become a durable part of the PowerPC 64-bit ELF V2 ABI.
>>>>
>>>> The TCB offsets are already fixed and can not change from release to
>>>> release.
>>>>
>>> I don't have a problem with this, but why do you add TLS? How can different
>>> threads have different values when the kernel could move them between cores?
>>>
>>> So instead we could just add the following two variables to the libc API. These
>>> would be initialized by the linker, as we will probably use them internally.
>>>
>>> extern int __hwcap, __hwcap2;
>>>
>> The Power ABIs address the TCB off a dedicated GPR (R2 or R13). This
>> guarantees a one-instruction load from the TCB.
>>
>> A static variable would require an indirect load via the TOC/GOT
>> (which can be megabytes for a large program/library). I really, really
>> want to avoid that.
>>
>> The point is to make fast decisions about which code to execute.
>> STT_GNU_IFUNC is just too complicated for most application programmers
>> to use.
>>
>> Now if the GLIBC community wants to provide a durable API for static
>> access to the HWCAP, I have no problem with that, but it does not solve
>> this problem.
>>
> That's completely false and outright dangerous advice.
> 
> First, if ifuncs are too complicated for them to use, they shouldn't
> touch hwcap in the first place. Ifuncs are relatively easy to read if you
> take optimizing for a specific cpu seriously and are aware of the precautions
> you could take.
> 
> If you let other programmers touch hwcap you will get a disaster. You
> need to compile each variant separately with the appropriate gcc flags.
> Otherwise, if you just make the decision inline, the compiler is free to insert
> newer instructions into generic code. That could lead to unexpected
> crashes caused just by compiling with a different gcc than the original
> programmer used.
> 
> So you need to have a different file for each enabled capability and
> compile these separately. (Or use assembly, but most programmers don't
> qualify.) Or you could try to add pragmas to tell gcc which part of the file
> should be optimized with which optimizations, but that's even worse than
> ifunc.
> 
> So you read the hwcap register and need to call a function. That indirection
> already costs you more than the GOT access you tried to save.

I agree that adding an API to modify the current hwcap is not a good
approach. However, the costs you are assuming here are *very* x86-biased:
there you need only one instruction (movl <variable>(%rip), %<destination>)
to load an external variable defined in a shared library, whereas on
powerpc it is more costly:

extern int foo;

int bar ()
{
  return foo;
}

	.type	bar, @function
bar:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry	bar,.-bar
	addis 9,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 9,.LC0@toc@l(9)
	lwa 3,0(9)
	blr


So you need two arithmetic instructions to materialize the TOC pointer, plus
an addis+ld to load the variable's address from the TOC, and then another
load to read the external variable itself (there is an optimization when the
call is local, in which case you do not need to materialize the TOC). That is
*exactly* the cost Steven is trying to avoid.
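
For reference, here is a minimal sketch of the kind of inline check under
discussion, written against what glibc offers today: getauxval(AT_HWCAP)
read once into a process-wide variable. The names cached_hwcap,
init_cached_hwcap and hwcap_has are hypothetical and not part of any glibc
API, and the fallback definition of PPC_FEATURE_HAS_ALTIVEC is only there so
the sketch builds on other targets, where that bit means something else.

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (glibc 2.16 or later) */

/* Assumption: value of PPC_FEATURE_HAS_ALTIVEC as in the powerpc kernel
   headers; defined here only so the sketch also builds elsewhere.  */
#ifndef PPC_FEATURE_HAS_ALTIVEC
# define PPC_FEATURE_HAS_ALTIVEC 0x10000000UL
#endif

/* Hypothetical process-wide cache; not an existing glibc symbol.  */
unsigned long cached_hwcap;

/* Fill the cache once, for example early in main.  */
void
init_cached_hwcap (void)
{
  cached_hwcap = getauxval (AT_HWCAP);
}

/* The inline test: one load of the cached word plus a mask.  */
static inline int
hwcap_has (unsigned long mask)
{
  return (cached_hwcap & mask) != 0;
}
```

When such a variable lives in another DSO, the load inside hwcap_has is
where the TOC-indirect sequence shown above would be emitted on powerpc64.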

> 
> Also, even if you could handle the previous problems with assembly functions,
> you lose more cycles than you save, as you couldn't compile the file with
> -march=native. The best solution I found would be for distributions to adopt
> the gentoo model: have a variant of each package for each cpu, which the
> package manager would fetch based on your cpu, and a script on startup that
> checks if the cpu changed and, if so, relinks all packages to generic
> versions.
> 
> That would allow programmers to use #ifdef _HAS_SSE4 for code that's easier
> to maintain.
> 

The relink strategy seems reasonable, but the package provider would still
have to build all the pre-compiled objects for each CPU variant. This is what
powerpc distros have usually done for some time: they ship CPU-variant builds
of libc/libm/etc. that are selected at runtime using hwcap. The whole idea of
ifunc is precisely to avoid such per-CPU DSO variants.
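
To illustrate the idea of keeping the variants inside one DSO, here is a
rough stand-in that selects an implementation through a plain function
pointer. It is not STT_GNU_IFUNC itself, only the same select-once pattern
written in portable C. All names (sum_generic, sum_altivec, select_sum) are
hypothetical, and sum_altivec is ordinary C standing in for a variant that
would really be built in its own file with the appropriate flags.

```c
#include <stddef.h>

/* Assumption: value of PPC_FEATURE_HAS_ALTIVEC as in the powerpc kernel
   headers; defined here only so the sketch also builds elsewhere.  */
#ifndef PPC_FEATURE_HAS_ALTIVEC
# define PPC_FEATURE_HAS_ALTIVEC 0x10000000UL
#endif

typedef long (*sum_fn) (const int *, size_t);

/* Baseline variant, safe on every CPU.  */
long
sum_generic (const int *v, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += v[i];
  return s;
}

/* Stand-in for the CPU-specific variant.  In a real build this would be
   compiled separately (e.g. with -maltivec); here it is plain C that
   computes the same result.  */
long
sum_altivec (const int *v, size_t n)
{
  long s0 = 0, s1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s0 += v[i];
      s1 += v[i + 1];
    }
  if (i < n)
    s0 += v[i];
  return s0 + s1;
}

/* Pick a variant from the hwcap word; meant to be called once.  */
sum_fn
select_sum (unsigned long hwcap)
{
  return (hwcap & PPC_FEATURE_HAS_ALTIVEC) ? sum_altivec : sum_generic;
}
```

A caller would store the result of select_sum once and call through the
stored pointer afterwards; with ifunc the dynamic linker does that selection
at relocation time instead.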

> Finally, while Florian's solution works, your argument is suspect. First, it
> costs TLS, so it needs to be frequently used. That makes the address always
> be in L1 cache, which makes GOT size irrelevant. And if you have problems
> with hwcap not being in cache, duplicating it ten times if you have ten
> threads would make the situation worse, not better.

Again, you are being x86-biased: the idea is a tradeoff between the space a
per-thread copy of hwcap takes and its access speed using TLS. Steve is
arguing that he prefers the lower latency.

