This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: HWCAP is method to determine cpu features, not selection mechanism.

From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
To: libc-alpha at sourceware dot org
Date: Wed, 10 Jun 2015 10:35:38 -0300
Subject: Re: HWCAP is method to determine cpu features, not selection mechanism.
Authentication-results: sourceware.org; auth=none
References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <20150609154223 dot GA20028 at domone> <1433865684 dot 21101 dot 20 dot camel at sjmunroe-ThinkPad-W500> <20150610125047 dot GA10861 at domone>


On 10-06-2015 09:50, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
>> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
>>> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
>>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>>
>>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>>> without going through the overhead of reading them from the auxiliary
>>>>>> vector.
>>>>
>>>>> i assume this is for multi-versioning.
>>>>
>>>> The intent is for the compiler to implement the equivalent of
>>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction; POWER
>>>> is RISC, so we use the HWCAP. The trick is to access the HWCAP[2]
>>>> efficiently, as getauxval and scanning the auxv are too slow for inline
>>>> optimizations.
>>>>
>>>>> i dont see how the compiler can generate code to access the
>>>>> hwcap bits currently (without making assumptions about libc
>>>>> interfaces).
>>>>>
>>>> These offsets will become a durable part of the PowerPC 64-bit ELF V2 ABI.
>>>>
>>>> The TCB offsets are already fixed and can not change from release to
>>>> release.
>>>>
>>> I don't have a problem with this, but why do you add TLS? How can different
>>> threads have different ones when the kernel could move them between cores?
>>>
>>> So instead we just add the following two variables to the libc API. These
>>> would be initialized by the linker, as we will probably use them internally.
>>>
>>> extern int __hwcap, __hwcap2;
>>>
>> The Power ABIs address the TCB off a dedicated GPR (R2 or R13). This
>> guarantees a one-instruction load from the TCB.
>>
>> A static variable would require an indirect load via the TOC/GOT
>> (which can be megabytes for a large program/library). I really, really
>> want to avoid that.
>>
>> The point is to make fast decisions about which code to execute.
>> STT_GNU_IFUNC is just too complicated for most application programmers
>> to use.
>>
>> Now if the GLIBC community wants to provide a durable API for static
>> access to the HWCAP, I have no problem with that, but it does not solve
>> this problem.
>>
> That's completely false and outright dangerous advice.
> 
> First, if ifuncs are too complicated for them to use, they shouldn't
> touch hwcap in the first place. Ifuncs are relatively easy to read if you
> take optimizing for a specific cpu seriously and are aware of the
> precautions you could take.
> 
> If you let other programmers touch hwcap you will get a disaster. You
> need to compile each variant separately with the appropriate gcc flags.
> Otherwise, if you just make the decision inline, the compiler is free to
> insert newer instructions into generic code. That could lead to unexpected
> crashes caused just by compiling with a different gcc than the original
> programmer used.
> 
> So you need to have a different file for each enabled capability and
> compile these separately. (Or use assembly, but most programmers don't
> qualify.) Or you could try to add pragmas to tell gcc which part of the file
> should be optimized with which optimizations, but that's even worse than
> ifunc.
> 
> So you read the hwcap register and need to call a function. That indirection
> already costs you more than the GOT access you tried to save. 

I agree that adding an API to modify the current hwcap is not a good
approach. However, the costs you are assuming here are *very* x86-biased:
there you need only one instruction (movl <variable>(%rip), %<destination>)
to load an external variable defined in a shared library, whereas on
powerpc it is more costly:

extern int foo;

int bar ()
{
  return foo;
}

	.type	bar, @function
bar:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry	bar,.-bar
	addis 9,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 9,.LC0@toc@l(9)
	lwa 3,0(9)
	blr


So you need two arithmetic instructions to materialize the TOC pointer,
plus an addis+ld pair to load the variable's address from the TOC, and
then another load to read the external variable itself (there is an
optimization when the call is local, in which case you do not need to
materialize the TOC). That is *exactly* the cost Steven is trying to avoid.
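
For reference, the kind of inline check under discussion could look roughly
like the C sketch below. It is only an illustration under stated assumptions:
the demo_* names and DEMO_FEATURE_MASK are hypothetical, and the hwcap copy
is kept in an ordinary global variable filled from getauxval(AT_HWCAP), so
every test pays for the variable load whose cost is described above.

```c
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP: Linux, glibc 2.16 or later */

/* Hypothetical feature mask, for illustration only. Real powerpc code
   would test the PPC_FEATURE_* bits from the kernel headers. */
#define DEMO_FEATURE_MASK 0x1UL

/* Process-wide copy of AT_HWCAP. If this lived in a shared library,
   reading it from other code would go through the TOC/GOT. */
unsigned long demo_hwcap;

/* Fill the copy once, e.g. early in main() before any threads start. */
void demo_hwcap_init(void)
{
    demo_hwcap = getauxval(AT_HWCAP);
}

/* The inline test itself: one load of demo_hwcap plus a mask compare. */
static inline int demo_has_feature(unsigned long mask)
{
    return (demo_hwcap & mask) == mask;
}

/* The caller picks a code path with a plain branch. Both arms run the
   same generic loop here; a tuned variant would replace the first arm. */
long demo_sum(const int *v, int n)
{
    long total = 0;
    if (demo_has_feature(DEMO_FEATURE_MASK)) {
        for (int i = 0; i < n; i++)   /* placeholder for a tuned path */
            total += v[i];
    } else {
        for (int i = 0; i < n; i++)   /* generic path */
            total += v[i];
    }
    return total;
}
```

With the bits in the TCB instead, the load inside demo_has_feature() would
become a single load off the thread pointer register, which is the saving
the patch is after.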

> 
> Also, even if you could handle the previous problems with assembly functions,
> you lose more cycles than you save, as you couldn't compile the file with
> -march=native. The best solution I found would be for distributions to
> package on the gentoo model: have a variant of each package for each cpu,
> which the package manager would fetch based on your cpu, and a script on
> startup that checks whether the cpu changed and, if so, relinks all
> packages to generic versions.
> 
> That would allow programmers to use #ifdef _HAS_SSE4 for code that's easier
> to maintain.
> 

The relink strategy seems reasonable, but the package provider would
still have to build all the pre-compiled objects for each CPU variant.
This is what powerpc distros have usually done for some time: CPU-variant
libc/libm/etc. that are selected at runtime using hwcap. And the whole
idea of ifunc is to avoid such per-CPU DSO variants.
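
To make the contrast concrete, here is a rough sketch of keeping several
variants inside one object and binding to one of them at runtime. Note that
this is not ifunc itself: it uses a plain function pointer resolved once,
which is a portable approximation of what an ifunc resolver does at
relocation time. All names (count_zero*, demo_resolve, demo_bind,
DEMO_FEATURE_FAST) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical capability bit, for illustration only. */
#define DEMO_FEATURE_FAST 0x1UL

typedef size_t (*count_fn)(const unsigned char *, size_t);

/* Generic variant: count zero bytes one at a time. */
static size_t count_zero_generic(const unsigned char *p, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (p[i] == 0);
    return c;
}

/* Stand-in for a CPU-specific variant. In a real build this would live
   in its own file compiled with the matching -mcpu/-march flags. */
static size_t count_zero_fast(const unsigned char *p, size_t n)
{
    return count_zero_generic(p, n);
}

/* Resolver: maps capability bits to an implementation. */
count_fn demo_resolve(unsigned long hwcap)
{
    return (hwcap & DEMO_FEATURE_FAST) ? count_zero_fast
                                       : count_zero_generic;
}

/* Bound implementation; must be set with demo_bind() before first use. */
static count_fn count_zero_impl;

/* Bind once at startup; ifunc would do this during relocation instead. */
void demo_bind(unsigned long hwcap)
{
    count_zero_impl = demo_resolve(hwcap);
}

/* Public entry point: a single indirect call, no per-call hwcap test. */
size_t count_zero(const unsigned char *p, size_t n)
{
    return count_zero_impl(p, n);
}
```

The hwcap is consulted once in demo_bind(); after that, every call costs
one indirect branch regardless of how many variants exist.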

> Finally, while Florian's solution works, your argument is suspect. First, it
> costs tls, so it needs to be frequently used. That makes the address always
> be in L1 cache, which makes GOT size irrelevant. And if you have problems
> with hwcap not being in cache, duplicating it ten times if you have ten
> threads would make the situation worse, not better.

Again, you are being x86-biased: the idea is a tradeoff between the space
the hwcap takes in each thread and the speed of accessing it using TLS.
Steven is arguing that he prefers the lower latency.

Follow-Ups:
- Re: HWCAP is method to determine cpu features, not selection mechanism.
  - From: Szabolcs Nagy

References:
- [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: Carlos Eduardo Seo
- Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: Szabolcs Nagy
- Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: Steven Munroe
- Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: Ondřej Bílka
- Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
  - From: Steven Munroe
- HWCAP is method to determine cpu features, not selection mechanism.
  - From: Ondřej Bílka
