This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- From: Torvald Riegel <triegel at redhat dot com>
- To: munroesj at linux dot vnet dot ibm dot com
- Cc: Richard Henderson <rth at twiddle dot net>, Szabolcs Nagy <szabolcs dot nagy at arm dot com>, Carlos Eduardo Seo <cseo at linux dot vnet dot ibm dot com>, GLIBC Devel <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Tue, 30 Jun 2015 18:01:07 +0200
- Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
- Authentication-results: sourceware.org; auth=none
- References: <55760314 dot 6070601 at linux dot vnet dot ibm dot com> <5576FC80 dot 1090806 at arm dot com> <1433862393 dot 21101 dot 9 dot camel at sjmunroe-ThinkPad-W500> <5591239A dot 9030907 at twiddle dot net> <1435603025 dot 5485 dot 23 dot camel at oc7878010663> <55923C17 dot 6000400 at twiddle dot net> <1435676863 dot 10780 dot 24 dot camel at oc7878010663>
On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > >>>>
> > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > >>>>> This enables users to write versioned code based on the HWCAP bits
> > >>>>> without going through the overhead of reading them from the auxiliary
> > >>>>> vector.
> > >>>
> > >>>> i assume this is for multi-versioning.
> > >>>
> > >>> The intent is for the compiler to implement the equivalent of
> > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction; POWER
> > >>> is RISC, so we use the HWCAP. The trick is to access the HWCAP[2]
> > >>> efficiently, as getauxval and scanning the auxv is too slow for inline
> > >>> optimizations.
> > >>>
> > >>
> > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > >> the variables private to glibc that already contain this information. That
> > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > >>
> > >
> > > Richard, I do not understand how a 38-instruction function accessed via a
> > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > good as" a single in-line load instruction.
> > >
> > > Even with the best-case path for getauxval HWCAP2 we are at 14 instructions
> > > with exposure to 3 different branch mispredicts. And that is before
> > > the application can execute its own __builtin_cpu_supports() test.
> > >
> > > Let's look at a real customer example. The customer wants to use the P8
> > > 128-bit add/sub but also wants to be able to unit test code on existing
> > > P7 machines, which results in something like this:
> > >
> > > static inline vui32_t
> > > vec_addcuq (vui32_t a, vui32_t b)
> > > {
> > > vui32_t t;
> > >
> > > if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
> > > {
> > >
> > > __asm__(
> > > "vaddcuq %0,%1,%2;"
> > > : "=v" (t)
> > > : "v" (a),
> > > "v" (b)
> > > : );
> > ...
> > >
> > > So it is clear to me that executing 14+ instructions to decide whether I
> > > can use a new single-instruction optimization is not a good deal.
> >
> > This is a horrible way to use this builtin. In the same way that using ifunc at
> > this level would also be horrible.
> >
> Yes it is just an example, there are many more that you might find less
> objectionable.
Could you give more examples that give a clearer picture of why you think
the concern that Richard raised isn't valid? Especially this here
regarding 2 vs. 40 cycles:
> > Even supposing that this builtin uses a single load, you've at least doubled
> > the overhead of using the insn. The user really should be aware of this and
> > manually hoist this check much farther up the call chain. At which point the
> > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> >
> > And if the user is really concerned about unit tests, surely ifdefs are more
> > appropriate for this situation. At the moment one can only test the P7 path on
> > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> >
>
> Yes I know there are better alternatives.
>
> This is not intended for use within GLIBC or by knowledgeable folks like
> yourself and the GLIBC community.
>
> This is about application developers in other communities and users, where
> I would settle for them just supporting my platform with any optimization
> that is somewhat sane.
>
> The __builtin_cpu_supports builtin exists and I see its use. I don't see
> much use of the more "complicated" approaches that we use in GLIBC. So it
> seems reasonable to enable __builtin_cpu_supports for POWER but define the
> implementation to be optimal for the PowerPC platform.
>
> Most of the argument against seems to be based on an assumed "moral
> hazard": you think what they are doing is stupid, and so you refuse to
> help them with any mechanism that might make what they are doing, and
> will continue to do, a little less stupid.
I didn't understand Richard's concerns to be about that. Rather, it
seemed to me he's concerned about supporting use cases that only mean
technical debt for us; if there is a much simpler way on the users' side
to do this right, we have to see whether we get a good balance between
technical debt and benefits for some users.
> I appreciate the concern, but I think this is an odd position for a
> community that uses phrases like "Free as in Freedom" to describe what
> they do.
I don't think we promise to do everything for everyone. That does not
conflict with free software.