This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote:
> On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> > On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > >>>>
> > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > >>>>> The proposed patch adds a new feature for powerpc. In order to get
> > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > >>>>> This enables users to write versioned code based on the HWCAP bits
> > >>>>> without going through the overhead of reading them from the auxiliary
> > >>>>> vector.
> > >>>
> > >>>> i assume this is for multi-versioning.
> > >>>
> > >>> The intent is for the compiler to implement the equivalent of
> > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> > >>> is RISC so we use the HWCAP. The trick is to access HWCAP[2]
> > >>> efficiently, as getauxval and scanning the auxv are too slow for inline
> > >>> optimizations.
> > >>>
> > >>
> > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> > >> the variables private to glibc that already contain this information.  That
> > >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> > >>
> > >
> > > Richard, I do not understand how a 38-instruction function accessed via a
> > > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > > good as" a single inline load instruction.
> > >
> > > Even with the best-case path for getauxval HWCAP2 we are at 14
> > > instructions, with exposure to 3 different branch mispredicts. And that is
> > > before the application can execute its own __builtin_cpu_supports() test.
> > >
> > > Let's look at a real customer example. The customer wants to use the P8
> > > 128-bit add/sub but also wants to be able to unit-test code on existing
> > > P7 machines. That results in something like this:
> > >
> > > static inline vui32_t
> > > vec_addcuq (vui32_t a, vui32_t b)
> > > {
> > >   vui32_t t;
> > >
> > >   if (__builtin_cpu_supports ("PPC_FEATURE2_HAS_VSX"))
> > >     {
> > >       __asm__ (
> > >           "vaddcuq %0,%1,%2;"
> > >           : "=v" (t)
> > >           : "v" (a),
> > >             "v" (b)
> > >           : );
> > ...
> > >
> > > So it is clear to me that executing 14+ instructions to decide whether I
> > > can use a new single-instruction optimization is not a good deal.
> > 
> > This is a horrible way to use this builtin, in the same way that using
> > ifunc at this level would be.
> > 
> Yes, it is just an example; there are many more that you might find less
> objectionable.

Could you give more examples that give a clearer picture of why you think
the concern that Richard raised isn't valid?  Especially this part
regarding 2 vs. 40 cycles:

> > Even supposing that this builtin uses a single load, you've at least doubled 
> > the overhead of using the insn. The user really should be aware of this and 
> > manually hoist this check much farther up the call chain.  At which point the 
> > difference between 2 cycles for a load and 40 cycles for a call is immaterial.
> >
> > And if the user is really concerned about unit tests, surely ifdefs are more 
> > appropriate for this situation.  At the moment one can only test the P7 path on 
> > P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> > 
> 
> Yes I know there are better alternatives.
> 
> This is not intended for use within GLIBC or by knowledgeable folks like
> yourself and the GLIBC community. 
> 
> This is about application developers in other communities, and users, for
> whom I would settle for just supporting my platform with any optimization
> that is somewhat sane.
> 
> __builtin_cpu_supports exists and I see its use. I don't see much use
> of the more "complicated" approaches that we use in GLIBC. So it seems
> reasonable to enable __builtin_cpu_supports for POWER but define the
> implementation to be optimal for the PowerPC platform.
> 
> Most of the argument against seems to be based on an assumed "moral
> hazard": you think what they are doing is stupid, and so you refuse to
> help them with any mechanism that might make what they are doing, and
> will continue to do, a little less stupid.

I didn't understand Richard's concerns to be about that.  Rather, it
seemed to me he's concerned about supporting use cases that only mean
technical debt for us; if there is a much simpler way on the users' side
to do this right, we have to see whether we get a good balance between
technical debt and benefits for some users.

> I appreciate the concern, but I think this is an odd position for a
> community that uses phrases like "Free as in Freedom" to describe what
> it does.

I don't think we promise to do everything for everyone.  That does not
conflict with free software.

