This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> >> On 06/09/2015 04:06 PM, Steven Munroe wrote:
> >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> >>>>
> >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> >>>>> The proposed patch adds a new feature for powerpc. In order to get
> >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> >>>>> This enables users to write versioned code based on the HWCAP bits
> >>>>> without going through the overhead of reading them from the auxiliary
> >>>>> vector.
> >>>
> >>>> i assume this is for multi-versioning.
> >>>
> >>> The intent is for the compiler to implement the equivalent of
> >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> >>> efficiently as getauxv and scanning the auxv is too slow for inline
> >>> optimizations.
> >>>
> >>
> >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads
> >> the variables private to glibc that already contain this information.  That
> >> ought to be fast enough for the builtin, rather than consuming space in the TCB.
> >>
> >
> > Richard I do not understand how a 38 instruction function accessed via a
> > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > good as" a single in-line load instruction.
> >
> > Even with best case path for getauxval HWCAP2 we are at 14 instructions
> > with exposure to 3 different branch miss predicts. And that is before
> > the application can execute its own __builtin_cpu_supports() test.
> >
> > Lets look at a real customer example. The customer wants to use the P8
> > 128-bit add/sub but also wants to be able to unit test code on existing
> > P7 machines. Which results in something like this:
> >
> > static inline vui32_t
> > vec_addcuq (vui32_t a, vui32_t b)
> > {
> >          vui32_t t;
> >
> >                  if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSXâ))
> >                  {
> >
> >                          __asm__(
> >                              "vaddcuq %0,%1,%2;"
> >                              : "=v" (t)
> >                              : "v" (a),
> >                                "v" (b)
> >                              : );
> ...
> >
> > So it is clear to me that executing 14+ instruction to decide if I can
> > optimize to use new single instruction optimization is not a good deal.
> 
> This is a horrible way to use this builtin. In the same way that using ifunc at 
> this level would also be horrible.
> 
Yes it is just an example, there are many more that you might find less
objectionable. But this is not about you or me.

> Even supposing that this builtin uses a single load, you've at least doubled 
> the overhead of using the insn. The user really should be aware of this and 
> manually hoist this check much farther up the call chain.  At which point the 
> difference between 2 cycles for a load and 40 cycles for a call is immaterial.
>
> And if the user is really concerned about unit tests, surely ifdefs are more 
> appropriate for this situation.  At the moment one can only test the P7 path on 
> P7 and the P8 path on P8; better if one can also test the P7 path on P8.
> 

Yes I know there are better alternatives.

This is not intended for use within GLIBC or by knowledgeable folks like
yourself and the GLIBC community. 

This about application developers in other communities and users where I
would settle for them to just support my platform with any optimization
that is somewhat sane.

The __Builtin_cpu_supports exist and I see its use. I don't see much use
of the more "complicated" approaches that we use in GLIBC. So it seem
reasonable to enable __builtin_cpu_supports for POWER but define the
implementation to be optimal for the PowerPC platform.

Most the argument against seems to be based on assumed "moral hazard".
Where you think what they are doing is stupid and so you refuse to help
them with any mechanisms that might make what they are doing, and will
continue to do, a little less stupid.

I appreciate the concern, but think this is odd position for a community
that uses phrases like "Free as in Freedom" to describe what they do.

I think it is better to help all communities do things in a less stupid
(more functional and better performance) way. 

> 
> r~
> 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]