This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB


On Wed, Jul 01, 2015 at 02:12:20PM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote:
> > On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> > > 
> > > The proposed patch adds a new feature for powerpc. In order to get
> > > faster access to the HWCAP/HWCAP2 bits, we now store them in the
> > > TCB. This enables users to write versioned code based on the HWCAP
> > > bits without going through the overhead of reading them from the
> > > auxiliary vector.
> > > 
> > > A new API is published in ppc.h for get/set the bits in the
> > > aforementioned memory area (mainly for gcc to use to create
> > > builtins).
> > 
> > Do you have any justification (actual performance figures for a
> > real-world usage case) for adding ABI constraints like this? This is
> > not something that should be done lightly. My understanding is that
> > hwcap bits are normally used in initializing functions pointers (or
> > equivalent things like ifunc resolvers), not again and again at
> > runtime, so I'm having a hard time seeing how this could help even if
> > it does make the individual hwcap accesses measurably faster.
> > 
> > It would also be nice to see some justification for the magic number
> > offsets. Will they be stable under changes to the TCB structure or
> > will preserving them require tip-toeing around them?
> > 
> 
> This discussion has metastasizes into so many side discussions, meta
> discussion, personal opinions etc that i would like to start over at the
> point where we where still discussing how to implement something
> reasonable. 
> 
> First a level set on requirements and goals.
> 
> The intent is allow application developers to develop new application
> for Linux on Power and simply the porting of existing Linux applications
> to Power. And encourage then to apply the same level of platform
> optimization to Power as they do for other Linux platforms.
>
>From your proposal it didn't seem so. If this is the goal then you
should reach wider consensus to find cross-platform mechanism as
programmers would be discouraged by having to learn yet another custom
interface for powerpc.

 
> While there are a near infinity of options (some of which some members
> of this community think are stupid) and I have seen them all being used.
> As general rule I find it counter productive to call the customer (All
> Linux Application Developers are our customers) stupid their face, so I
> try to explain the options and encourage them to use many of the
> techniques that this community thinks are not stupid.
> 
That is good policy but that forces you to have higher standard and
strive to newer make a bad suggestion as customers could accept that
without question. So next time make clear that its customer wish and you
personally oppose that otherwise we would rigthfully tell you that you
shouldn't make that mistake.

> But as rule the application developer are busy, don't have much patience
> for nonsense like IFUNC and AT_PLATFORM library search strategies. They
> tend to use what they already know, apply minimal effort to solve the
> immediate problem, and move on!
> 
> One of the "things they already know" is the __built_cpu_is()
> __built_cpu_supports() GCC builtins for X86. To goal of this simple
> proposal is enable that for powerpc powerpc64 and powerpc64le, based on
> the existing AT_HWCAP/AT_HWCAP2 mechanisms.
>
While that is true you are too focused on __builtin_cpu_supports using
hwcap to see bigger picture. You need to distinguish between primitives
(hwcap, ifunc, AT_LIBRARY, fat libraries...) and interfaces. A
__builtin_cpu_supports is one interface and not neccesarry best one.

Finding a best interface is worthwide goal and I don't like to introduce 
worse interfaces as average programmers would use them and it would be
harder to change than if they did rigth one in first place.

How __builtin_cpu_supports is implemented is irrelevant. Gcc could
decide to make fat library that replaces every function by ifunc, use
ifunc for every function with __built_cpu_supports, we could add support
of fat libraries to linker... 

So why teach users ugly interface if they could use better and safer ones?
 
> Another observation is that many of these applications are deployed as
> shared object libraries and frequently are not linked directly to the
> main application but loaded via dl_open() are runtime. So clever
> solutions that are only simple and/or fast from a main programs but
> difficult and/or slow for dl_open() library are not an option.
> 
That removes performance argument why gcc shouldn't use ifunc. As these
use plt it wouldn't slow them down but checking hwcap bit inside
function would.

> They are very firm about a "single binary built" for all supported
> distros and all supported hardware generations.
> 
Needs of linux community are different that needs of your customers. You
have problem that platform-specific code increases size so for
distributions a best way would be split it into several files and they
could transfer package with binaries optimized only for current
cpu+generic ones. That requirement forces fat binaries and would
increase compile time a lot.

> And finally these applications tend to be massive C++ programs composed
> of smallish members functions and byzantine layers of templates. I have
> not observed wide use of private/hidden and so these libraries tend to
> expose every member function as a PLT entry, which resists most
> in-lining opportunities.
> 
And why customer couldn't use gcc -symbolic? This is strong argument
againist hwcap optimization. You could pair each hwcap use with plt
overhead of function and you will probably lose more cycles from
increased instruction cache usage than few cycles in member function
that does single operation.

And when you mentioned templates how often would you see uses like

template foo <bool could_do_x, bool could_do_y> ...

foo <__builtin_cpu_supports (x), __builtin_cpu_supports (y)> f;

> Net this is a harder problem then it looks.
> 
> So lets write down some requirements:
> 
> 0) Something the average application developer will understand and use.

Thats problem with __builtin_cpu_supports. Developers would use that but
not understand. Instead of being fixed on that a easier would be adding
a flag to gcc to handle that. Gcc could support builtins with multiple
implementation that when compiled would generate several variants
depending on cpu.

Or if __builtin_cpu_supports is used then gcc should treat it on higher
level and split that into two functions where one use feature but other
don't.


> 1) In any user library, in including ones loaded via LD_PRELOAD and
> dl_open().
> 2) Across multiple Distro versions and across Distros (using different
> GLIBC versions).
> 
> And goals for the Power implementation:
> 
> 1) As fast as possible accounting for the limits of the ABI, ISA and
> Micro-architecture.
> 1a) Minimal path length to obtain the hwcap bit vector for test
> 1b) Limited exposure to micro-architecture hazards including
> indirection.
> 2) Simple and reliable initialization of the cached values.
> 3) And without relying on .text relocation in libraries.
> 
> First lets dispose of the obvious. Extern static variables.
> 
> This is not horrible for larger grained examples but can be less than
> optimal for fine grained C++ examples. As stated above the hwcap will
> not be local to the user library. As PowerISA does not have PC-relative
> addressing our ABI requires that R2 (AKA the TOC pointer) is set to
> address the local (for this libraries) GOT/TOC/PLT before we access any
> static variable and extern require an indirect load of the extern hwcap
> address from the GOT/TOC.
> 
> In addition, since we are potentially changing R2 (AKA the TOC pointer)
> we are now obligated to save and restore the R2.
> 
> Now the design of POWER assumes that as RISC architecture with lots of
> registers and being designed for massive memory bandwidth and
> out-of-order execution, the processor core does no optimized for
> programs that store to then immediately reload from a memory location.
> In a machine with 16-pipelines per core and capable of dispatching up to
> 8 instructions per cycle, "immediate" has an amazingly broad definition
> (many 10s of instructions).
> 
> So the store and reload of the TOC pointer can hit the Load-hit-store
> hazard (essentially the load got issued (out-of-order) before the store
> it depended on was complete or at a stage where a bypass was available)
> even across the execution of the called function. While the core detects
> and corrects this state, it does so in a heavy handed way (instruction
> rejects (11 cycles each) or instruction fetch flush (worse)). Lets just
> say this is something to avoid if you can.
> 
> So introducing a static variable to C++ functions that would not
> normally access static should be avoided. Many C++ member functions are
> small enough execute completely within the available (volatile) register
> and don't even need a stack-frame. So a __builtin_cpu_supports() design
> based on none local extern static would be a unforced error in these
> cases.
> 
> Of course the TCB based proposal avoids all of this because the TCB
> pointer (R13) is constant across all functions in a thread (not
> save/restored in the user application).
> 

Which isn't obvious at all. Main mistake is assuming that variable needs
to be static. There is no reason why gcc shouldn't generate code
equivalent to including hwcap.h and adding equivalent of hwcap.c when linking:

hwcap.h:
int __hwcap __attribute__ ((visibility ("hidden"))) ;

hwcap.c:

#include <hwcap.h>

// gcc needs to make this first constructor.
extern int __global_hwcap;
void __attribute__ ((constructor)) 
set_hwcap () 
{
  __hwcap = __global_hwcap;
}

Also this is friendlier when we use optimization to use a byte to store
each hwcap bit.



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]