
Re: HWCAP is method to determine cpu features, not selection mechanism.


On Fri, 2015-06-26 at 06:59 +0200, Ondřej Bílka wrote:
> On Thu, Jun 25, 2015 at 10:58:46AM -0500, Steven Munroe wrote:
> > On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote:
> > > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> > > > On Tue, 2015-06-09 at 17:42 +0200, OndÅej BÃlka wrote:
> > > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > > > > > > 
> > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > > > > > > The proposed patch adds a new feature for powerpc. In order to get
> > > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > > > > > > This enables users to write versioned code based on the HWCAP bits
> > > > > > > > without going through the overhead of reading them from the auxiliary
> > > > > > > > vector.
> > > > > > 
> > > > > > > I assume this is for multi-versioning.
> > > > > > 
> > > > > > The intent is for the compiler to implement the equivalent of
> > > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction;
> > > > > > POWER is RISC, so we use the HWCAP. The trick is to access
> > > > > > HWCAP/HWCAP2 efficiently, as calling getauxval and scanning the
> > > > > > auxv is too slow for inline optimizations.
> > > > > > 
> > > [snip]
> > 
> > After all was said and done, much more was said than done ...
> > 
> > Sorry, I have been on vacation and then catching up on the day job
> > after being on vacation.
> > 
> > But I think we need to reset the discussion and hopefully eliminate
> > some misconceptions:
> > 
> > 1) This is not about the clever things that this community knows how
> > to do; it is about what the average Linux application developer is
> > willing to learn and use.
> > 
> No, the discussion is about what will lead to the biggest overall
> performance gain. Clearly the best solution would be a compiler that
> automatically produces the best code for each cpu; then the average
> application developer doesn't have to learn anything.
> 
Unfortunately this is not a realistic expectation in the real world.
Nothing is ever as simple as you would like.


> > I have tried to get them to use: CPU platform libraries (library
> > search based on AT_PLATFORM), the AuxV and HWCAP directly, and IFUNC.
> > They will not do this.
> > 
> > They think this is all silly and too complicated. But we still want them
> > to exploit features of the latest processor while continuing to run on
> > existing processors in the field. Processor architectures evolve and we
> > have to give them a simple mechanism that they will actually use, to
> > handle this.  __builtin_cpu_supports() seems to be something they will
> > use.
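> > 
> > For illustration, the usage we have in mind is roughly the sketch
> > below. This is a sketch only: the feature-name string and the helper
> > functions are hypothetical, not settled API.
> > 
> > extern void compute_power8 (double *, const double *, int);
> > extern void compute_generic (double *, const double *, int);
> > 
> > /* Pick a tuned path at run time.  "arch_2_07" (PowerISA 2.07,
> >    i.e. POWER8) is an illustrative feature name.  */
> > void
> > compute (double *dst, const double *src, int n)
> > {
> >   if (__builtin_cpu_supports ("arch_2_07"))
> >     compute_power8 (dst, src, n);
> >   else
> >     compute_generic (dst, src, n);
> > }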
> >
> There is an error in that reasoning: something needs to be done; X is
> something; so X needs to be done.
> 
> They are wrong that ifunc and AT_PLATFORM are silly, but correct that
> it's complicated, because the problem is complicated.
> 
> As I said before, it could do more harm than good. One example: an app
> programmer uses __builtin_cpu_supports but compiles the file with
> -mcpu=power8 to get the features he wants. Then after upgrading gcc the
> application breaks, as gcc inserted an unsupported instruction into the
> non-power8 branch.
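> 
> To make that pitfall concrete, a sketch (assuming the whole file is
> compiled with -mcpu=power8; the function and feature names here are
> mine, for illustration):
> 
> /* gcc may schedule or hoist a POWER8-only instruction (e.g. from a
>    vectorized loop body) above or into the "generic" branch, because
>    the whole translation unit is compiled -mcpu=power8, so even that
>    branch can fault on a pre-POWER8 machine.  */
> void
> scale (double *a, int n, double c)
> {
>   if (__builtin_cpu_supports ("arch_2_07"))
>     {
>       /* tuned path: power8 instructions expected here */
>       for (int i = 0; i < n; i++)
>         a[i] *= c;
>     }
>   else
>     {
>       /* intended fallback: but -mcpu=power8 applies here too */
>       for (int i = 0; i < n; i++)
>         a[i] *= c;
>     }
> }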
> 
> Also it's dubious that an average programmer could do better than gcc
> with the correct -mcpu flag. I asked before whether you could measure
> the impact of compiling applications with the correct -mcpu and whether
> hwcap could beat it.
> 
> For that you need distro maintainers to set up compiling with
> AT_PLATFORM... and that will also cover the libraries whose developers
> don't care about the powerpc niche platform.
> 
> If programmers don't use something, it means the interface is bad and
> you should come up with a better interface.
> 
> The best interface would be to tell them to use the flags -O3
> -mmulticpu, where -mmulticpu would take care of the details by using
> AT_PLATFORM/ifuncs...
> 
> Or you could tell them to use __attribute__((multicpu)) for hot
> functions; below is how to implement that with a macro that wraps
> ifunc. Would they do better than just adding this to each function that
> shows more than 1% of total time in a profile?
> 
> __attribute__((multicpu)) int foo (double x, double y)
> {
>   return x * y;
> }
> 
> or
> 
> multicpu (int, foo, (x, y), (double x, double y))
> {
>   return x * y;
> }
> 
> with
> 
> #define multicpu(tp, name, arg, tparg) \
> tp __##name tparg; \
> __attribute__((__target__("cpu=power5"))) \
> tp __##name##_power5 tparg \
> { \
>   return (tp) __##name arg; \
> } \
> __attribute__((__target__("cpu=power6"))) \
> tp __##name##_power6 tparg \
> { \
>   return (tp) __##name arg; \
> } \
> tp name tparg \
> { \
>  /* select ifunc */ \
> } \
> tp __##name tparg
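> 
> The /* select ifunc */ part could be filled in with gcc's ifunc
> attribute and a resolver that reads the hwcap. A sketch, assuming the
> Linux/powerpc kernel headers for the feature bits; the resolver runs
> once, at relocation time:
> 
> #include <sys/auxv.h>       /* getauxval, AT_HWCAP */
> #include <asm/cputable.h>   /* PPC_FEATURE_* bits */
> 
> int __foo (double x, double y);
> int __foo_power5 (double x, double y);
> int __foo_power6 (double x, double y);
> 
> /* Runs once, when the dynamic linker resolves foo.  */
> static void *
> foo_resolver (void)
> {
>   unsigned long hw = getauxval (AT_HWCAP);
>   if (hw & PPC_FEATURE_ARCH_2_05)     /* POWER6 or later */
>     return (void *) __foo_power6;
>   if (hw & PPC_FEATURE_POWER5)
>     return (void *) __foo_power5;
>   return (void *) __foo;
> }
> 
> int foo (double x, double y) __attribute__ ((ifunc ("foo_resolver")));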
> 
> 
> Also, did you try asking application programmers, after they used
> __builtin_cpu_supports, whether they tested it on both machines?
> 
> That's pretty basic, and it wouldn't be a surprise if it regularly
> introduced regressions, as the feature needs to be used in a certain
> way.
> 
> I recalled a new pitfall: the user needs to ensure the gains exceed the
> costs. How big is the typical powerpc branch predictor cache? If a user
> adds __builtin_cpu_supports checks to less frequent functions, the
> branch may always be mispredicted, as it isn't in the cache, and you
> pay for the increased code size.
> 
You assume a lot.

You assume my team and I do not know these techniques. We do.

You assume my team and I do not practice these techniques in our own
code. We do.

You assume we do not advise our customers to use these techniques and
provide documentation on these topics. We do:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550


> 
> If the situation is the same as on x64, then the mere fact that a cpu
> supports foo means nothing. You need to be quite careful about how you
> use a feature to get an improvement.
> 
> For example, take optimizing a loop with avx/avx2. You have three
> choices:
> 1. use 256-bit loads/stores and a 256-bit loop operation
> 2. use 128-bit loads/stores and merge/split them for a 256-bit loop
>    operation
> 3. use 128-bit loads/stores and a 128-bit loop operation.
> 
> What you choose depends on whether you do unaligned loads/stores or
> not. As these are quite expensive on fx10, you need to special-case it
> even though it supports avx. On Ivy Bridge splitting/merging gives a
> performance improvement, but the penalty is smaller. On Haswell, 256-bit
> loads/stores are faster than splitting/merging.
> 
> That was quite a simple example. To complicate matters further, even on
> Haswell 256-bit loads/stores have high latency, so you need to use them
> only in loops.
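> 
> In code, choices 1 and 2 look roughly like the sketch below (intel
> intrinsics; assumes n is a multiple of 4, and the names are mine, for
> illustration):
> 
> #include <immintrin.h>
> 
> /* Choice 1: full 256-bit unaligned loads/stores.  */
> void
> scale256 (double *dst, const double *src, long n, double c)
> {
>   __m256d vc = _mm256_set1_pd (c);
>   for (long i = 0; i < n; i += 4)
>     _mm256_storeu_pd (dst + i,
>                       _mm256_mul_pd (_mm256_loadu_pd (src + i), vc));
> }
> 
> /* Choice 2: two 128-bit loads merged into one 256-bit operation,
>    then split again for the stores.  */
> void
> scale128 (double *dst, const double *src, long n, double c)
> {
>   __m256d vc = _mm256_set1_pd (c);
>   for (long i = 0; i < n; i += 4)
>     {
>       __m256d v = _mm256_castpd128_pd256 (_mm_loadu_pd (src + i));
>       v = _mm256_insertf128_pd (v, _mm_loadu_pd (src + i + 2), 1);
>       v = _mm256_mul_pd (v, vc);
>       _mm_storeu_pd (dst + i, _mm256_castpd256_pd128 (v));
>       _mm_storeu_pd (dst + i + 2, _mm256_extractf128_pd (v, 1));
>     }
> }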
> 
You assume that my team and I do not know about loop unrolling. We do.

You assume that we do not tell our customers this. We do.

However, in this discussion, the performance characteristics of Intel
processors are irrelevant.

> 
> > 2) This is not about exposing a private GLIBC resource (TCB) to the
> > compiler. The TCB and TLS are part of the Platform ABI and must be
> > known, used, and understood by the compiler (GCC, LLVM, ...), binutils,
> > debuggers, etc., in addition to GLIBC:
> > 
> > Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for
> > Linux Supplement: Section 3.7.2 TLS Runtime Handling
> > 
> > This and other useful documents are available from the OpenPOWER
> > Foundation: http://openpowerfoundation.org/
> > 
> > If you look, you will see that TCB slots have already been allocated
> > to support other PowerISA-specific features like Event-Based Branching,
> > Dynamic System Optimization, and Target Address Save. Recently we added
> > split-stack support for the Go language, which required a TCB slot. So
> > adding a doubleword slot to cache AT_HWCAP and AT_HWCAP2 is no big
> > deal.
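> > 
> > Reading it is then just a load at a fixed offset from the thread
> > pointer (r13 in the ppc64 ELF ABI). A sketch; the offset below is a
> > placeholder, not the value the ABI will finally assign:
> > 
> > /* Hypothetical offset of the cached hwcap doubleword relative to
> >    the thread pointer; -0x7068 is a placeholder, not the real ABI
> >    value.  */
> > #define TCB_HWCAP_OFFSET (-0x7068)
> > 
> > static inline unsigned long
> > tcb_hwcap (void)
> > {
> >   register char *tp __asm__ ("r13");  /* thread pointer */
> >   return *(unsigned long *) (tp + TCB_HWCAP_OFFSET);
> > }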
> > 
> > So far this all fits nicely in a single 128-byte cache line. The TLS
> > ABI (which I defined back in 2004) reserved a full 4KB for the TCB and
> > extensions.
> > 
> > None of this was done lightly; it was discussed extensively with the
> > appropriate developers in the corresponding projects. You all may not
> > have seen this because GLIBC is not directly involved except as the
> > owner of ./sysdeps/powerpc/nptl/tls.h
> > 
> You should have said first that it uses reserved memory.
> 
> So it isn't an issue now. But if the plt is as expensive as you say,
> the reserved space will quickly fill up. Save the strcmp address in the
> tcb to improve performance: as strcmp is the most called function in
> libc, you would save several orders of magnitude more on plt
> indirections than on the rarer hwcap lookups. Then continue with less
> frequently called functions for as long as that makes sense.
> 
You assume my team and I do not know the performance characteristics of
our own platform. We do. 

You too could learn more by reading the 'POWER8 Processor User's Manual
for the Single-Chip Module', available on OpenPOWER.org.


> 
> > The only reason we raised this discussion here is that we wanted to
> > publish a platform-specific API
> > in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make it easier for
> > the compilers to access it. And we felt it would be rude not to discuss
> > this with the community.
> > 
> > 3) I would think that the platform maintainers would have the standing
> > to implement their own platform ABI? Perhaps the project maintainers
> > would like to weigh in on this?
> > 
> > 4) I have asked Carlos Seo to develop some micro benchmarks to
> > illuminate the performance implications of the various alternatives to
> > the direct TCB access proposal. If necessary we can provide detailed,
> > cycle-accurate instruction pipeline timings.
> >
> Please do benchmarks; microbenchmarks are not very useful. They measure
> the small constant c in the expression c*x - y, where positive means an
> improvement. If x is a hundred times y, then the exact value of c
> doesn't matter.
> 
You assume that I do not know how to develop benchmarks that are
repeatable and meaningful. I do. How many books have you published on
that topic?

You don't know my platform.

You don't know my customers.

You don't know my team.

You don't know me.

But you assume a lot that is just irrelevant and/or not factually true.

At this point you are acting like a troll that just disagrees with
everything said.

> There are still unknown basic use cases; it doesn't make sense to do a
> detailed measurement only to find that on average it saves a hundred
> cycles per app, but it is used by one app in a thousand and costs every
> app that doesn't use it a cycle. That's a net loss. Also performance
> will vary depending on how frequent the usage is: when it's mostly in
> cold code, you have the problem that the hwcap branch is always
> mispredicted, plus increased instruction cache usage, so not using
> hwcap could be better if the saving is small.
> 
Again I have to live in the real world and deal with real customers who
are not too interested in my platform problems. They just want a
simple/quick solution that is easy for them to understand.

I am just trying to provide an option for them to use.

> So get some of these average programmers, let them optimize some app
> with hwcap, and then check the result.
> 

We are done with this discussion. 


