This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: HWCAP is method to determine cpu features, not selection mechanism.


On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 10-06-2015 12:09, Ondřej Bílka wrote:
> > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 10-06-2015 11:16, Szabolcs Nagy wrote:
> >>> On 10/06/15 14:35, Adhemerval Zanella wrote:
> >>>> I agree that adding an API to modify the current hwcap is not a good
> >>>> approach. However the costs you are assuming here are *very* x86 biased,
> >>>> where you have only one instruction (movl <variable>(%rip), %<destiny>)
> >>>> to load an external variable defined in a shared library, where for
> >>>> powerpc it is more costly:
> >>>
> >>> debian codesearch found 4 references to __builtin_cpu_supports
> >>> all seem to avoid using it repeatedly.
> >>>
> >>> multiversioning dispatch only happens at startup (for a small
> >>> number of functions according to existing practice).
> >>>
> >>> so why is hwcap expected to be used in hot loops?
> >>>
> >>
> >> Good question, I do not know and I believe Steve could answer this
> >> better than me.  I am only advocating here that assuming x86 costs
> >> for powerpc is not the way to evaluate this patch.
> > 
> > Sorry, but your details don't matter when the underlying idea is just bad.
> > Even if getting hwcap otherwise took 20 cycles, it would still be a bad
> > idea. As you need to use hwcap only once, at initialization, its cost
> > is completely irrelevant.
> > 
> > First, as I explained, the major flaw of Steve's approach: how exactly do
> > you ensure that gcc won't insert a newer instruction that would lead to a
> > crash on an older platform?
> > 
> > Second, it makes no sense. If you are in a situation where hwcap
> > access becomes noticeable in a profile, the checking is also noticeable in
> > the profile. So use ifunc, which will save you the additional cycles spent
> > checking hwcap bits.
> > 
> > A programmer who uses hwcap in a hot loop is just incompetent. It stays
> > constant for the lifetime of the application. So he should make several
> > copies of the loop, each compiled with the appropriate options.
> > 
> > Then, even if the compiler handled these issues correctly, you will
> > probably lose more on missed compiler optimizations than your supposed
> > gain. The compiler can select a suboptimal path, as it doesn't want to
> > expand a function too much due to size concerns.
> > 
> > That is quite easy to show: for example, the following would be an order
> > of magnitude slower with hwcap than with ifuncs. The reason is that even
> > gcc-5.1 doesn't split it into two branches, each doing a shift. Instead it
> > emits a div instruction, which takes forever.
> > 
> > int hwcap;
> > unsigned int foo(unsigned int i)
> > {
> >   int d = 8;
> >   if (hwcap & 42)
> >     d = 4;
> >   return i / d;
> > }
> > 
> 
> And you can use GCC extensions to generate architecture specific instructions
> based on architecture specific flags (check testsuite/gcc.target/powerpc/ppc-target-1.c).
> And these are architecture specific and just a subset of options are enabled.
> 
> And my understanding is to optimize hwcap access to provide a 'better' way
> to enable '__builtin_cpu_supports' for powerpc.  IFUNC is another way to provide
> function selection, but it does not exclude that accessing hwcap through
> TLS is *faster* than current options. It is up to the developer to decide to use
> either IFUNC or __builtin_cpu_supports. Whether the developer uses it in
> hot loops or not, it is up to them to profile and use another way.
> 
> You can say the same about current x86 __builtin_cpu_supports support: you should
> not use it in loops, you should use ifunc, whatever.

Sorry, but no again. We are talking here about the difference between
variable access and tcb access. You forgot to count the total cost. That
includes the per-thread overhead of initializing hwcap, increased
per-thread memory usage, maintenance burden and increased cache misses.
If you access hwcap only rarely, as you should, then per-thread copies
would introduce a cache miss that is more costly than the GOT overhead. In
the GOT case the miss could be avoided, because all threads combined would
access the single shared copy more often.

So if your multithreaded application accesses hwcap maybe 10 times per run,
you would likely harm performance.
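
To illustrate the "read it once" pattern argued for here, a minimal sketch
follows. The hwcap word is consulted a single time to pick one of two
specialised functions, and the result is kept in an ordinary process-wide
pointer shared by all threads. The mask 42 is copied from the quoted example
and is not a real feature bit; foo_div4, foo_div8, select_foo and foo_impl
are hypothetical names, and the plain function pointer is only a portable
stand-in for an ifunc resolver, not glibc's actual mechanism.

```c
/* Hypothetical feature mask, copied from the quoted example;
   it does not correspond to a real hwcap bit.  */
#define FEATURE_MASK 42UL

/* Each variant has a constant divisor, so the compiler is free to
   emit a shift instead of a div instruction.  */
static unsigned int foo_div4 (unsigned int i) { return i / 4; }
static unsigned int foo_div8 (unsigned int i) { return i / 8; }

typedef unsigned int (*foo_fn) (unsigned int);

/* Pick the variant once, from a hwcap value supplied by the caller.  */
static foo_fn
select_foo (unsigned long hwcap)
{
  return (hwcap & FEATURE_MASK) ? foo_div4 : foo_div8;
}

/* One pointer for the whole process; meant to be set once at startup.
   Defaults to the baseline variant.  */
static foo_fn foo_impl = foo_div8;

/* Call this once during initialization, before any hot loop runs.  */
static void
init_foo (unsigned long hwcap)
{
  foo_impl = select_foo (hwcap);
}
```

On Linux the argument to init_foo could come from getauxval (AT_HWCAP);
after that, hot code only calls foo_impl and never tests hwcap again.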

Off the top of my head I could name ten functions where a tcb entry would
lead to much bigger performance gains. So if this is applicable, I will
submit a strspn improvement that keeps a 32-byte bitmask and checks whether
the second argument has changed. That would be a better use of TLS than
keeping hwcap data.
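
A rough sketch of what such a strspn cache might look like, to make the idea
concrete. This is not a proposed patch. It assumes the bitmask is 32 bytes
(one bit per possible byte value), uses C11 _Thread_local instead of a real
tcb slot, and detects "second argument unchanged" by comparing against a
saved copy of the accept string. The names cached_strspn, build_mask and
ACCEPT_CACHE_MAX are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limit on the accept strings we are willing to remember.  */
#define ACCEPT_CACHE_MAX 64

/* Per-thread cache: 256-bit membership mask plus the accept string
   it was built from.  */
static _Thread_local unsigned char cached_mask[32];
static _Thread_local char cached_accept[ACCEPT_CACHE_MAX];
static _Thread_local int cache_valid;

/* Set one bit in MASK for every byte that occurs in ACCEPT.  */
static void
build_mask (unsigned char mask[32], const char *accept)
{
  memset (mask, 0, 32);
  for (const unsigned char *p = (const unsigned char *) accept; *p; p++)
    mask[*p >> 3] |= (unsigned char) (1u << (*p & 7));
}

size_t
cached_strspn (const char *s, const char *accept)
{
  unsigned char local_mask[32];
  const unsigned char *mask;

  if (cache_valid && strcmp (accept, cached_accept) == 0)
    mask = cached_mask;                  /* Same accept set: reuse it.  */
  else if (strlen (accept) < ACCEPT_CACHE_MAX)
    {
      build_mask (cached_mask, accept);  /* Rebuild and remember it.  */
      strcpy (cached_accept, accept);
      cache_valid = 1;
      mask = cached_mask;
    }
  else
    {
      build_mask (local_mask, accept);   /* Too long to cache.  */
      mask = local_mask;
    }

  /* Count leading bytes of S whose bit is set in the mask.  The NUL
     terminator never has its bit set, so the loop stops there.  */
  const unsigned char *p = (const unsigned char *) s;
  while (mask[*p >> 3] & (1u << (*p & 7)))
    p++;
  return (size_t) (p - (const unsigned char *) s);
}
```

Whether the strcmp on every call pays for itself depends on how often
callers repeat the same accept string; that would have to be measured.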

