
Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386


On Fri, 2013-10-11 at 22:40 -0700, pinskia@gmail.com wrote:
> 
> > On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > 
> > Assuming the pthread_once unification I sent recently is applied, we
> > still have custom x86_64 and i386 variants of pthread_once.  The
> > algorithm they use is the same as the unified variant, so we would be
> > able to remove the custom variants if this doesn't affect performance.
> > 
> > The common case when pthread_once is executed is that the
> > initialization has already been performed, so the fast path is what
> > we should focus on.  (I haven't looked specifically at the generated
> > code for the slow path, but the algorithm is the same, and I assume
> > its performance is determined by the overhead of the synchronizing
> > instructions and futex syscalls, not by any differences between
> > compiler-generated code and the custom code.)
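> > 
> > For reference, a minimal C sketch of that shared fast path (this is
> > not the exact glibc code: __pthread_once_slow is a made-up name for
> > the out-of-line remainder, and the acquire load stands in for
> > whatever barrier the real implementation uses):
> > 
> >   #include <pthread.h>
> > 
> >   /* Hypothetical helper holding everything past the bit test.  */
> >   extern int __pthread_once_slow (pthread_once_t *, void (*) (void));
> > 
> >   int
> >   __pthread_once (pthread_once_t *once_control,
> >                   void (*init_routine) (void))
> >   {
> >     /* Bit 0x2 of *once_control is set once initialization has
> >        finished; the load needs acquire semantics so the caller sees
> >        the init routine's effects.  */
> >     if (__atomic_load_n ((int *) once_control, __ATOMIC_ACQUIRE) & 2)
> >       return 0;
> >     /* Otherwise: synchronize, run init_routine, set the done bit,
> >        and wake any waiters.  */
> >     return __pthread_once_slow (once_control, init_routine);
> >   }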
> > 
> > The fast path of the custom assembler version:
> >    testl    $2, (%rdi)     /* done bit set in *once_control?  */
> >    jz       1f             /* no: fall through to the slow path  */
> >    xorl     %eax, %eax     /* yes: return 0  */
> >    retq
> > 
> > The fast path of the generic pthread_once C code, as it is after the
> > pthread_once unification patch:
> >  20:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
> >  25:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
> >  2a:   48 89 fb                mov    %rdi,%rbx
> >  2d:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
> >  32:   48 89 f5                mov    %rsi,%rbp
> >  35:   48 83 ec 38             sub    $0x38,%rsp
> >  39:   41 b8 ca 00 00 00       mov    $0xca,%r8d
> >  3f:   8b 13                   mov    (%rbx),%edx
> >  41:   f6 c2 02                test   $0x2,%dl
> >  44:   74 16                   je     5c <__pthread_once+0x3c>
> >  46:   31 c0                   xor    %eax,%eax
> >  48:   48 8b 5c 24 20          mov    0x20(%rsp),%rbx
> >  4d:   48 8b 6c 24 28          mov    0x28(%rsp),%rbp
> >  52:   4c 8b 64 24 30          mov    0x30(%rsp),%r12
> >  57:   48 83 c4 38             add    $0x38,%rsp
> >  5b:   c3                      retq   
> 
> Seems like this is a good case where shrink wrapping should have helped.  What version of GCC did you try this with, and if it was 4.8 or later, can you file a bug for this missed optimization?

I used GCC 4.4.

If 4.8 generates a leaner fast path, would people consider that
enough reason not to split out the fast path manually?
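
To make the question concrete, the manual split would keep the fast
path as in the sketch quoted above and move everything else into an
out-of-line helper, roughly like this (the noinline attribute is
GCC's; the helper's body is elided):

  /* Hypothetical out-of-line slow path; kept out of the wrapper so
     the fast path needs no stack frame.  */
  static int __attribute__ ((noinline))
  __pthread_once_slow (pthread_once_t *once_control,
                       void (*init_routine) (void))
  {
    /* ... CAS on the state word, futex wait/wake, call init_routine,
       set the done bit ...  */
    return 0;
  }

With the slow path out of line, the wrapper saves no callee-saved
registers and reserves no stack, which is exactly what shrink
wrapping would give us without the source change.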

