This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


[RFC] pthread_once: Use unified variant instead of custom x86_64/i386


Assuming the pthread_once unification I sent recently is applied, we
still have custom x86_64 and i386 variants of pthread_once.  The
algorithm they use is the same as the unified variant's, so we could
remove the custom variants provided this doesn't affect performance.

The common case when pthread_once is executed is that the initialization
has already been performed, so this fast path is what we should focus
on.  (I haven't looked specifically at the generated code for the slow
path, but the algorithm is the same, and I assume that its performance
is determined by the overhead of the synchronizing instructions and
futex syscalls rather than by any differences between compiler-generated
code and the custom code.)
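
For illustration, a typical caller (just a sketch, not part of the patch)
looks like the following; once the first call has run the initializer,
every subsequent call only needs to observe the "done" flag, which is
exactly the fast path compared below:

#include <pthread.h>
#include <stdio.h>

/* Hypothetical example caller.  Only the very first call executes
   init_subsystem; all later calls take the fast path.  */
static pthread_once_t subsys_once = PTHREAD_ONCE_INIT;

static void
init_subsystem (void)
{
  puts ("initialized exactly once");
}

void
use_subsystem (void)
{
  pthread_once (&subsys_once, init_subsystem);
  /* ... use the initialized state ... */
}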

The fast path of the custom assembler version:
	testl	$2, (%rdi)
	jz	1f
	xorl	%eax, %eax
	retq
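
In C terms, these four instructions merely test the "initialization done"
bit and return 0 if it is set.  Roughly the following sketch (not the
actual source; the helper name is made up, the flag name is the one used
in the generic C code quoted further down, and the value 2 is inferred
from the "testl $2"):

#include <pthread.h>

/* Sketch only: the custom fast-path check, expressed in C.  */
static int
once_already_done (pthread_once_t *once_control)
{
  return (*once_control & 2) != 0;  /* 2 == __PTHREAD_ONCE_DONE (assumed) */
}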

The fast path of the generic pthread_once C code, as it is after the
pthread_once unification patch:
  20:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
  25:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
  2a:   48 89 fb                mov    %rdi,%rbx
  2d:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
  32:   48 89 f5                mov    %rsi,%rbp
  35:   48 83 ec 38             sub    $0x38,%rsp
  39:   41 b8 ca 00 00 00       mov    $0xca,%r8d
  3f:   8b 13                   mov    (%rbx),%edx
  41:   f6 c2 02                test   $0x2,%dl
  44:   74 16                   je     5c <__pthread_once+0x3c>
  46:   31 c0                   xor    %eax,%eax
  48:   48 8b 5c 24 20          mov    0x20(%rsp),%rbx
  4d:   48 8b 6c 24 28          mov    0x28(%rsp),%rbp
  52:   4c 8b 64 24 30          mov    0x30(%rsp),%r12
  57:   48 83 c4 38             add    $0x38,%rsp
  5b:   c3                      retq   

The only difference is the additional stack save/restore.  However, a
quick run of benchtests/pthread_once (see the patch I sent for review) on
my laptop doesn't show any noticeable difference between the two
(averages of 8 runs of the microbenchmark differ by 0.2%).
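
For reference, the measurement essentially boils down to timing repeated
pthread_once calls on an already-initialized once_control.  A
stripped-down stand-in for the benchtest (a sketch only, not the actual
benchtests/pthread_once code) would be something like:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void
do_init (void)
{
}

int
main (void)
{
  struct timespec start, end;
  unsigned long i;
  double secs;

  /* Run the initialization once so that all timed calls hit the fast path.  */
  pthread_once (&once, do_init);

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (i = 0; i < ITERATIONS; i++)
    pthread_once (&once, do_init);
  clock_gettime (CLOCK_MONOTONIC, &end);

  secs = (end.tv_sec - start.tv_sec)
         + (end.tv_nsec - start.tv_nsec) * 1e-9;
  printf ("%.2f ns per call\n", secs * 1e9 / ITERATIONS);
  return 0;
}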

When splitting out the slow path like this:

static int
__attribute__((noinline))
__pthread_once_slow (once_control, init_routine)
     pthread_once_t *once_control;
     void (*init_routine) (void);
/* ... */

int
__pthread_once (once_control, init_routine)
     pthread_once_t *once_control;
     void (*init_routine) (void);
{
  int val;
  val = *once_control;
  atomic_read_barrier();
  if (__builtin_expect ((val & __PTHREAD_ONCE_DONE) != 0, 1))
    return 0;
  else
    return __pthread_once_slow(once_control, init_routine);
}

we get this for the C variant's fast path:

00000000000000e0 <__pthread_once>:
  e0:   8b 07                   mov    (%rdi),%eax
  e2:   a8 02                   test   $0x2,%al
  e4:   74 03                   je     e9 <__pthread_once+0x9>
  e6:   31 c0                   xor    %eax,%eax
  e8:   c3                      retq   
  e9:   31 c0                   xor    %eax,%eax
  eb:   e9 30 ff ff ff          jmpq   20 <__pthread_once_slow>

This is very close to the fast path of the custom assembler code.

I haven't looked further at i386, but the custom code is pretty similar
to the x86_64 variant.
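
For completeness, the slow path that ends up in __pthread_once_slow is
the usual futex-based once protocol.  A deliberately simplified sketch
(this is not the actual unified code, which also has to deal with things
like cancellation of the init routine and the fork generation counter)
looks roughly like:

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Simplified sketch; state values assumed here: 0 = not started,
   1 = initialization in progress, 2 = done.  */
static int
once_slow (int *once_control, void (*init_routine) (void))
{
  while (1)
    {
      int expected;
      int val = __atomic_load_n (once_control, __ATOMIC_ACQUIRE);
      if (val & 2)
        /* Another thread finished the initialization in the meantime.  */
        return 0;

      expected = 0;
      if (__atomic_compare_exchange_n (once_control, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        {
          /* We won the race: run the initializer, publish the result,
             and wake any waiters.  */
          init_routine ();
          __atomic_store_n (once_control, 2, __ATOMIC_RELEASE);
          syscall (SYS_futex, once_control, FUTEX_WAKE, INT_MAX,
                   NULL, NULL, 0);
          return 0;
        }

      /* Someone else is running the initializer; block until the futex
         word changes away from "in progress".  */
      syscall (SYS_futex, once_control, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}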


What do you all prefer?
1) Keep the x86-specific assembler versions?
2) Remove the x86-specific assembler versions and split out the slow
path?
3) Just remove the x86-specific assembler versions?

