This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
[RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- From: Torvald Riegel <triegel at redhat dot com>
- To: GLIBC Devel <libc-alpha at sourceware dot org>
- Cc: andi <andi at firstfloor dot org>
- Date: Fri, 11 Oct 2013 23:28:48 +0300
- Subject: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
Assuming the pthread_once unification I sent recently is applied, we
still have custom x86_64 and i386 variants of pthread_once. The
algorithm they use is the same as the unified variant, so we would be
able to remove the custom variants if this doesn't affect performance.
The common case when pthread_once is executed is that the initialization
has already been performed; thus, this is the fast path that we can
focus on.  (I haven't looked specifically at the generated code for the
slow path, but the algorithm is the same, and I assume its performance
is determined by the overhead of the synchronizing instructions and
futex syscalls, not by any differences between compiler-generated code
and the custom code.)
The fast path of the custom assembler version:
testl $2, (%rdi)
jz 1f
xorl %eax, %eax
retq
The fast path of the generic pthread_once C code, as it is after the
pthread_once unification patch:
20: 48 89 5c 24 e8 mov %rbx,-0x18(%rsp)
25: 48 89 6c 24 f0 mov %rbp,-0x10(%rsp)
2a: 48 89 fb mov %rdi,%rbx
2d: 4c 89 64 24 f8 mov %r12,-0x8(%rsp)
32: 48 89 f5 mov %rsi,%rbp
35: 48 83 ec 38 sub $0x38,%rsp
39: 41 b8 ca 00 00 00 mov $0xca,%r8d
3f: 8b 13 mov (%rbx),%edx
41: f6 c2 02 test $0x2,%dl
44: 74 16 je 5c <__pthread_once+0x3c>
46: 31 c0 xor %eax,%eax
48: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
4d: 48 8b 6c 24 28 mov 0x28(%rsp),%rbp
52: 4c 8b 64 24 30 mov 0x30(%rsp),%r12
57: 48 83 c4 38 add $0x38,%rsp
5b: c3 retq
The only difference is more stack save/restore.  However, a quick run of
benchtests/pthread_once (see the patch I sent for review) on my laptop
doesn't show any noticeable difference between the two (averages of 8
runs of the microbenchmark differ by 0.2%).
When splitting out the slow path like this:

static int
__attribute__ ((noinline))
__pthread_once_slow (once_control, init_routine)
     pthread_once_t *once_control;
     void (*init_routine) (void);
/* ... */

int
__pthread_once (once_control, init_routine)
     pthread_once_t *once_control;
     void (*init_routine) (void);
{
  int val;

  val = *once_control;
  atomic_read_barrier ();
  if (__builtin_expect ((val & __PTHREAD_ONCE_DONE) != 0, 1))
    return 0;
  else
    return __pthread_once_slow (once_control, init_routine);
}
we get this for the C variant's fast path:
00000000000000e0 <__pthread_once>:
e0: 8b 07 mov (%rdi),%eax
e2: a8 02 test $0x2,%al
e4: 74 03 je e9 <__pthread_once+0x9>
e6: 31 c0 xor %eax,%eax
e8: c3 retq
e9: 31 c0 xor %eax,%eax
eb: e9 30 ff ff ff jmpq 20 <__pthread_once_slow>
This is very close to the fast path of the custom assembler code.
I haven't looked further at i386, but the custom code is pretty similar
to the x86_64 variant.
What do you all prefer?
1) Keep the x86-specific assembler versions?
2) Remove the x86-specific assembler versions and split out the slow
path?
3) Just remove the x86-specific assembler versions?