This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PATCH 3/2] Use strspn/strcspn/strpbrk ifunc in internal calls.
- From: OndÅej BÃlka <neleai at seznam dot cz>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 18 Mar 2014 11:01:38 +0100
- Subject: [PATCH 3/2] Use strspn/strcspn/strpbrk ifunc in internal calls.
- Authentication-results: sourceware.org; auth=none
- References: <20140227123238 dot GA26291 at domone dot podge> <20140227124206 dot GA26474 at domone dot podge> <5318A03D dot 3000705 at redhat dot com> <20140306163241 dot GA11843 at domone dot podge> <5318B58B dot 5040704 at redhat dot com> <20140306205212 dot GB11843 at domone dot podge> <53192422 dot 2050101 at redhat dot com>
To make a strtok faster and improve performance in general we need to do one
additional change.
A comment:
/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
The speedup we get from using SSE4.2 instruction is likely eaten away
by the indirect call in the PLT. */
Does not make sense at all because nobody bothered to check it. Gap
between these implementations is quite big, when haystack is empty a
sse2 is around 40 cycles slower because it needs to populate a lookup
table and difference only increases with size. That is much bigger than
plt slowdown which is few cycles.
Even benchtest show a gap which also may be reverse by branch
misprediction but my internal benchmark shown.
simple_strspn stupid_strspn __strspn_sse42 __strspn_sse2
Length 0, alignment 0, acc len 6: 18.6562 35.2344 17.0469 61.6719
Length 6, alignment 0, acc len 6: 59.5469 72.5781 16.4219 73.625
This patch also handles strpbrk which is implemented by including a
x86_64/multiarch/strcspn.S file.
* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
diff --git a/sysdeps/x86_64/multiarch/strcspn.S b/sysdeps/x86_64/multiarch/strcspn.S
index 24f55e9..1b3e1aa 100644
--- a/sysdeps/x86_64/multiarch/strcspn.S
+++ b/sysdeps/x86_64/multiarch/strcspn.S
@@ -65,14 +65,7 @@ END(STRCSPN)
# undef END
# define END(name) \
cfi_endproc; .size STRCSPN_SSE2, .-STRCSPN_SSE2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
- The speedup we get from using SSE4.2 instruction is likely eaten away
- by the indirect call in the PLT. */
-# define libc_hidden_builtin_def(name) \
- .globl __GI_STRCSPN; __GI_STRCSPN = STRCSPN_SSE2
#endif
-
#endif /* HAVE_SSE4_SUPPORT */
#ifdef USE_AS_STRPBRK
diff --git a/sysdeps/x86_64/multiarch/strspn.S b/sysdeps/x86_64/multiarch/strspn.S
index bf7308e..fde1e1e 100644
--- a/sysdeps/x86_64/multiarch/strspn.S
+++ b/sysdeps/x86_64/multiarch/strspn.S
@@ -50,12 +50,6 @@ END(strspn)
# undef END
# define END(name) \
cfi_endproc; .size __strspn_sse2, .-__strspn_sse2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strspn calls through a PLT.
- The speedup we get from using SSE4.2 instruction is likely eaten away
- by the indirect call in the PLT. */
-# define libc_hidden_builtin_def(name) \
- .globl __GI_strspn; __GI_strspn = __strspn_sse2
#endif
#endif /* HAVE_SSE4_SUPPORT */