This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
RE: bzero/bcopy/bcmp/mempcpy (was: Improve strncpy performance further)
- From: "Wilco Dijkstra" <wdijkstr at arm dot com>
- To: "'Roland McGrath'" <roland at hack dot frob dot com>
- Cc: <libc-alpha at sourceware dot org>
- Date: Thu, 15 Jan 2015 15:48:47 -0000
- Subject: RE: bzero/bcopy/bcmp/mempcpy (was: Improve strncpy performance further)
Roland McGrath wrote:
> Wilco Dijkstra wrote:
> > We need something like this in string.h so we always optimize all calls to
> > standard optimized functions, irrespectively of the compiler and options used:
>
> We would need that if we wanted to do that. But these entrypoints are all
> old and deprecated. They are only for the benefit of old code. Any code
> so old that it hasn't been touched since there were actually systems to
> build it on that don't have the C89 standard functions surely has worse
> performance issues than this. Making the deprecated functions optimal only
> encourages people to keep using them.
Agreed; however, they appear to be used in a lot of code, including benchmarks.
For example, a quick grep shows a large number of occurrences of bzero and
bcopy in SPEC2006.
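
To make the idea concrete, here is a minimal sketch of how the legacy calls
could be forwarded to the standard functions that targets usually optimize.
This is not the actual string.h change under discussion; the sketch_ names
are hypothetical and chosen only to avoid clashing with the real declarations.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names, not glibc's: each legacy BSD call forwards to the
   C89 function that targets usually optimize. */
static inline void sketch_bzero(void *s, size_t n)
{
    memset(s, 0, n);
}

/* Note the argument order: bcopy takes (src, dest), and like memmove
   it must tolerate overlapping buffers. */
static inline void sketch_bcopy(const void *src, void *dest, size_t n)
{
    memmove(dest, src, n);
}

/* bcmp only promises zero or nonzero, so memcmp's ordered result is a
   valid (stricter) answer. */
static inline int sketch_bcmp(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n);
}
```

With wrappers of this kind, old code calling bzero or bcopy ends up in the
same optimized memset and memmove as everything else, whatever compiler or
options are used.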
> > Now the only remaining one to deal with is mempcpy - I'd like something like
> > this in string/strings2.h:
>
> Why? It's trivial enough for each memcpy implementation to implement
> mempcpy too, and for many implementations rolling it in might save an
> instruction or two over the generic addition. It doesn't seem worth
> the complexity to bother with anything in the header files.
OK, so the goal of many of the changes I've been making is as follows:
By default, GLIBC should provide the most efficient generic implementations
so that a new target is not forced to write a large number of optimized
assembler functions in order to get reasonable performance. Additionally,
given that all targets add optimized versions of a few key functions
(such as memcpy, memset, strlen), the generic code should use those whenever
feasible rather than less widely used variants.
Back to mempcpy: inlining it is not only simple and a good idea, it is
also the most efficient implementation. A separate optimized
implementation of mempcpy requires 1-2 extra instructions and increases
pressure on caches and branch predictors. Another approach would be to set
the return value at the start of memcpy so that mempcpy can jump past that
instruction. This means 1 extra instruction in every memcpy invocation plus
an extra branch for mempcpy. Neither option is clearly better than just
inlining, and that is before counting the effort to write and test mempcpy,
which could be spent on more important things. It appears most targets have
not bothered with mempcpy as a result.
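
As a minimal sketch of what inlining means here (not the actual patch; the
name sketch_mempcpy is hypothetical), mempcpy is simply memcpy followed by
one pointer addition:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical inline version: call the target's optimized memcpy, then
   return a pointer to the byte just past the copied data.  The addition
   happens in the caller, so memcpy itself is unchanged. */
static inline void *sketch_mempcpy(void *dest, const void *src, size_t n)
{
    return (char *) memcpy(dest, src, n) + n;
}
```

A typical use is chaining copies, where each call returns the position for
the next one, e.g. p = sketch_mempcpy(p, "foo", 3) followed by
p = sketch_mempcpy(p, "bar", 4) to build "foobar" with its terminator.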
So to me adding the inline version is a no-brainer and should have been done
a long time ago.
Wilco