This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: faster memset


Tim Prince wrote:
Eric Blake wrote:
Aaron J. Grier <aaron <at> frye.com> writes:

On Thu, May 22, 2008 at 04:56:54PM +0000, Eric Blake wrote:
My patched assembly is no longer sensitive to alignment, and always
gets the speed of 8-byte alignment.  This clinches it - for memset,
x86 assembly is noticeably faster than C.
have you done comparisons with the builtin memset() in recent versions
of gcc?


I was testing with gcc 3.4.4, which does have __builtin_memset. But my understanding is that __builtin_memset defers to the library function in cases it cannot optimize at compile time. At any rate, my test app called the library function via a function pointer. Does __builtin_memset even have an address that can be used via a function pointer?
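For reference, here is a minimal sketch of the kind of call through a function pointer described above. It is not the actual test app; the names memset_ptr and fill_via_pointer are made up for illustration, and the volatile qualifier is an assumption about how to keep the compiler from seeing through the pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name. volatile forces the pointer to be reloaded on every
 * call, so the compiler cannot prove the target is memset and expand the
 * call inline; the call should reach the library routine. */
static void *(*volatile memset_ptr)(void *, int, size_t) = memset;

/* Fill n bytes at dst with c through the pointer; returns dst like memset. */
void *fill_via_pointer(void *dst, int c, size_t n)
{
    return memset_ptr(dst, c, n);
}
```

A timing loop would then call fill_via_pointer(buf, 0, len) over a range of lengths and alignments.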


If I understand it correctly, __builtin_memset(ptr,0,8) is a good example of where the compiler optimization helps: it is faster to open-code two 32-bit writes than to call a function, and that beats anything I can code in assembly. But __builtin_memset(ptr,0,1000), even though 1000 is constant, needs so many open-coded assignments that the compiler probably falls back to the library routine anyway, trusting that the library knows more architecture tricks for efficiency than can be represented generically in gcc's builtin definition table. Finally, __builtin_memset(ptr,0,len) cannot be optimized, since len is not known at compile time, so the compiler must fall back on the library.
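The three cases above can be written out as a small sketch. The function names are hypothetical, and what the compiler actually emits for each one depends on the gcc version, target and flags, so the comments describe the expectation stated above rather than a guarantee. Compiling with gcc -O2 -S and reading the assembly shows what really happens.

```c
#include <stddef.h>
#include <string.h>

/* Small constant length: expected to be open-coded as a couple of stores. */
void clear_small(unsigned char *p)
{
    memset(p, 0, 8);
}

/* Large constant length: the compiler may well call the library routine. */
void clear_large(unsigned char *p)
{
    memset(p, 0, 1000);
}

/* Length known only at run time: nothing to specialize on at compile time. */
void clear_runtime(unsigned char *p, size_t len)
{
    memset(p, 0, len);
}
```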

In other words, by comparing against __builtin_memset, wouldn't I merely be comparing against my own implementation for most of the interesting cases?


gcc for i386 chooses the __builtin_memset where it recognizes possibilities to optimize code size. gcc x86_64 default configuration calls the library function, except for those few cases such as you mention where a small number of int operations is suitable. Only recently did glibc implement a memset() with good performance for long strings, agreed upon by developers for both AMD and Intel. So it would be interesting to compare with that implementation.
I was thinking more of memcpy() here, sorry. memset does have the quirk that it needs a strategy which switches to nontemporal stores when the string length approaches some large fraction of the cache size.
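A rough sketch of such a switch, assuming SSE2 intrinsics are available. This is not the glibc or newlib code: the name memset_sketch and the NT_THRESHOLD cutoff are made up for illustration, and a real routine would derive the cutoff from the actual cache size. Without SSE2 the sketch simply calls memset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cutoff; a real routine would tie this to the cache size. */
#define NT_THRESHOLD ((size_t)1 << 20)

void *memset_sketch(void *dst, int c, size_t n)
{
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD) {
        unsigned char *p = dst;
        /* Fill the unaligned head so p reaches a 16-byte boundary. */
        size_t head = (size_t)(-(uintptr_t)p & 15);
        memset(p, c, head);
        p += head;
        n -= head;
        /* Nontemporal 16-byte stores write around the cache. */
        __m128i v = _mm_set1_epi8((char)c);
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128((__m128i *)p, v);
        _mm_sfence();      /* order the streaming stores before returning */
        memset(p, c, n);   /* remaining tail, fewer than 16 bytes */
        return dst;
    }
#endif
    return memset(dst, c, n);   /* short fills: ordinary cached stores */
}
```

Whether the nontemporal path wins at a given length is something to measure on the target machine, not assume.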

