This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 4 Feb 2014 15:12:52 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <20140131171911 dot GA25609 at domone dot podge> <52EF931A dot 3000508 at redhat dot com> <20140203144305 dot GA14697 at domone dot podge> <52EFC0CD dot 6030201 at redhat dot com>
On Mon, Feb 03, 2014 at 05:16:13PM +0100, Florian Weimer wrote:
> On 02/03/2014 03:43 PM, Ondřej Bílka wrote:
>
> >And there is third factor that memcmp with small constant arguments
> >could be inlined. This is not case now but a patch would be welcome.
>
> Inlining memcmp in GCC has historically been a bad decision.
> Perhaps we could make an exception for memcmp calls with known
> alignment and really small sizes. In terms of GCC optimizations,
> dispatching to a few versions specialized for certain lengths, and a
> version that only delivers an unordered, boolean result promises
> significant wins as well.
>
The problem in gcc is that builtins are often badly optimized. A second
problem is that the expansion needs to be small, or you will lose when
you inline cold code.
Also, making that a builtin adds unnecessary complexity; adding these
conditions to a header is simpler.
In addition to constant sizes, when you know that the size is always
larger than 8 and an early mismatch is likely, you could use the inlined
version below.
There is no need for a specialized unordered case when you do a comparison;
gcc is smart enough to optimize these, as well as the memcmp(x,y,n) > 0
case. The following:
int foo (int x)
{
  if (x > 0) return 1;
  if (x < 0) return -1;
  return 0;
}

int bar (int x)
{
  if (foo (x))
    return 4;
  else
    return 2;
}
gets optimized to
bar:
.LFB1:
.cfi_startproc
cmpl $1, %edi
sbbl %eax, %eax
andl $-2, %eax
addl $4, %eax
ret
And the expansion that I talked about is below. I could make it
cross-platform with a check that unaligned loads are ok and that bswap
is reasonably fast.
#include <stdint.h>
#include <string.h>
#undef memcmp
#define memcmp(x, y, n) \
({ \
void *__x = x, *__y = y; \
size_t __n = n; \
int __ret; \
    if (__builtin_constant_p (__n) && __n >= 8)                   \
{ \
uint64_t __a = __builtin_bswap64(*((uint64_t *) __x)); \
uint64_t __b = __builtin_bswap64(*((uint64_t *) __y)); \
if (__a > __b) \
__ret = 1; \
else if (__a < __b) \
__ret = -1; \
else \
__ret = __memcmp (__x + 8, __y + 8, __n - 8); \
} \
else \
__ret = __memcmp (__x, __y, __n); \
__ret;\
})
int foo (char *x, char *y)
{
  if (memcmp (x, y, 10) > 0)
    return 15;
  else
    return 42;
}