This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Faster memset on x64.
- From: Andreas Jaeger <aj at suse dot com>
- To: libc-alpha at sourceware dot org
- Date: Thu, 09 May 2013 15:58:44 +0200
- Subject: Re: [PATCH] Faster memset on x64.
- References: <20130429181611 dot GA28442 at domone dot kolej dot mff dot cuni dot cz> <20130430154132 dot GA6521 at domone dot kolej dot mff dot cuni dot cz>
Intel and AMD developers, do you have any feedback on the performance of
this patch? Please provide it by Monday the 13th; otherwise we have
waited long enough on this one, and I think it can go in with minor nits.
Ondrej, consider this approved (after incorporating my comments below) and
commit on the 14th unless somebody vetoes.
On 04/30/2013 05:41 PM, Ondřej Bílka wrote:
The last post contained an older version. Here is the up-to-date version.
On Mon, Apr 29, 2013 at 08:16:11PM +0200, Ondřej Bílka wrote:
Hi,
this is the second part of what I worked on: memset.
It is basically memcpy from the previous patch with the loads replaced by a
constant. However, the control flow is a bit different, as memset receives
bigger inputs than memcpy (generator attached).
Performance characteristics are here:
http://kam.mff.cuni.cz/~ondra/memset_profile.html
When hooking gcc I got about a 10% improvement on most architectures.
As this implementation has simpler control flow without computed jumps,
it is faster by about 50 cycles on most architectures.
The exceptions are the old Core 2 and Athlon. There I need the assumption
that memset receives 16-byte-aligned inputs, which according to my profile
is true in 99% of cases.
I added an optimized __bzero as in my previous patch.
As I asked at
http://www.sourceware.org/ml/libc-alpha/2013-02/msg00213.html
I added a candidate __memset_tail function which I will use in strncpy.
The code is equivalent to:
char *
__memset_tail (char *x, int c, size_t n, char *ret)
{
  memset (x, c, n);
  return ret;
}
but it can be used to save call overhead by making this a tail call.
What could still be done is prefetching and nontemporal stores.
I do not know how to do prefetching that pays for itself on big inputs
more than it loses on small inputs. I currently do not use
nontemporal stores.
I have become more and more convinced that prefetching could be better
handled at the CPU level. Ideally I would want these in lower levels of
cache (say the first 10% of L1 cache), the rest in L2 cache (and so on
when it does not fit).
A possible implementation is the following: give the i-th cache line in a
stream weight 1/i. We add this line to the cache only if the replacement
candidate has a reuse frequency lower than that weight. If we do not add
it to the cache, we decrease the replacement candidate's frequency.
I could implement wmemset as in the commented-out code below. Is it wise
to keep it there, or should I move that to a separate patch?
OK for 2.18?
Ondra
* sysdeps/x86_64/memset.S (memset): New implementation.
(__bzero): Likewise.
(__memset_tail): New function.
---
sysdeps/x86_64/memset.S | 1414 ++++-------------------------------------------
1 files changed, 99 insertions(+), 1315 deletions(-)
diff --git a/sysdeps/x86_64/memset.S b/sysdeps/x86_64/memset.S
index b393efe..d7fc5fe 100644
--- a/sysdeps/x86_64/memset.S
+++ b/sysdeps/x86_64/memset.S
@@ -19,17 +19,41 @@
#include <sysdep.h>
-#define __STOS_LOWER_BOUNDARY $8192
-#define __STOS_UPPER_BOUNDARY $65536
+#ifndef ALIGN
+# define ALIGN(n) .p2align n
+#endif
.text
#if !defined NOT_IN_libc
ENTRY(__bzero)
- mov %rsi,%rdx /* Adjust parameter. */
- xorl %esi,%esi /* Fill with 0s. */
- jmp L(memset_entry)
+ movq %rdi, %rax # Set return value.
+ movq %rsi, %rdx # Set n.
Please use C-style comments everywhere, even in assembler.
+ pxor %xmm8, %xmm8
+ jmp L(entry_from_bzero)
END(__bzero)
weak_alias (__bzero, bzero)
+
+/* Like memset but takes additional parameter with return value. */
+ENTRY(__memset_tail)
+ movq %rcx, %rax # Set return value.
+
+ movd %esi, %xmm8
+ punpcklbw %xmm8, %xmm8
+ punpcklwd %xmm8, %xmm8
+ pshufd $0, %xmm8, %xmm8
+
+ jmp L(entry_from_bzero)
+END(__memset_tail)
+
+/*
Please remove this commented out code.
Andreas
--
Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126