This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86_32: memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk optimized with SSE2 unaligned loads/stores


On Fri, Jul 04, 2014 at 09:11:22PM +0400, Andrew Senkevich wrote:
> Hi,
> 
> these new functions are based on the new memcpy, which is a 32-bit analogue
> of the x86_64 SSE2 unaligned memcpy version.
> Benchmarked on Silvermont, Haswell, Ivy Bridge, Sandy Bridge and
> Westmere; performance results are attached.
> The testsuite was run on x86_64 with no new regressions.
> 
> Change log:
> 
> 2014-07-04  Andrew Senkevich  <andrew.n.senkevich@gmail.com>
> 
>         * sysdeps/i386/i686/multiarch/memcpy-sse2-unaligned.S: New file,
>         contains implementation optimized with sse2 unaligned loads/stores.
>         * sysdeps/i386/i686/multiarch/memmove-sse2-unaligned.S: Likewise.
>         * sysdeps/i386/i686/multiarch/mempcpy-sse2-unaligned.S: Likewise.
>         * sysdeps/i386/i686/multiarch/memcpy.S: Selection of new function if
>         bit_Fast_Unaligned_Load is set.
>         * sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
>         * sysdeps/i386/i686/multiarch/memmove.S: Likewise.
>         * sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
>         * sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
>         * sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
>         * sysdeps/i386/i686/multiarch/Makefile: Added new files to build.
>         * sysdeps/i386/i686/multiarch/ifunc-impl-list.c
>         (__libc_ifunc_impl_list): Added testing of new functions.

+
+ENTRY (MEMCPY)
+	ENTRANCE
+	movl	LEN(%esp), %ecx
+	movl	SRC(%esp), %eax
+	movl	DEST(%esp), %edx
+
+	cmp	%edx, %eax
+	je	L(return)
+
As the src == dest case is quite rare, this check would only slow down the implementation; drop it.
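To illustrate in C with SSE2 intrinsics (a sketch, not the patch's code): a copy that loads before it stores is already correct when src == dest, so the early return is a compare-and-branch that every caller pays for a case that almost never occurs.

#include <emmintrin.h>

/* Illustration only: "load, then store" already yields the right result
   when src == dest, so no explicit early-return check is needed.  */
static void
copy16 (void *dest, const void *src)
{
  __m128i v = _mm_loadu_si128 ((const __m128i *) src);	/* read first */
  _mm_storeu_si128 ((__m128i *) dest, v);		/* then write */
}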

+# ifdef USE_AS_MEMMOVE
+	jg	L(check_forward)
+
+	add	%ecx, %eax
+	cmp	%edx, %eax
+	movl	SRC(%esp), %eax
+	jle	L(forward)
+
Also, you do not need this check here: for sizes up to 128 bytes we first read the entire src and only then do the writes.
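A minimal sketch of that scheme in C with SSE2 intrinsics (my own names, not the patch's): for a 64..128-byte block, read the whole source into registers first and store afterwards, so any overlap, including src == dest, is handled without a direction check.

#include <emmintrin.h>
#include <stddef.h>

/* Copy n bytes, 64 <= n <= 128.  Every load completes before the first
   store, so overlapping buffers are handled correctly without any
   forward/backward test.  */
static void
copy_64_to_128 (unsigned char *dst, const unsigned char *src, size_t n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) src);
  __m128i b = _mm_loadu_si128 ((const __m128i *) (src + 16));
  __m128i c = _mm_loadu_si128 ((const __m128i *) (src + 32));
  __m128i d = _mm_loadu_si128 ((const __m128i *) (src + 48));
  __m128i e = _mm_loadu_si128 ((const __m128i *) (src + n - 64));
  __m128i f = _mm_loadu_si128 ((const __m128i *) (src + n - 48));
  __m128i g = _mm_loadu_si128 ((const __m128i *) (src + n - 32));
  __m128i h = _mm_loadu_si128 ((const __m128i *) (src + n - 16));

  _mm_storeu_si128 ((__m128i *) dst, a);
  _mm_storeu_si128 ((__m128i *) (dst + 16), b);
  _mm_storeu_si128 ((__m128i *) (dst + 32), c);
  _mm_storeu_si128 ((__m128i *) (dst + 48), d);
  _mm_storeu_si128 ((__m128i *) (dst + n - 64), e);
  _mm_storeu_si128 ((__m128i *) (dst + n - 48), f);
  _mm_storeu_si128 ((__m128i *) (dst + n - 32), g);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), h);
}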

snip

Does that prefetch improve performance? On x86_64 it harmed performance, and a 128-byte prefetch distance looks too small to matter.
+
+	prefetcht0 -128(%edi, %esi)
+
+	movdqu	-64(%edi, %esi), %xmm0
+	movdqu	-48(%edi, %esi), %xmm1
+	movdqu	-32(%edi, %esi), %xmm2
+	movdqu	-16(%edi, %esi), %xmm3
+	movdqa	%xmm0, -64(%edi)
+	movdqa	%xmm1, -48(%edi)
+	movdqa	%xmm2, -32(%edi)
+	movdqa	%xmm3, -16(%edi)
+	leal	-64(%edi), %edi
+	cmp	%edi, %ebx
+	jb	L(mm_main_loop_backward)
+L(mm_main_loop_backward_end):
+	POP (%edi)
+	POP (%esi)
+	jmp	L(mm_recalc_len)
+

+L(mm_recalc_len):
+/* Compute in %ecx how many bytes are left to copy after
+	the main loop stops.  */
+	movl	%ebx, %ecx
+	subl	%edx, %ecx
+	jmp	L(mm_len_0_or_more_backward)
+
That also looks slow, as it adds an unpredictable branch. On x86_64 we read the start and the end of the block into registers before the loop starts and write those registers out when it ends. If you align to 16 bytes instead of 64, you need only 4 registers to hold the end plus 4 working registers, and the 16 bytes at the start can be saved on the stack.
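A sketch of that approach in C with SSE2 intrinsics (assumed shape, not the actual x86_64 code), for a forward, non-overlapping copy: the first 16 bytes and the last 64 bytes are loaded before the main loop and stored unconditionally after it, so the exit path needs no length recalculation and no extra branch.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes, n > 128, dst and src non-overlapping.  */
static void
copy_large_forward (unsigned char *dst, const unsigned char *src, size_t n)
{
  /* Head and tail, read before the loop; with only 8 XMM registers on
     i386 the 16-byte head would be spilled to the stack, as suggested
     above.  */
  __m128i head  = _mm_loadu_si128 ((const __m128i *) src);
  __m128i tail0 = _mm_loadu_si128 ((const __m128i *) (src + n - 64));
  __m128i tail1 = _mm_loadu_si128 ((const __m128i *) (src + n - 48));
  __m128i tail2 = _mm_loadu_si128 ((const __m128i *) (src + n - 32));
  __m128i tail3 = _mm_loadu_si128 ((const __m128i *) (src + n - 16));

  /* Align the main-loop destination to 16 bytes.  */
  size_t skew = 16 - ((uintptr_t) dst & 15);
  unsigned char *d = dst + skew;
  const unsigned char *s = src + skew;
  unsigned char *d_stop = dst + n - 64;	/* stop before the saved tail */

  while (d < d_stop)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) s);
      __m128i b = _mm_loadu_si128 ((const __m128i *) (s + 16));
      __m128i c = _mm_loadu_si128 ((const __m128i *) (s + 32));
      __m128i e = _mm_loadu_si128 ((const __m128i *) (s + 48));
      _mm_store_si128 ((__m128i *) d, a);	/* aligned stores */
      _mm_store_si128 ((__m128i *) (d + 16), b);
      _mm_store_si128 ((__m128i *) (d + 32), c);
      _mm_store_si128 ((__m128i *) (d + 48), e);
      d += 64;
      s += 64;
    }

  /* Flush head and tail; overwriting some of the loop's stores is
     harmless because the data is identical.  */
  _mm_storeu_si128 ((__m128i *) dst, head);
  _mm_storeu_si128 ((__m128i *) (dst + n - 64), tail0);
  _mm_storeu_si128 ((__m128i *) (dst + n - 48), tail1);
  _mm_storeu_si128 ((__m128i *) (dst + n - 32), tail2);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), tail3);
}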


+	movdqu	%xmm0, (%edx)
+	movdqu	%xmm1, 16(%edx)
+	movdqu	%xmm2, 32(%edx)
+	movdqu	%xmm3, 48(%edx)
+	movdqa	%xmm4, (%edi)
+	movaps	%xmm5, 16(%edi)
+	movaps	%xmm6, 32(%edi)
+	movaps	%xmm7, 48(%edi)
Why did you use floating-point moves (movaps) here rather than the integer movdqa used above?


+
+/* We should stop two iterations before the termination
+	(in order not to misprefetch).  */
+	subl	$64, %ecx
+	cmpl	%ebx, %ecx
+	je	L(main_loop_just_one_iteration)
+
+	subl	$64, %ecx
+	cmpl	%ebx, %ecx
+	je	L(main_loop_last_two_iterations)
+
Same comment as above: prefetching is unlikely to help here, so you need to show that it does by benchmarking against a variant that omits it.

+
+	.p2align 4
+L(main_loop_large_page):

However, here prefetching should help, as the copy is sufficiently large; the loads could also be made non-temporal.
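For illustration, one common shape for that loop in C with SSE2 intrinsics (a sketch under my own assumptions, not the code under review): prefetch a few cache lines ahead and use streaming stores so a very large copy does not displace the whole cache. SSE2 itself only provides non-temporal stores (movntdq), so that is what the sketch uses.

#include <emmintrin.h>
#include <stddef.h>

/* Copy n bytes, dst 16-byte aligned, buffers non-overlapping, n large.  */
static void
copy_huge (unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t i;
  for (i = 0; i + 64 <= n; i += 64)
    {
      /* Prefetch well ahead; the hint is harmless if it runs past the end.  */
      _mm_prefetch ((const char *) (src + i + 512), _MM_HINT_T0);

      __m128i a = _mm_loadu_si128 ((const __m128i *) (src + i));
      __m128i b = _mm_loadu_si128 ((const __m128i *) (src + i + 16));
      __m128i c = _mm_loadu_si128 ((const __m128i *) (src + i + 32));
      __m128i d = _mm_loadu_si128 ((const __m128i *) (src + i + 48));

      _mm_stream_si128 ((__m128i *) (dst + i), a);	/* movntdq */
      _mm_stream_si128 ((__m128i *) (dst + i + 16), b);
      _mm_stream_si128 ((__m128i *) (dst + i + 32), c);
      _mm_stream_si128 ((__m128i *) (dst + i + 48), d);
    }
  _mm_sfence ();	/* order the streaming stores */

  /* Handle the remaining n % 64 bytes with a simple byte loop.  */
  for (; i < n; i++)
    dst[i] = src[i];
}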



