This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 3/3] Add i386 memset and memcpy assembly functions
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 26 Aug 2015 16:29:24 +0200
- Subject: Re: [PATCH 3/3] Add i386 memset and memcpy assembly functions
- Authentication-results: sourceware.org; auth=none
- References: <20150826134631 dot GC19484 at gmail dot com>
On Wed, Aug 26, 2015 at 06:46:31AM -0700, H.J. Lu wrote:
> Add i386 memset and memcpy assembly functions with REP MOVSB/STOSB
> instructions. They will be used to implement i386 multi-arch memcpy.
>
> OK for master?
>
No, as rep stosb has terrible performance on most machines; on Ivy
Bridge it is around six times slower than rep stosq. I wouldn't be
surprised if, when you test it on affected machines, it turns out to be
at least three times slower than rep stosl.
The only exception I know of where you should use rep stosb is Haswell.
Perhaps you could adapt this implementation that I used for rep stosq
and change it to rep stosl?
	.text
	.globl memset_rep8
	.type memset_rep8, @function
memset_rep8:
	.cfi_startproc
	movzbl %sil, %eax	# eax = fill byte, zero-extended
	lea (%rdi, %rdx), %rcx	# rcx = one past the end of the buffer
	movabsq $72340172838076673, %rsi	# 0x0101010101010101
	imulq %rsi, %rax	# broadcast the byte into all 8 lanes of rax
	cmp $7, %rdx
	jbe .Lless_16_bytes
	movq %rax, (%rdi)	# store first 8 bytes (possibly unaligned)
	movq %rdi, %rsi	# save original destination for return value
	leaq 8(%rdi), %rdi
	movq %rax, -8(%rcx)	# store last 8 bytes (possibly unaligned)
	andq $-8, %rdi	# round destination up to 8-byte alignment
	subq %rdi, %rcx
	shrq $3, %rcx	# rcx = number of aligned qwords to fill
	rep stosq
	movq %rsi, %rax	# return the original destination
	ret
	.p2align 4
.Lless_16_bytes:
	movq %rax, %rsi	# rsi = broadcast pattern
	movq %rdi, %rax	# return value = destination
	testb $4, %dl
	jne .Lbetween_4_7_bytes
	cmp $1, %dl
	jbe .Lbetween_0_1_byte
	movw %si, -2(%rcx)	# 2-3 bytes: overlapping word + byte stores
	movb %sil, (%rdi)
	ret
	.p2align 3
.Lbetween_4_7_bytes:
	movl %esi, (%rdi)	# 4-7 bytes: overlapping dword stores
	movl %esi, -4(%rcx)
	ret
.Lbetween_0_1_byte:
	jb .Lzero_byte	# size 0: nothing to store
	movb %sil, (%rdi)
.Lzero_byte:
	ret
	.cfi_endproc
	.size memset_rep8, .-memset_rep8
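For reference, here is a rough C sketch of the same shape adapted to 32-bit words, as one might do for i386. The function name `memset_rep4` is mine, not part of any patch, and the aligned middle loop is written as a plain word loop (that loop is where `rep stosl` would go) so the sketch compiles and runs anywhere:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the algorithm above with 8-byte stores
   narrowed to 4-byte stores: broadcast the fill byte into a 32-bit
   pattern, store one possibly-unaligned word at each end, then fill
   the aligned middle (the part rep stosl would handle on i386).  */
static void *memset_rep4(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;
    uint32_t pat = 0x01010101u * (unsigned char)c;  /* broadcast byte */

    if (n <= 3) {                       /* small sizes: overlapping stores */
        if (n == 0)
            return dst;
        p[0] = (unsigned char)c;
        if (n > 1) {
            p[1] = (unsigned char)c;
            end[-1] = (unsigned char)c;
        }
        return dst;
    }

    memcpy(p, &pat, 4);                 /* head word, possibly unaligned */
    memcpy(end - 4, &pat, 4);           /* tail word, possibly unaligned */

    /* Round up to the next 4-byte boundary; bytes before it are
       already covered by the head store.  */
    uint32_t *w = (uint32_t *)(((uintptr_t)p + 4) & ~(uintptr_t)3);
    size_t words = (size_t)((uintptr_t)end - (uintptr_t)w) / 4;
    while (words--)                     /* this loop stands in for rep stosl */
        *w++ = pat;
    return dst;
}
```

As in the assembly above, the head/tail stores make the middle loop free to round its bounds to alignment without worrying about leftover bytes, since any overlap just rewrites the same pattern.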