This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: ling dot ma dot program at gmail dot com
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 15 May 2014 22:14:58 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com> <20140513173616 dot GC5047 at domone dot podge>
Correction to the following:
On Tue, May 13, 2014 at 07:36:16PM +0200, Ondřej Bílka wrote:
> > + ALIGN(4)
> > +L(gobble_data):
> > +#ifdef SHARED_CACHE_SIZE_HALF
> > + mov $SHARED_CACHE_SIZE_HALF, %r9
> > +#else
> > + mov __x86_shared_cache_size_half(%rip), %r9
> > +#endif
> > + shl $4, %r9
> > + cmp %r9, %rdx
> > + ja L(gobble_big_data)
> > + mov %rax, %r9
> > + mov %esi, %eax
> > + mov %rdx, %rcx
> > + rep stosb
> > + mov %r9, %rax
> > + vzeroupper
> > + ret
> > +
> > + ALIGN(4)
> > +L(gobble_big_data):
> > + sub $0x80, %rdx
> > +L(gobble_big_data_loop):
> > + vmovntdq %ymm0, (%rdi)
> > + vmovntdq %ymm0, 0x20(%rdi)
> > + vmovntdq %ymm0, 0x40(%rdi)
> > + vmovntdq %ymm0, 0x60(%rdi)
> > + lea 0x80(%rdi), %rdi
> > + sub $0x80, %rdx
> > + jae L(gobble_big_data_loop)
> > + vmovups %ymm0, -0x80(%r8)
> > + vmovups %ymm0, -0x60(%r8)
> > + vmovups %ymm0, -0x40(%r8)
> > + vmovups %ymm0, -0x20(%r8)
> > + vzeroupper
> > + sfence
> > + ret
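The quoted code dispatches on total size: the shared-cache-size-half value is shifted left by 4, so sizes up to 8x the shared cache size take the rep stosb path and only larger ones take the non-temporal loop. A minimal C sketch of that dispatch logic, with an assumed cache size and ordinary byte stores standing in for the vmovntdq stores:

```c
#include <stddef.h>
#include <string.h>

/* Assumed value for illustration; the real code reads
   __x86_shared_cache_size_half (or a compile-time constant). */
static size_t shared_cache_size_half = 4 * 1024 * 1024;

void *memset_sketch(void *dst, int c, size_t n)
{
    /* shl $4, %r9: threshold = 16 * (cache/2) = 8 * shared cache size */
    size_t threshold = shared_cache_size_half << 4;
    if (n <= threshold) {
        /* corresponds to the rep stosb path */
        return memset(dst, c, n);
    }
    /* corresponds to L(gobble_big_data): the assembly uses vmovntdq
       (non-temporal 32-byte stores) plus a trailing sfence; plain
       byte stores stand in for it here. */
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)c;
    return dst;
}
```

The point of the non-temporal path is that stores bypass the cache, which only pays off once the buffer is far larger than the last-level cache.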
>
> That loop does not seem to help on haswell at all; it is indistinguishable
> from the rep stosb loop above. I used the following benchmark to check that
> with different sizes, but performance stayed the same.
>
> #include <stdlib.h>
> #include <string.h>
> int main(){
>   int i;
>   char *x = malloc(100000000);
>   for (i = 0; i < 100; i++)
>     MEMSET(x, 0, 100000000);
> }
>
>
> for I in `seq 1 10`; do
> echo avx
> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> echo rep
> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> done
Sorry, I forgot that __memset_rep also has a branch for large inputs, so
what I wrote was wrong.
I retested with a fixed rep stosq and your loop is around 20% slower on a
similar test, so it is better to remove that loop.
$ gcc big.c -o big
$ time LD_PRELOAD=./memset-avx2.so ./big
real 0m0.076s
user 0m0.066s
sys 0m0.010s
$ time LD_PRELOAD=./memset_rep.so ./big
real 0m0.063s
user 0m0.042s
sys 0m0.021s
To be sure, I used a different benchmark; it can be downloaded here. Run the
commands above in that directory.
http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
For a different implementation you need to create a .so with a memset
function; there is a script, compile, that compiles all .s files provided
that the first line is of the shape
# arch_requirement function_name color
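As a plain-C stand-in for those .s files, a preloadable shared object exporting memset can look like this; the file name and the naive byte loop are illustrative only, not one of the benchmark's actual implementations:

```c
/* memset_naive.c -- illustrative only. Build as a shared object with:
 *   gcc -O2 -fPIC -shared memset_naive.c -o memset_naive.so
 * then run, e.g.:
 *   time LD_PRELOAD=./memset_naive.so ./big
 */
#include <stddef.h>

void *memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    while (n--)
        *p++ = (unsigned char)c;
    return s;
}
```

LD_PRELOAD interposes this definition ahead of the libc one, which is how the timing runs above swap memset implementations without relinking.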