This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction


Hi all,

Here is the latest memset patch: http://www.yunos.org/tmp/memset-avx2.patch

When I sent the patch with git-send-email, libc-alpha@sourceware.org refused
to show it. Sorry for the inconvenience.

Thanks
Ling


2014-05-16 4:14 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
> Correction for the following:
>
>> On Tue, May 13, 2014 at 07:36:16PM +0200, Ondřej Bílka wrote:
>> > +	ALIGN(4)
>> > +L(gobble_data):
>> > +#ifdef SHARED_CACHE_SIZE_HALF
>> > +	mov	$SHARED_CACHE_SIZE_HALF, %r9
>> > +#else
>> > +	mov	__x86_shared_cache_size_half(%rip), %r9
>> > +#endif
>> > +	shl	$4, %r9
>> > +	cmp	%r9, %rdx
>> > +	ja	L(gobble_big_data)
>> > +	mov	%rax, %r9
>> > +	mov	%esi, %eax
>> > +	mov	%rdx, %rcx
>> > +	rep	stosb
>> > +	mov	%r9, %rax
>> > +	vzeroupper
>> > +	ret
>> > +
>> > +	ALIGN(4)
>> > +L(gobble_big_data):
>> > +	sub	$0x80, %rdx
>> > +L(gobble_big_data_loop):
>> > +	vmovntdq	%ymm0, (%rdi)
>> > +	vmovntdq	%ymm0, 0x20(%rdi)
>> > +	vmovntdq	%ymm0, 0x40(%rdi)
>> > +	vmovntdq	%ymm0, 0x60(%rdi)
>> > +	lea	0x80(%rdi), %rdi
>> > +	sub	$0x80, %rdx
>> > +	jae	L(gobble_big_data_loop)
>> > +	vmovups	%ymm0, -0x80(%r8)
>> > +	vmovups	%ymm0, -0x60(%r8)
>> > +	vmovups	%ymm0, -0x40(%r8)
>> > +	vmovups	%ymm0, -0x20(%r8)
>> > +	vzeroupper
>> > +	sfence
>> > +	ret
>>
>> That loop does not seem to help on Haswell at all; it is indistinguishable
>> from the rep stosb loop above. I used the following benchmark to check it
>> with different sizes, but performance stayed the same.
>>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> int main (void)
>> {
>>   int i;
>>   char *x = malloc (100000000);
>>   /* Fill 100 MB a hundred times; MEMSET is chosen with -D on the gcc
>>      command line below.  */
>>   for (i = 0; i < 100; i++)
>>     MEMSET (x, 0, 100000000);
>>   return 0;
>> }
>>
>>
>> for I in `seq 1 10`; do
>> echo avx
>> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
>> time LD_LIBRARY_PATH=. ./a.out
>> echo rep
>> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
>> time LD_LIBRARY_PATH=. ./a.out
>> done
>
> Sorry, I forgot that __memset_rep also has a branch for large inputs, so
> what I wrote was wrong.
>
> I retested it with a fixed rep stosq and your loop is around 20% slower on
> a similar test, so it is better to remove that loop.
>
> $ gcc big.c -o big
> $ time LD_PRELOAD=./memset-avx2.so ./big
>
> real    0m0.076s
> user    0m0.066s
> sys     0m0.010s
>
> $ time LD_PRELOAD=./memset_rep.so ./big
>
> real    0m0.063s
> user    0m0.042s
> sys     0m0.021s
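For reference, a "rep stosq" fill of the kind mentioned above looks roughly
like the sketch below. This is illustrative only, not the actual __memset_rep
code; it assumes the usual memset arguments (%rdi = dest, %esi = value,
%rdx = length) and does not bother aligning the destination.

	movzbl	%sil, %eax
	mov	$0x0101010101010101, %r9
	imul	%r9, %rax	/* replicate the byte into all eight lanes */
	mov	%rdi, %r8	/* memset returns the original dest */
	mov	%rdx, %rcx
	shr	$3, %rcx
	rep	stosq		/* store length/8 quadwords */
	mov	%rdx, %rcx
	and	$7, %rcx
	rep	stosb		/* store the remaining 0-7 bytes */
	mov	%r8, %rax
	ret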
>
> I used a different benchmark to be sure; it can be downloaded here, and the
> commands above can be run in that directory.
>
> http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
>
> For a different implementation you need to create a .so that provides a
> memset function; there is a compile script that compiles all .s files,
> provided that the first line has the form
>
> # arch_requirement function_name color
>
>
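As an example of that first-line header, a hypothetical memset-avx2.s entry
for the benchmark might begin with a line such as the following (the arch tag
and color value are made up here; the strings the compile script actually
accepts are defined in the tarball above):

	# avx2 memset_avx2 green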

