This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: ling dot ma dot program at gmail dot com
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 15 May 2014 22:14:58 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- Authentication-results: sourceware.org; auth=none
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com> <20140513173616 dot GC5047 at domone dot podge>
Correction to the following:
On Tue, May 13, 2014 at 07:36:16PM +0200, Ondřej Bílka wrote:
> > + ALIGN(4)
> > +L(gobble_data):
> > +#ifdef SHARED_CACHE_SIZE_HALF
> > + mov $SHARED_CACHE_SIZE_HALF, %r9
> > +#else
> > + mov __x86_shared_cache_size_half(%rip), %r9
> > +#endif
> > + shl $4, %r9
> > + cmp %r9, %rdx
> > + ja L(gobble_big_data)
> > + mov %rax, %r9
> > + mov %esi, %eax
> > + mov %rdx, %rcx
> > + rep stosb
> > + mov %r9, %rax
> > + vzeroupper
> > + ret
> > +
> > + ALIGN(4)
> > +L(gobble_big_data):
> > + sub $0x80, %rdx
> > +L(gobble_big_data_loop):
> > + vmovntdq %ymm0, (%rdi)
> > + vmovntdq %ymm0, 0x20(%rdi)
> > + vmovntdq %ymm0, 0x40(%rdi)
> > + vmovntdq %ymm0, 0x60(%rdi)
> > + lea 0x80(%rdi), %rdi
> > + sub $0x80, %rdx
> > + jae L(gobble_big_data_loop)
> > + vmovups %ymm0, -0x80(%r8)
> > + vmovups %ymm0, -0x60(%r8)
> > + vmovups %ymm0, -0x40(%r8)
> > + vmovups %ymm0, -0x20(%r8)
> > + vzeroupper
> > + sfence
> > + ret
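The quoted code dispatches on total size: the shared-cache-size-half value is shifted left by 4, so sizes up to 8x the shared cache size take the rep stosb path and only larger ones take the non-temporal loop. A minimal C sketch of that dispatch logic, with an assumed cache size and ordinary byte stores standing in for the vmovntdq stores:

```c
#include <stddef.h>
#include <string.h>

/* Assumed value for illustration; the real code reads
   __x86_shared_cache_size_half (or a compile-time constant). */
static size_t shared_cache_size_half = 4 * 1024 * 1024;

void *memset_sketch(void *dst, int c, size_t n)
{
    /* shl $4, %r9: threshold = 16 * (cache/2) = 8 * shared cache size */
    size_t threshold = shared_cache_size_half << 4;
    if (n <= threshold) {
        /* corresponds to the rep stosb path */
        return memset(dst, c, n);
    }
    /* corresponds to L(gobble_big_data): the assembly uses vmovntdq
       (non-temporal 32-byte stores) plus a trailing sfence; plain
       byte stores stand in for it here. */
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)c;
    return dst;
}
```

The point of the non-temporal path is that stores bypass the cache, which only pays off once the buffer is far larger than the last-level cache.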
>
> That loop does not seem to help on haswell at all; it is indistinguishable
> from the rep stosb loop above. I used the following benchmark to check that
> with different sizes, but performance stayed the same.
>
> #include <stdlib.h>
> #include <string.h>
> int main(){
>   int i;
>   char *x = malloc(100000000);
>   for (i = 0; i < 100; i++)
>     MEMSET(x, 0, 100000000);
> }
>
>
> for I in `seq 1 10`; do
> echo avx
> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> echo rep
> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> done
Sorry, I forgot that __memset_rep also has a branch for large inputs, so
what I wrote was wrong.
I retested with a fixed rep stosq and your loop is around 20% slower on a
similar test, so it is better to remove that loop.
$ gcc big.c -o big
$ time LD_PRELOAD=./memset-avx2.so ./big
real 0m0.076s
user 0m0.066s
sys 0m0.010s
$ time LD_PRELOAD=./memset_rep.so ./big
real 0m0.063s
user 0m0.042s
sys 0m0.021s
To be sure, I used a different benchmark; it can be downloaded here. Run the
commands above in that directory.
http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
For a different implementation you need to create a .so with a memset
function; there is a script, compile, that compiles all .s files provided
that the first line is of the shape
# arch_requirement function_name color
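As a plain-C stand-in for those .s files, a preloadable shared object exporting memset can look like this; the file name and the naive byte loop are illustrative only, not one of the benchmark's actual implementations:

```c
/* memset_naive.c -- illustrative only. Build as a shared object with:
 *   gcc -O2 -fPIC -shared memset_naive.c -o memset_naive.so
 * then run, e.g.:
 *   time LD_PRELOAD=./memset_naive.so ./big
 */
#include <stddef.h>

void *memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    while (n--)
        *p++ = (unsigned char)c;
    return s;
}
```

LD_PRELOAD interposes this definition ahead of the libc one, which is how the timing runs above swap memset implementations without relinking.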