This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction


Thanks for your correction!
Until today I have always been using test-memcpy.c from glibc to check and compare performance;
based on it we picked the best result and sent out our patch. Should we discard it now?
Soon I will test those functions with your profile and other release versions.
If I am wrong, please correct me.

Thanks
Ling


-----Original Message-----
From: Ondřej Bílka [mailto:neleai@seznam.cz]
Sent: June 6, 2013 20:55
To: Ling Ma
Cc: libc-alpha@sourceware.org; Ling
Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction

On Thu, Jun 06, 2013 at 06:07:51PM +0800, Ling Ma wrote:
> Hi Ondra
> I attached results as below:
> 1) gcc-test-memcpy-output: it compares memcpy implementations including your
> memcpy_new, memcpy_sse2_unaligned, memcpy_ssse3_back, memcpy_ssse3,
> memcpy_vzeroupper_avx2 (I added the vzeroupper instruction to memcpy_avx2),
> and memcpy_avx2. The format is from glibc's test-memcpy.c.
> 
> 2) results-no-vzeroupper.tar.bz2: it outputs comparison results
> including memcpy_avx2 without vzeroupper, as you suggested.
> 
> 3) results-vzeroupper.tar.bz2: it outputs comparison results
> including memcpy_vzeroupper_avx2, as you suggested.
> 
> If you have any questions, please let me know.
> 
These results show that your patch is 35% slower for gcc; see the following lines.

Time ratio to fastest:
  memcpy_glibc:     134.517062%
  memcpy_new_small: 100.000000%
  memcpy_new:       101.120206%
  __memcpy_avx2:    136.926079%

It is about the same as glibc was, because the header has a big overhead due to the computed gotos there.
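
To illustrate what I mean (my own sketch, not the benchmarked code): a computed-goto header dispatches small sizes through a jump table, and that indirect branch is the expensive part. This uses the GNU C labels-as-values extension; the function and label names here are mine.

  #include <stddef.h>

  /* Hypothetical sketch of a computed-goto size dispatch (GNU C).
     Handles n <= 3; the indirect jump is the costly part.  */
  static void
  copy_small (char *dst, const char *src, size_t n)
  {
    static const void *const table[] = { &&b0, &&b1, &&b2, &&b3 };
    goto *table[n];              /* computed goto: indirect branch */
   b3: dst[2] = src[2];          /* fall through */
   b2: dst[1] = src[1];
   b1: dst[0] = src[0];
   b0: return;
  }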

I forgot to mention that in the result.html file you can switch between two
modes:

  byte  - shown according to the number of bytes
  block - shown according to the number of aligned 16-byte blocks that need to be written
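
As a worked example of the block metric (my reading of it, an assumption about how the benchmark counts): an aligned 6400-byte write touches 6400 / 16 = 400 blocks, the threshold mentioned below.

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical sketch: number of aligned 16-byte blocks touched
     by a write of len bytes at dst.  */
  static size_t
  blocks_written (const char *dst, size_t len)
  {
    uintptr_t first = (uintptr_t) dst & ~(uintptr_t) 15;           /* round down */
    uintptr_t last = ((uintptr_t) dst + len + 15) & ~(uintptr_t) 15; /* round up */
    return (last - first) / 16;
  }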

In results_rand/result.html, switched to block mode, I see that your code starts being faster from 400 blocks (6400 bytes) onward. At sizes that large you soon exhaust the L1 cache, and the results in results_rand_L2/result.html are much closer.

It looks like the best way is to use my unaligned header and add an avx2 loop there.
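
Roughly, that combination could look like the following (a minimal sketch under my own assumptions, not the actual glibc code: it assumes n >= 64 and non-overlapping buffers, and a real memcpy needs small-size and overlap handling on top of this):

  #include <immintrin.h>
  #include <stddef.h>

  /* The unaligned header covers both 32-byte ends; an AVX2 loop
     copies the middle.  The last loop iteration may overlap the
     tail store, which is harmless for non-overlapping buffers.  */
  static void
  copy_avx2_sketch (char *dst, const char *src, size_t n)
  {
    __m256i head = _mm256_loadu_si256 ((const __m256i *) src);
    __m256i tail = _mm256_loadu_si256 ((const __m256i *) (src + n - 32));

    _mm256_storeu_si256 ((__m256i *) dst, head);
    for (size_t i = 32; i < n - 32; i += 32)
      _mm256_storeu_si256 ((__m256i *) (dst + i),
                           _mm256_loadu_si256 ((const __m256i *) (src + i)));
    _mm256_storeu_si256 ((__m256i *) (dst + n - 32), tail);

    _mm256_zeroupper ();  /* avoid AVX-to-SSE transition penalties */
  }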

I generated my file from variant/memcpy_new.c in the benchmark with the command

  gcc-4.7 -g -O3 -fPIC -Ivariant variant/memcpy_new.c -S

followed by some manual optimization.

You could change the loop in the C file and then retest.

Ondra



