Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling.ma.program@gmail.com>
- To: Ondřej Bílka <neleai@seznam.cz>
- Cc: libc-alpha@sourceware.org, aj@suse.com, liubov.dmitrieva@gmail.com, Ma Ling <ling.ml@alibaba-inc.com>
- Date: Tue, 30 Jul 2013 13:35:49 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling.ma.program@gmail.com> <20130729171927.GA12218@domone.kolej.mff.cuni.cz> <CAOGi=dNY9KP_OdGNW79iLiCHu4L=8fCNFg=ZZpMiRFN0CHJZ1g@mail.gmail.com> <20130730044925.GA6890@domone.kolej.mff.cuni.cz>
>> >> +L(less_128bytes):
>> >> + xor %esi, %esi
>> >> + mov %ecx, %esi
>> > And this? A C equivalent of this is
>> > x = 0;
>> > x = y;
>> Ling: we used mov %sil, %cl in the code above, and now %esi becomes
>> a destination register (mov %ecx, %esi), so there is a
>> false-dependence hazard. We use xor r1, r1 to ask the decode stage
>> to break the dependence; inside the pipeline the xor r1, r1 is
>> removed before it enters the execution stage.
>>
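For reference, a minimal standalone sketch of the zero idiom under
discussion (GAS/AT&T syntax, SysV ABI; the function name and register
choices are illustrative, not from the patch):

        .text
        .globl  zero_idiom_demo
zero_idiom_demo:
        xor     %eax, %eax      # recognized zero idiom: handled at
                                # rename/decode, breaks the dependence on
                                # the old value, no uop reaches execution
        mov     %esi, %eax      # ...but this 32-bit mov overwrites %eax
                                # anyway and also starts a fresh value,
                                # which is why the xor is questioned as
                                # redundant below
        ret                     # returns its second integer argument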
> That is pointless as mov breaks false dependencies.
>
> Anyway, the code you use is redundant. You already have that value
> computed, so a simple movq %xmm0, %rcx will do the job.
Ling: Usually the rename stage can resolve most WAR and WAW hazards,
but here we use %sil instead of %esi, which involves a partial-register
access.
I remember that moving xmm0 to r32/r64 causes a cross-domain operation;
it is not good on Nehalem, and I may test whether the penalty still
exists on Haswell.
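To make both effects concrete, a hypothetical fragment (GAS/AT&T; the
function name and register choices are mine, and the values involved
are irrelevant):

        .text
        .globl  partial_reg_demo
partial_reg_demo:
        mov     %sil, %cl       # partial write: only the low 8 bits of %rcx
        mov     %ecx, %eax      # full-width read must merge the new %cl with
                                # the stale upper bits of %ecx: the
                                # partial-register access mentioned above
        movq    %xmm0, %rdx     # xmm -> GPR move (movq is the actual
                                # mnemonic): crosses from the SIMD domain to
                                # the integer domain, adding bypass latency
                                # on Nehalem
        ret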
>
>
>> >> + ja L(gobble_big_data)
>> >> + mov %rax, %r9
>> >> + mov %esi, %eax
>> >> + mov %rdx, %rcx
>> >> + rep stosb
>> >> + mov %r9, %rax
>> >> + vzeroupper
>> >> + ret
>> >> +
>> > Redundant vzeroupper.
>> Ling: we touched ymm0 before we reach this place:
>> + vinserti128 $1, %xmm0, %ymm0, %ymm0
>> + vmovups %ymm0, (%rdi)
>> so we have to clean up the upper part of ymm0; otherwise the
>> following xmm0 operations would be hit by the SAVE penalty.
>>
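A minimal sketch of the transition being discussed (GAS/AT&T; assumes
%rdi points at a writable 32-byte buffer; the function name is mine):

        .text
        .globl  avx_head_store_demo
avx_head_store_demo:
        vinserti128 $1, %xmm0, %ymm0, %ymm0   # duplicates xmm0 into both
                                              # lanes: the upper half of
                                              # ymm0 is now dirty
        vmovups %ymm0, (%rdi)                 # the 32-byte store from the
                                              # patch
        vzeroupper                            # zeroes the upper halves of
                                              # all ymm registers so later
                                              # legacy-SSE xmm code avoids
                                              # the AVX<->SSE transition
                                              # (SAVE) penalty
        ret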
> You do not need that. The relevant code is:
>
> +L(256bytesormore):
> + vinserti128 $1, %xmm0, %ymm0, %ymm0
> + vmovups %ymm0, (%rdi)
> + mov %rdi, %r9
> + and $-0x20, %rdi
> + add $32, %rdi
> + sub %rdi, %r9
> + add %r9, %rdx
> + cmp $4096, %rdx
> + ja L(gobble_data)
>
> A simple reshuffling avoids that; again:
>
> + cmp $4096, %rdx
> + ja L(gobble_data)
> + vinserti128 $1, %xmm0, %ymm0, %ymm0
> + vmovups %ymm0, (%rdi)
> + mov %rdi, %r9
> + and $-0x20, %rdi
> + add $32, %rdi
> + sub %rdi, %r9
> + add %r9, %rdx
Ling: we need to use ymm0 to make the destination address 32-byte
aligned, which is a big help for the rep stosb instruction.
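To spell out the alignment arithmetic, here is the tail of the quoted
sequence again with comments added (the instructions are from the patch
above; the comments are my reading of them):

        vmovups %ymm0, (%rdi)   # unconditional 32-byte store covers the
                                # possibly-unaligned head of the buffer
        mov     %rdi, %r9       # r9 = original destination
        and     $-0x20, %rdi    # round down to a 32-byte boundary...
        add     $32, %rdi       # ...then up to the next one; the 1..32 head
                                # bytes skipped were already written above
        sub     %rdi, %r9       # r9 = old - new = -(bytes skipped)
        add     %r9, %rdx       # shrink the remaining count to match; %rdi
                                # is now 32-byte aligned for rep stosb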