This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PING^2][PATCH neleai/string-x64] Improve strcpy sse2 and avx2 implementation


On Wed, Jun 24, 2015 at 10:13:31AM +0200, Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote:
> > Hi,
> > 
> > I wrote a new strcpy for x64 and for some reason thought that I had
> > committed it, and so I forgot to ping it.
> > 
> > As there are other routines that I could improve, I will use the
> > branch neleai/string-x64 to collect them.
> > 
> > Here is a revised version of what I submitted in 2013. The main change
> > is that I now target the i7 instead of the core2. That simplifies
> > things, as unaligned loads are cheap rather than a bit slower than
> > aligned ones as on the core2. That mainly concerns the header: on the
> > core2 you could get better performance by aligning loads or stores to
> > 16 bytes after the first bytes were read. I do not know which is
> > better; I would need to test it.
> > 
> > That also makes an ssse3 variant less important to support. I could
> > send it, but it was one item on my TODO list that has now probably
> > lost importance. The problem is that on x64, to align with ssse3 or
> > sse2 plus shifts, you need 16 loops, one per alignment, as there is no
> > variable shift. It also needs to use a jump table, which is very
> > expensive. For strcpy that is dubious, as it increases instruction
> > cache pressure and most copies are small. You would also need to
> > switch from unaligned loads to aligned ones, and I would need
> > profiling to select the correct threshold.
> > 
> > If somebody is interested in optimizing old Pentium 4 or Athlon 64
> > machines, I will provide an ssse3 variant that is also 50% faster than
> > the current one. That is also the reason why I omitted drawing the
> > current ssse3 implementation's performance.
> > 
> > 
> > In this version the header first checks 128 bytes with unaligned loads
> > unless they cross a page boundary. That allows a more effective loop,
> > as at the end of the loop we can then simply write the last 64 bytes
> > instead of special-casing to avoid writing before the start.
> > 
> > I tried several variants of the header; as we first read 16 bytes into
> > the xmm0 register, the question is whether they can be reused. I used
> > an evolver to select the best variant; there was almost no difference
> > in performance between these.
> > 
> > Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-128.
> > There is a possibility to gain some cycles with a different grouping;
> > I will post an improvement later if I find something.
> > 
> > 
> > The first problem was reading ahead. Rereading 8 bytes looked a bit
> > faster than a move from xmm.
> > 
> > Then I tried to decide when to reuse or reread. In the 4-7 byte case
> > it was faster to reread than to use bit shifts to get the second half.
> > For 1-3 bytes I use the following copy, with s[0] and s[1] taken from
> > the rdx register via byte shifts.
> > 
> >   To test: a branch vs. this branchless version, which works for i = 0, 1, 2
> >    d[i] = 0;
> >    d[i/2] = s[1];
> >    d[0] = s[0];
> > 
> > I also added an avx2 loop. The reason I don't use avx2 in the header
> > is its high latency. I could test whether using it for bytes 64-128
> > would give a speedup.
> > 
> > As for technical issues, I needed to move the old strcpy_sse2_unaligned
> > implementation into strncpy_sse2_unaligned, as strncpy is a function
> > that should be optimized for size, not performance. For now I will
> > keep these unchanged.
> > 
> > As for performance, these are 15%-30% faster than the current one for
> > a gcc workload on Haswell and Ivy Bridge.
> > 
> > As for the avx2 version, it is currently 6% faster on this workload,
> > mainly as it is bash and has a lot of large loads, so the avx2 loop
> > helps.
> > 
> > I used my profiler to show the improvement; see here:
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
> > 
> > and source is here
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2
> > 
> > Comments?
> > 
> >         * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> > 	Add __strcpy_avx2 and __stpcpy_avx2.
> >         * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S and
> > 	strcpy_avx2.S.
> >         * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
> >         * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
> >         * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
> > 	implementation.
> >         * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
> >         * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
> >         * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved
> > 	strcpy-sse2-unaligned.S here.
> >         * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
> >         * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect
> > 	from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S.
> >         * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
> >         * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
> > 
> > ---
> >  sysdeps/x86_64/multiarch/Makefile                 |    2 +-
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c        |    2 +
> >  sysdeps/x86_64/multiarch/stpcpy-avx2.S            |    3 +
> >  sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S  |  439 ++++-
> >  sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S |    3 +-
> >  sysdeps/x86_64/multiarch/stpncpy.S                |    5 +-
> >  sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S  |    2 +-
> >  sysdeps/x86_64/multiarch/strcpy-avx2.S            |    4 +
> >  sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S  | 1890 +-------------------
> >  sysdeps/x86_64/multiarch/strcpy.S                 |   22 +-
> >  sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++-
> >  sysdeps/x86_64/multiarch/strncpy.S                |   88 +-
> >  14 files changed, 2435 insertions(+), 1921 deletions(-)
> >  create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S
> >  create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S
> > 
> > 
> > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> > index d7002a9..c573744 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4
> >  endif
> >  
> >  ifeq (yes,$(config-cflags-avx2))
> > -sysdep_routines += memset-avx2
> > +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
> >  endif
> >  endif
> >  
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index b64e4f1..d398e43 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >  
> >    /* Support sysdeps/x86_64/multiarch/stpcpy.S.  */
> >    IFUNC_IMPL (i, name, stpcpy,
> > +	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_AVX2, __stpcpy_avx2)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
> > @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >  
> >    /* Support sysdeps/x86_64/multiarch/strcpy.S.  */
> >    IFUNC_IMPL (i, name, strcpy,
> > +	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
> > diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> > new file mode 100644
> > index 0000000..bd30ef6
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> > @@ -0,0 +1,3 @@
> > +#define USE_AVX2
> > +#define STPCPY __stpcpy_avx2
> > +#include "stpcpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > index 34231f8..695a236 100644
> > --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > @@ -1,3 +1,436 @@
> > -#define USE_AS_STPCPY
> > -#define STRCPY __stpcpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +/* stpcpy with SSE2 and unaligned load
> > +   Copyright (C) 2015 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#ifndef STPCPY
> > +# define STPCPY __stpcpy_sse2_unaligned
> > +#endif
> > +
> > +ENTRY(STPCPY)
> > +	mov	%esi, %edx
> > +#ifdef AS_STRCPY
> > +	movq    %rdi, %rax
> > +#endif
> > +	pxor	%xmm4, %xmm4
> > +	pxor	%xmm5, %xmm5
> > +	andl	$4095, %edx
> > +	cmp	$3968, %edx
> > +	ja	L(cross_page)
> > +
> > +	movdqu	(%rsi), %xmm0
> > +	pcmpeqb	%xmm0, %xmm4
> > +	pmovmskb %xmm4, %edx
> > +	testl	%edx, %edx
> > +	je	L(more16bytes)
> > +	bsf	%edx, %ecx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rcx), %rax
> > +#endif
> > +	cmp	$7, %ecx
> > +	movq	(%rsi), %rdx
> > +	jb	L(less_8_bytesb)
> > +L(8bytes_from_cross):
> > +	movq	-7(%rsi, %rcx), %rsi
> > +	movq	%rdx, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movq    %rsi, -7(%rdi, %rcx)
> > +#else
> > +	movq	%rsi, -7(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(less_8_bytesb):
> > +	cmp	$2, %ecx
> > +	jbe	L(less_4_bytes)
> > +L(4bytes_from_cross):
> > +	mov	-3(%rsi, %rcx), %esi
> > +	mov	%edx, (%rdi)
> > +#ifdef AS_STRCPY
> > +        mov     %esi, -3(%rdi, %rcx)
> > +#else
> > +	mov	%esi, -3(%rax)
> > +#endif
> > +	ret
> > +
> > +.p2align 4
> > + L(less_4_bytes):
> > + /*
> > +  Test branch vs this branchless that works for i 0,1,2
> > +   d[i] = 0;
> > +   d[i/2] = s[1];
> > +   d[0] = s[0];
> > +  */
> > +#ifdef AS_STRCPY
> > +	movb	$0, (%rdi, %rcx)
> > +#endif
> > +
> > +	shr	$1, %ecx
> > +	mov	%edx, %esi
> > +	shr	$8, %edx
> > +	movb	%dl, (%rdi, %rcx)
> > +#ifndef AS_STRCPY
> > +	movb	$0, (%rax)
> > +#endif
> > +	movb	%sil, (%rdi)
> > +	ret
> > +
> > +
> > +
> > +
> > +
> > +	.p2align 4
> > +L(more16bytes):
> > +	pxor	%xmm6, %xmm6
> > +	movdqu	16(%rsi), %xmm1
> > +	pxor	%xmm7, %xmm7
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pmovmskb %xmm5, %edx
> > +	testl	%edx, %edx
> > +	je	L(more32bytes)
> > +	bsf	%edx, %edx
> > +#ifdef AS_STRCPY
> > +        movdqu  1(%rsi, %rdx), %xmm1
> > +        movdqu  %xmm0, (%rdi)
> > +	movdqu  %xmm1, 1(%rdi, %rdx)
> > +#else
> > +	lea	16(%rdi, %rdx), %rax
> > +	movdqu	1(%rsi, %rdx), %xmm1
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(more32bytes):
> > +	movdqu	32(%rsi), %xmm2
> > +	movdqu	48(%rsi), %xmm3
> > +
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %edx
> > +	shl	$16, %edx
> > +	pmovmskb %xmm6, %ecx
> > +	or	%ecx, %edx
> > +	je	L(more64bytes)
> > +	bsf	%edx, %edx
> > +#ifndef AS_STRCPY
> > +	lea	32(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	1(%rsi, %rdx), %xmm2
> > +	movdqu	17(%rsi, %rdx), %xmm3
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, 16(%rdi)
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm2, 1(%rdi, %rdx)
> > +        movdqu  %xmm3, 17(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(more64bytes):
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, 16(%rdi)
> > +	movdqu	%xmm2, 32(%rdi)
> > +	movdqu	%xmm3, 48(%rdi)
> > +	movdqu	64(%rsi), %xmm0
> > +	movdqu	80(%rsi), %xmm1
> > +	movdqu	96(%rsi), %xmm2
> > +	movdqu	112(%rsi), %xmm3
> > +
> > +	pcmpeqb	%xmm0, %xmm4
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm4, %ecx
> > +	pmovmskb %xmm5, %edx
> > +	pmovmskb %xmm6, %r8d
> > +	pmovmskb %xmm7, %r9d
> > +	shl	$16, %edx
> > +	or	%ecx, %edx
> > +	shl	$32, %r8
> > +	shl	$48, %r9
> > +	or	%r8, %rdx
> > +	or	%r9, %rdx
> > +	test	%rdx, %rdx
> > +	je	L(prepare_loop)
> > +	bsf	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	64(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	1(%rsi, %rdx), %xmm0
> > +	movdqu	17(%rsi, %rdx), %xmm1
> > +	movdqu	33(%rsi, %rdx), %xmm2
> > +	movdqu	49(%rsi, %rdx), %xmm3
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm0, 1(%rdi, %rdx)
> > +        movdqu  %xmm1, 17(%rdi, %rdx)
> > +        movdqu  %xmm2, 33(%rdi, %rdx)
> > +        movdqu  %xmm3, 49(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm0, -63(%rax)
> > +	movdqu	%xmm1, -47(%rax)
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +
> > +	.p2align 4
> > +L(prepare_loop):
> > +	movdqu	%xmm0, 64(%rdi)
> > +	movdqu	%xmm1, 80(%rdi)
> > +	movdqu	%xmm2, 96(%rdi)
> > +	movdqu	%xmm3, 112(%rdi)
> > +
> > +	subq	%rsi, %rdi
> > +	add	$64, %rsi
> > +	andq	$-64, %rsi
> > +	addq	%rsi, %rdi
> > +	jmp	L(loop_entry)
> > +
> > +#ifdef USE_AVX2
> > +	.p2align 4
> > +L(loop):
> > +	vmovdqu	%ymm1, (%rdi)
> > +	vmovdqu	%ymm3, 32(%rdi)
> > +L(loop_entry):
> > +	vmovdqa	96(%rsi), %ymm3
> > +	vmovdqa	64(%rsi), %ymm1
> > +	vpminub	%ymm3, %ymm1, %ymm2
> > +	addq	$64, %rsi
> > +	addq	$64, %rdi
> > +	vpcmpeqb %ymm5, %ymm2, %ymm0
> > +	vpmovmskb %ymm0, %edx
> > +	test	%edx, %edx
> > +	je	L(loop)
> > +	salq	$32, %rdx
> > +	vpcmpeqb %ymm5, %ymm1, %ymm4
> > +	vpmovmskb %ymm4, %ecx
> > +	or	%rcx, %rdx
> > +	bsfq	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#endif
> > +	vmovdqu	-63(%rsi, %rdx), %ymm0
> > +	vmovdqu	-31(%rsi, %rdx), %ymm2
> > +#ifdef AS_STRCPY
> > +        vmovdqu  %ymm0, -63(%rdi, %rdx)
> > +        vmovdqu  %ymm2, -31(%rdi, %rdx)
> > +#else
> > +	vmovdqu	%ymm0, -63(%rax)
> > +	vmovdqu	%ymm2, -31(%rax)
> > +#endif
> > +	vzeroupper
> > +	ret
> > +#else
> > +	.p2align 4
> > +L(loop):
> > +	movdqu	%xmm1, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +	movdqu	%xmm3, 32(%rdi)
> > +	movdqu	%xmm4, 48(%rdi)
> > +L(loop_entry):
> > +	movdqa	96(%rsi), %xmm3
> > +	movdqa	112(%rsi), %xmm4
> > +	movdqa	%xmm3, %xmm0
> > +	movdqa	80(%rsi), %xmm2
> > +	pminub	%xmm4, %xmm0
> > +	movdqa	64(%rsi), %xmm1
> > +	pminub	%xmm2, %xmm0
> > +	pminub	%xmm1, %xmm0
> > +	addq	$64, %rsi
> > +	addq	$64, %rdi
> > +	pcmpeqb	%xmm5, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	je	L(loop)
> > +	salq	$48, %rdx
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pmovmskb %xmm5, %ecx
> > +#ifdef AS_STRCPY
> > +	pmovmskb %xmm6, %r8d
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %r9d
> > +	sal	$16, %r8d
> > +	or	%r8d, %ecx
> > +#else
> > +	pmovmskb %xmm6, %eax
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %r9d
> > +	sal	$16, %eax
> > +	or	%eax, %ecx
> > +#endif
> > +	salq	$32, %r9
> > +	orq	%rcx, %rdx
> > +	orq	%r9, %rdx
> > +	bsfq	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	-63(%rsi, %rdx), %xmm0
> > +	movdqu	-47(%rsi, %rdx), %xmm1
> > +	movdqu	-31(%rsi, %rdx), %xmm2
> > +	movdqu	-15(%rsi, %rdx), %xmm3
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm0, -63(%rdi, %rdx)
> > +        movdqu  %xmm1, -47(%rdi, %rdx)
> > +        movdqu  %xmm2, -31(%rdi, %rdx)
> > +        movdqu  %xmm3, -15(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm0, -63(%rax)
> > +	movdqu	%xmm1, -47(%rax)
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +#endif
> > +
> > +	.p2align 4
> > +L(cross_page):
> > +	movq	%rsi, %rcx
> > +	pxor	%xmm0, %xmm0
> > +	and	$15, %ecx
> > +	movq	%rsi, %r9
> > +	movq	%rdi, %r10
> > +	subq	%rcx, %rsi
> > +	subq	%rcx, %rdi
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm0, %xmm1
> > +	pmovmskb %xmm1, %edx
> > +	shr	%cl, %edx
> > +	shl	%cl, %edx
> > +	test	%edx, %edx
> > +	jne	L(less_32_cross)
> > +
> > +	addq	$16, %rsi
> > +	addq	$16, %rdi
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	jne	L(less_32_cross)
> > +	movdqu	%xmm1, (%rdi)
> > +
> > +	movdqu	(%r9), %xmm0
> > +	movdqu	%xmm0, (%r10)
> > +
> > +	mov	$8, %rcx
> > +L(cross_loop):
> > +	addq	$16, %rsi
> > +	addq	$16, %rdi
> > +	pxor	%xmm0, %xmm0
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	jne	L(return_cross)
> > +	movdqu	%xmm1, (%rdi)
> > +	sub	$1, %rcx
> > +	ja	L(cross_loop)
> > +
> > +	pxor	%xmm5, %xmm5
> > +	pxor	%xmm6, %xmm6
> > +	pxor	%xmm7, %xmm7
> > +
> > +	lea	-64(%rsi), %rdx
> > +	andq	$-64, %rdx
> > +	addq	%rdx, %rdi
> > +	subq	%rsi, %rdi
> > +	movq	%rdx, %rsi
> > +	jmp	L(loop_entry)
> > +
> > +	.p2align 4
> > +L(return_cross):
> > +	bsf	%edx, %edx
> > +#ifdef AS_STRCPY
> > +        movdqu  -15(%rsi, %rdx), %xmm0
> > +        movdqu  %xmm0, -15(%rdi, %rdx)
> > +#else
> > +	lea	(%rdi, %rdx), %rax
> > +	movdqu	-15(%rsi, %rdx), %xmm0
> > +	movdqu	%xmm0, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(less_32_cross):
> > +	bsf	%rdx, %rdx
> > +	lea	(%rdi, %rdx), %rcx
> > +#ifndef AS_STRCPY
> > +	mov	%rcx, %rax
> > +#endif
> > +	mov	%r9, %rsi
> > +	mov	%r10, %rdi
> > +	sub	%rdi, %rcx
> > +	cmp	$15, %ecx
> > +	jb	L(less_16_cross)
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	-15(%rsi, %rcx), %xmm1
> > +	movdqu	%xmm0, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movdqu  %xmm1, -15(%rdi, %rcx)
> > +#else
> > +	movdqu	%xmm1, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +L(less_16_cross):
> > +	cmp	$7, %ecx
> > +	jb	L(less_8_bytes_cross)
> > +	movq	(%rsi), %rdx
> > +	jmp	L(8bytes_from_cross)
> > +
> > +L(less_8_bytes_cross):
> > +	cmp	$2, %ecx
> > +	jbe	L(3_bytes_cross)
> > +	mov	(%rsi), %edx
> > +	jmp	L(4bytes_from_cross)
> > +
> > +L(3_bytes_cross):
> > +	jb	L(1_2bytes_cross)
> > +	movzwl	(%rsi), %edx
> > +	jmp	L(_3_bytesb)
> > +
> > +L(1_2bytes_cross):
> > +	movb	(%rsi), %dl
> > +	jmp	L(0_2bytes_from_cross)
> > +
> > +	.p2align 4
> > +L(less_4_bytesb):
> > +	je	L(_3_bytesb)
> > +L(0_2bytes_from_cross):
> > +	movb	%dl, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movb    $0, (%rdi, %rcx)
> > +#else
> > +	movb	$0, (%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(_3_bytesb):
> > +	movw	%dx, (%rdi)
> > +	movb	$0, 2(%rdi)
> > +	ret
> > +
> > +END(STPCPY)
> > diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > index 658520f..3f35068 100644
> > --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > @@ -1,4 +1,3 @@
> >  #define USE_AS_STPCPY
> > -#define USE_AS_STRNCPY
> >  #define STRCPY __stpncpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +#include "strncpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
> > index 2698ca6..159604a 100644
> > --- a/sysdeps/x86_64/multiarch/stpncpy.S
> > +++ b/sysdeps/x86_64/multiarch/stpncpy.S
> > @@ -1,8 +1,7 @@
> >  /* Multiple versions of stpncpy
> >     All versions must be listed in ifunc-impl-list.c.  */
> > -#define STRCPY __stpncpy
> > +#define STRNCPY __stpncpy
> >  #define USE_AS_STPCPY
> > -#define USE_AS_STRNCPY
> > -#include "strcpy.S"
> > +#include "strncpy.S"
> >  
> >  weak_alias (__stpncpy, stpncpy)
> > diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > index 81f1b40..1faa49d 100644
> > --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > @@ -275,5 +275,5 @@ L(StartStrcpyPart):
> >  #  define USE_AS_STRNCPY
> >  # endif
> >  
> > -# include "strcpy-sse2-unaligned.S"
> > +# include "strncpy-sse2-unaligned.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> > new file mode 100644
> > index 0000000..a3133a4
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> > @@ -0,0 +1,4 @@
> > +#define USE_AVX2
> > +#define AS_STRCPY
> > +#define STPCPY __strcpy_avx2
> > +#include "stpcpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > index 8f03d1d..310e4fa 100644
> > --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > @@ -1,1887 +1,3 @@
> > -/* strcpy with SSE2 and unaligned load
> > -   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> > -   Contributed by Intel Corporation.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <http://www.gnu.org/licenses/>.  */
> > -
> > -#if IS_IN (libc)
> > -
> > -# ifndef USE_AS_STRCAT
> > -#  include <sysdep.h>
> > -
> > -#  ifndef STRCPY
> > -#   define STRCPY  __strcpy_sse2_unaligned
> > -#  endif
> > -
> > -# endif
> > -
> > -# define JMPTBL(I, B)	I - B
> > -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> > -	lea	TABLE(%rip), %r11;                              \
> > -	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> > -	lea	(%r11, %rcx), %rcx;                             \
> > -	jmp	*%rcx
> > -
> > -# ifndef USE_AS_STRCAT
> > -
> > -.text
> > -ENTRY (STRCPY)
> > -#  ifdef USE_AS_STRNCPY
> > -	mov	%rdx, %r8
> > -	test	%r8, %r8
> > -	jz	L(ExitZero)
> > -#  endif
> > -	mov	%rsi, %rcx
> > -#  ifndef USE_AS_STPCPY
> > -	mov	%rdi, %rax      /* save result */
> > -#  endif
> > -
> > -# endif
> > -
> > -	and	$63, %rcx
> > -	cmp	$32, %rcx
> > -	jbe	L(SourceStringAlignmentLess32)
> > -
> > -	and	$-16, %rsi
> > -	and	$15, %rcx
> > -	pxor	%xmm0, %xmm0
> > -	pxor	%xmm1, %xmm1
> > -
> > -	pcmpeqb	(%rsi), %xmm1
> > -	pmovmskb %xmm1, %rdx
> > -	shr	%cl, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	mov	$16, %r10
> > -	sub	%rcx, %r10
> > -	cmp	%r10, %r8
> > -#  else
> > -	mov	$17, %r10
> > -	sub	%rcx, %r10
> > -	cmp	%r10, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail)
> > -
> > -	pcmpeqb	16(%rsi), %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -	add	$16, %r10
> > -	cmp	%r10, %r8
> > -	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32Bytes)
> > -
> > -	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> > -	movdqu	%xmm1, (%rdi)
> > -
> > -/* If source address alignment != destination address alignment */
> > -	.p2align 4
> > -L(Unalign16Both):
> > -	sub	%rcx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	add	%rcx, %r8
> > -# endif
> > -	mov	$16, %rcx
> > -	movdqa	(%rsi, %rcx), %xmm1
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$48, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm4
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm1
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	mov	%rsi, %rdx
> > -	lea	16(%rsi, %rcx), %rsi
> > -	and	$-0x40, %rsi
> > -	sub	%rsi, %rdx
> > -	sub	%rdx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	lea	128(%r8, %rdx), %r8
> > -# endif
> > -L(Unaligned64Loop):
> > -	movaps	(%rsi), %xmm2
> > -	movaps	%xmm2, %xmm4
> > -	movaps	16(%rsi), %xmm5
> > -	movaps	32(%rsi), %xmm3
> > -	movaps	%xmm3, %xmm6
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm5, %xmm2
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64Leave)
> > -
> > -L(Unaligned64Loop_start):
> > -	add	$64, %rdi
> > -	add	$64, %rsi
> > -	movdqu	%xmm4, -64(%rdi)
> > -	movaps	(%rsi), %xmm2
> > -	movdqa	%xmm2, %xmm4
> > -	movdqu	%xmm5, -48(%rdi)
> > -	movaps	16(%rsi), %xmm5
> > -	pminub	%xmm5, %xmm2
> > -	movaps	32(%rsi), %xmm3
> > -	movdqu	%xmm6, -32(%rdi)
> > -	movaps	%xmm3, %xmm6
> > -	movdqu	%xmm7, -16(%rdi)
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jz	L(Unaligned64Loop_start)
> > -
> > -L(Unaligned64Leave):
> > -	pxor	%xmm1, %xmm1
> > -
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pcmpeqb	%xmm5, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_0)
> > -	test	%rcx, %rcx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_16)
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pcmpeqb	%xmm7, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_32)
> > -
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	movdqu	%xmm6, 32(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	48(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm7, 48(%rdi)
> > -	add	$15, %r8
> > -	sub	%rdx, %r8
> > -	lea	49(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$48, %rsi
> > -	add	$48, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -/* If source address alignment == destination address alignment */
> > -
> > -L(SourceStringAlignmentLess32):
> > -	pxor	%xmm0, %xmm0
> > -	movdqu	(%rsi), %xmm1
> > -	movdqu	16(%rsi), %xmm2
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$16, %r8
> > -#  else
> > -	cmp	$17, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1)
> > -
> > -	pcmpeqb	%xmm2, %xmm0
> > -	movdqu	%xmm1, (%rdi)
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$32, %r8
> > -#  else
> > -	cmp	$33, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32Bytes1)
> > -
> > -	and	$-16, %rsi
> > -	and	$15, %rcx
> > -	jmp	L(Unalign16Both)
> > -
> > -/*------End of main part with loops---------------------*/
> > -
> > -/* Case1 */
> > -
> > -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> > -	.p2align 4
> > -L(CopyFrom1To16Bytes):
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTail):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1):
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -# endif
> > -L(CopyFrom1To16BytesTail1):
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes):
> > -	bsf	%rdx, %rdx
> > -	add	%rcx, %rsi
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_0):
> > -	bsf	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$63, %r8
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_16):
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	16(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$47, %r8
> > -	sub	%rdx, %r8
> > -	lea	17(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_32):
> > -	bsf	%rdx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	32(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm6, 32(%rdi)
> > -	add	$31, %r8
> > -	sub	%rdx, %r8
> > -	lea	33(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$32, %rsi
> > -	add	$32, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  ifndef USE_AS_STRCAT
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm6):
> > -	movdqu	%xmm6, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm5):
> > -	movdqu	%xmm5, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm4):
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm3):
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm1):
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -#  endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesExit):
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -/* Case2 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTailCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTail1Case2):
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -/* Case2 or Case3,  Case3 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesCase2)
> > -L(CopyFrom1To16BytesCase3):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32BytesCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTailCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTailCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1Case2OrCase3):
> > -	add	$16, %rdi
> > -	add	$16, %rsi
> > -	sub	$16, %r8
> > -L(CopyFrom1To16BytesTail1Case2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1Case2)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -# endif
> > -
> > -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> > -
> > -	.p2align 4
> > -L(Exit1):
> > -	mov	%dh, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$1, %r8
> > -	lea	1(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$2, %r8
> > -	lea	2(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit3):
> > -	mov	(%rsi), %cx
> > -	mov	%cx, (%rdi)
> > -	mov	%dh, 2(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$3, %r8
> > -	lea	3(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$4, %r8
> > -	lea	4(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit5):
> > -	mov	(%rsi), %ecx
> > -	mov	%dh, 4(%rdi)
> > -	mov	%ecx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$5, %r8
> > -	lea	5(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$6, %r8
> > -	lea	6(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$7, %r8
> > -	lea	7(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$8, %r8
> > -	lea	8(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit9):
> > -	mov	(%rsi), %rcx
> > -	mov	%dh, 8(%rdi)
> > -	mov	%rcx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$9, %r8
> > -	lea	9(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$10, %r8
> > -	lea	10(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$11, %r8
> > -	lea	11(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$12, %r8
> > -	lea	12(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$13, %r8
> > -	lea	13(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$14, %r8
> > -	lea	14(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$15, %r8
> > -	lea	15(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -	lea	16(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit17):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%dh, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$17, %r8
> > -	lea	17(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$18, %r8
> > -	lea	18(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$19, %r8
> > -	lea	19(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$20, %r8
> > -	lea	20(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dh, 20(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$21, %r8
> > -	lea	21(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$22, %r8
> > -	lea	22(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$23, %r8
> > -	lea	23(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$24, %r8
> > -	lea	24(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -	mov	%dh, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$25, %r8
> > -	lea	25(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$26, %r8
> > -	lea	26(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$27, %r8
> > -	lea	27(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$28, %r8
> > -	lea	28(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$29, %r8
> > -	lea	29(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$30, %r8
> > -	lea	30(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$31, %r8
> > -	lea	31(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$32, %r8
> > -	lea	32(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -# ifdef USE_AS_STRNCPY
> > -
> > -	.p2align 4
> > -L(StrncpyExit0):
> > -#  ifdef USE_AS_STPCPY
> > -	mov	%rdi, %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, (%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit1):
> > -	mov	(%rsi), %dl
> > -	mov	%dl, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 1(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 2(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit3):
> > -	mov	(%rsi), %cx
> > -	mov	2(%rsi), %dl
> > -	mov	%cx, (%rdi)
> > -	mov	%dl, 2(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 3(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 4(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit5):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dl
> > -	mov	%ecx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 5(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 6(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 7(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 8(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit9):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dl
> > -	mov	%rcx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 9(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 10(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 11(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 12(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 13(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 14(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 15(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 16(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit17):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cl, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 17(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 18(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 19(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 20(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	mov	20(%rsi), %dl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dl, 20(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 21(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 22(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 23(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 24(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cl, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 25(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 26(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 27(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 28(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 29(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 30(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 31(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	32(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 32(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit33):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	mov	32(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -	mov	%cl, 32(%rdi)
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 33(%rdi)
> > -#  endif
> > -	ret
> > -
> > -#  ifndef USE_AS_STRCAT
> > -
> > -	.p2align 4
> > -L(Fill0):
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill1):
> > -	mov	%dl, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill2):
> > -	mov	%dx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill3):
> > -	mov	%edx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill4):
> > -	mov	%edx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill5):
> > -	mov	%edx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill6):
> > -	mov	%edx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill7):
> > -	mov	%rdx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill8):
> > -	mov	%rdx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill9):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill10):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill11):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill12):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill13):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill14):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill15):
> > -	movdqu	%xmm0, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill16):
> > -	movdqu	%xmm0, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm2):
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesXmmExit):
> > -	bsf	%rdx, %rdx
> > -	add	$15, %r8
> > -	add	%rcx, %rdi
> > -#   ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -#   endif
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -
> > -	.p2align 4
> > -L(StrncpyFillTailWithZero):
> > -	pxor	%xmm0, %xmm0
> > -	xor	%rdx, %rdx
> > -	sub	$16, %r8
> > -	jbe	L(StrncpyFillExit)
> > -
> > -	movdqu	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -
> > -	mov	%rdi, %rsi
> > -	and	$0xf, %rsi
> > -	sub	%rsi, %rdi
> > -	add	%rsi, %r8
> > -	sub	$64, %r8
> > -	jb	L(StrncpyFillLess64)
> > -
> > -L(StrncpyFillLoopMovdqa):
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	movdqa	%xmm0, 32(%rdi)
> > -	movdqa	%xmm0, 48(%rdi)
> > -	add	$64, %rdi
> > -	sub	$64, %r8
> > -	jae	L(StrncpyFillLoopMovdqa)
> > -
> > -L(StrncpyFillLess64):
> > -	add	$32, %r8
> > -	jl	L(StrncpyFillLess32)
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	add	$32, %rdi
> > -	sub	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillLess32):
> > -	add	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillExit):
> > -	add	$16, %r8
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -/* end of ifndef USE_AS_STRCAT */
> > -#  endif
> > -
> > -	.p2align 4
> > -L(UnalignedLeaveCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64LeaveCase2)
> > -L(Unaligned64LeaveCase3):
> > -	lea	64(%r8), %rcx
> > -	and	$-16, %rcx
> > -	add	$48, %r8
> > -	jl	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm4, (%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm6, 32(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm7, 48(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	64(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 64(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Unaligned64LeaveCase2):
> > -	xor	%rcx, %rcx
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$48, %r8
> > -	jle	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -	pcmpeqb	%xmm5, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm7, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm6, 32(%rdi)
> > -	lea	16(%rdi, %rcx), %rdi
> > -	lea	16(%rsi, %rcx), %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(ExitZero):
> > -#  ifndef USE_AS_STRCAT
> > -	mov	%rdi, %rax
> > -#  endif
> > -	ret
> > -
> > -# endif
> > -
> > -# ifndef USE_AS_STRCAT
> > -END (STRCPY)
> > -# else
> > -END (STRCAT)
> > -# endif
> > -	.p2align 4
> > -	.section .rodata
> > -L(ExitTable):
> > -	.int	JMPTBL(L(Exit1), L(ExitTable))
> > -	.int	JMPTBL(L(Exit2), L(ExitTable))
> > -	.int	JMPTBL(L(Exit3), L(ExitTable))
> > -	.int	JMPTBL(L(Exit4), L(ExitTable))
> > -	.int	JMPTBL(L(Exit5), L(ExitTable))
> > -	.int	JMPTBL(L(Exit6), L(ExitTable))
> > -	.int	JMPTBL(L(Exit7), L(ExitTable))
> > -	.int	JMPTBL(L(Exit8), L(ExitTable))
> > -	.int	JMPTBL(L(Exit9), L(ExitTable))
> > -	.int	JMPTBL(L(Exit10), L(ExitTable))
> > -	.int	JMPTBL(L(Exit11), L(ExitTable))
> > -	.int	JMPTBL(L(Exit12), L(ExitTable))
> > -	.int	JMPTBL(L(Exit13), L(ExitTable))
> > -	.int	JMPTBL(L(Exit14), L(ExitTable))
> > -	.int	JMPTBL(L(Exit15), L(ExitTable))
> > -	.int	JMPTBL(L(Exit16), L(ExitTable))
> > -	.int	JMPTBL(L(Exit17), L(ExitTable))
> > -	.int	JMPTBL(L(Exit18), L(ExitTable))
> > -	.int	JMPTBL(L(Exit19), L(ExitTable))
> > -	.int	JMPTBL(L(Exit20), L(ExitTable))
> > -	.int	JMPTBL(L(Exit21), L(ExitTable))
> > -	.int	JMPTBL(L(Exit22), L(ExitTable))
> > -	.int    JMPTBL(L(Exit23), L(ExitTable))
> > -	.int	JMPTBL(L(Exit24), L(ExitTable))
> > -	.int	JMPTBL(L(Exit25), L(ExitTable))
> > -	.int	JMPTBL(L(Exit26), L(ExitTable))
> > -	.int	JMPTBL(L(Exit27), L(ExitTable))
> > -	.int	JMPTBL(L(Exit28), L(ExitTable))
> > -	.int	JMPTBL(L(Exit29), L(ExitTable))
> > -	.int	JMPTBL(L(Exit30), L(ExitTable))
> > -	.int	JMPTBL(L(Exit31), L(ExitTable))
> > -	.int	JMPTBL(L(Exit32), L(ExitTable))
> > -# ifdef USE_AS_STRNCPY
> > -L(ExitStrncpyTable):
> > -	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> > -	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> > -#  ifndef USE_AS_STRCAT
> > -	.p2align 4
> > -L(FillTable):
> > -	.int	JMPTBL(L(Fill0), L(FillTable))
> > -	.int	JMPTBL(L(Fill1), L(FillTable))
> > -	.int	JMPTBL(L(Fill2), L(FillTable))
> > -	.int	JMPTBL(L(Fill3), L(FillTable))
> > -	.int	JMPTBL(L(Fill4), L(FillTable))
> > -	.int	JMPTBL(L(Fill5), L(FillTable))
> > -	.int	JMPTBL(L(Fill6), L(FillTable))
> > -	.int	JMPTBL(L(Fill7), L(FillTable))
> > -	.int	JMPTBL(L(Fill8), L(FillTable))
> > -	.int	JMPTBL(L(Fill9), L(FillTable))
> > -	.int	JMPTBL(L(Fill10), L(FillTable))
> > -	.int	JMPTBL(L(Fill11), L(FillTable))
> > -	.int	JMPTBL(L(Fill12), L(FillTable))
> > -	.int	JMPTBL(L(Fill13), L(FillTable))
> > -	.int	JMPTBL(L(Fill14), L(FillTable))
> > -	.int	JMPTBL(L(Fill15), L(FillTable))
> > -	.int	JMPTBL(L(Fill16), L(FillTable))
> > -#  endif
> > -# endif
> > -#endif
> > +#define AS_STRCPY
> > +#define STPCPY __strcpy_sse2_unaligned
> > +#include "stpcpy-sse2-unaligned.S"
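[Editor's note: the three lines above replace the whole strcpy-sse2-unaligned.S body with a macro-driven include of the stpcpy source. A minimal C sketch of that single-source trick, with illustrative names (`copy_impl`, `my_stpcpy`, `my_strcpy` are not glibc identifiers): one implementation is parameterized on whether it returns the destination (strcpy) or the end pointer (stpcpy), just as AS_STRCPY/STPCPY select the variant at preprocessing time.]

```c
#include <assert.h>
#include <string.h>

/* One shared implementation; the flag plays the role of the
   AS_STRCPY preprocessor switch in the patch.  */
static char *copy_impl(char *dst, const char *src, int return_end)
{
    size_t len = strlen(src);
    memcpy(dst, src, len + 1);           /* copy including the NUL */
    return return_end ? dst + len : dst; /* stpcpy vs strcpy result */
}

char *my_stpcpy(char *dst, const char *src) { return copy_impl(dst, src, 1); }
char *my_strcpy(char *dst, const char *src) { return copy_impl(dst, src, 0); }
```

In the real patch the selection happens at build time, so each exported symbol pays no runtime cost for the shared source.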
> > diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
> > index 9464ee8..92be04c 100644
> > --- a/sysdeps/x86_64/multiarch/strcpy.S
> > +++ b/sysdeps/x86_64/multiarch/strcpy.S
> > @@ -28,31 +28,18 @@
> >  #endif
> >  
> >  #ifdef USE_AS_STPCPY
> > -# ifdef USE_AS_STRNCPY
> > -#  define STRCPY_SSSE3		__stpncpy_ssse3
> > -#  define STRCPY_SSE2		__stpncpy_sse2
> > -#  define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> > -#  define __GI_STRCPY		__GI_stpncpy
> > -#  define __GI___STRCPY		__GI___stpncpy
> > -# else
> >  #  define STRCPY_SSSE3		__stpcpy_ssse3
> >  #  define STRCPY_SSE2		__stpcpy_sse2
> > +#  define STRCPY_AVX2		__stpcpy_avx2
> >  #  define STRCPY_SSE2_UNALIGNED	__stpcpy_sse2_unaligned
> >  #  define __GI_STRCPY		__GI_stpcpy
> >  #  define __GI___STRCPY		__GI___stpcpy
> > -# endif
> >  #else
> > -# ifdef USE_AS_STRNCPY
> > -#  define STRCPY_SSSE3		__strncpy_ssse3
> > -#  define STRCPY_SSE2		__strncpy_sse2
> > -#  define STRCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> > -#  define __GI_STRCPY		__GI_strncpy
> > -# else
> >  #  define STRCPY_SSSE3		__strcpy_ssse3
> > +#  define STRCPY_AVX2		__strcpy_avx2
> >  #  define STRCPY_SSE2		__strcpy_sse2
> >  #  define STRCPY_SSE2_UNALIGNED	__strcpy_sse2_unaligned
> >  #  define __GI_STRCPY		__GI_strcpy
> > -# endif
> >  #endif
> >  
> >  
> > @@ -64,7 +51,10 @@ ENTRY(STRCPY)
> >  	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> >  	jne	1f
> >  	call	__init_cpu_features
> > -1:	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
> > +1:	leaq	STRCPY_AVX2(%rip), %rax
> > +	testl   $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
> > +	jnz	2f
> > +	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
> >  	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> >  	jnz	2f
> >  	leaq	STRCPY_SSE2(%rip), %rax
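[Editor's note: the selector hunk above probes CPU feature bits in order of preference: AVX2 with fast unaligned loads first, then SSE2-unaligned, then plain SSE2 as the fallback. A hedged C sketch of that cascade; the feature-bit names and function bodies here are illustrative stand-ins, not glibc's `__cpu_features` layout.]

```c
#include <assert.h>
#include <string.h>

/* Illustrative feature flags mirroring bit_AVX_Fast_Unaligned_Load
   and bit_Fast_Unaligned_Load in the patch.  */
enum { FEAT_AVX_FAST_UNALIGNED = 1 << 0, FEAT_FAST_UNALIGNED = 1 << 1 };

typedef char *(*strcpy_fn)(char *, const char *);

/* Stand-ins for the real SIMD variants.  */
static char *strcpy_avx2(char *d, const char *s)           { return strcpy(d, s); }
static char *strcpy_sse2_unaligned(char *d, const char *s) { return strcpy(d, s); }
static char *strcpy_sse2(char *d, const char *s)           { return strcpy(d, s); }

/* Probe the most capable variant first, exactly like the cascaded
   testl/jnz sequence in the assembly selector.  */
static strcpy_fn select_strcpy(unsigned features)
{
    if (features & FEAT_AVX_FAST_UNALIGNED)
        return strcpy_avx2;
    if (features & FEAT_FAST_UNALIGNED)
        return strcpy_sse2_unaligned;
    return strcpy_sse2;
}
```

The assembly version runs once per process through the IFUNC mechanism, so later calls jump straight to the chosen variant with no branching.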
> > diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > index fcc23a7..e4c98e7 100644
> > --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > @@ -1,3 +1,1888 @@
> > -#define USE_AS_STRNCPY
> > -#define STRCPY __strncpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +/* strcpy with SSE2 and unaligned load
> > +   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> > +   Contributed by Intel Corporation.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#if IS_IN (libc)
> > +
> > +# ifndef USE_AS_STRCAT
> > +#  include <sysdep.h>
> > +
> > +#  ifndef STRCPY
> > +#   define STRCPY  __strncpy_sse2_unaligned
> > +#  endif
> > +
> > +# define USE_AS_STRNCPY
> > +# endif
> > +
> > +# define JMPTBL(I, B)	I - B
> > +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> > +	lea	TABLE(%rip), %r11;                              \
> > +	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> > +	lea	(%r11, %rcx), %rcx;                             \
> > +	jmp	*%rcx
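[Editor's note: the BRANCH_TO_JMPTBL_ENTRY macro above dispatches through a table of 32-bit offsets relative to the table's own address (JMPTBL stores `I - B`), which keeps the table position-independent. A sketch of the same idea using GCC's labels-as-values extension; `dispatch` and its labels are illustrative, and this relies on the documented GCC extension, not standard C.]

```c
#include <assert.h>

/* Table entries hold signed distances from a base label, like the
   .int JMPTBL(L(ExitN), L(ExitTable)) entries; dispatch adds the
   offset back to the base and jumps, like `lea; movslq; jmp *%rcx`.  */
static int dispatch(int n)
{
    static const int offset[] = { &&exit0 - &&exit0,
                                  &&exit1 - &&exit0,
                                  &&exit2 - &&exit0 };
    goto *(&&exit0 + offset[n]);
exit0: return 0;
exit1: return 10;
exit2: return 20;
}
```

Storing offsets instead of absolute addresses is what lets the assembly table live in `.rodata` without relocations.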
> > +
> > +# ifndef USE_AS_STRCAT
> > +
> > +.text
> > +ENTRY (STRCPY)
> > +#  ifdef USE_AS_STRNCPY
> > +	mov	%rdx, %r8
> > +	test	%r8, %r8
> > +	jz	L(ExitZero)
> > +#  endif
> > +	mov	%rsi, %rcx
> > +#  ifndef USE_AS_STPCPY
> > +	mov	%rdi, %rax      /* save result */
> > +#  endif
> > +
> > +# endif
> > +
> > +	and	$63, %rcx
> > +	cmp	$32, %rcx
> > +	jbe	L(SourceStringAlignmentLess32)
> > +
> > +	and	$-16, %rsi
> > +	and	$15, %rcx
> > +	pxor	%xmm0, %xmm0
> > +	pxor	%xmm1, %xmm1
> > +
> > +	pcmpeqb	(%rsi), %xmm1
> > +	pmovmskb %xmm1, %rdx
> > +	shr	%cl, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	mov	$16, %r10
> > +	sub	%rcx, %r10
> > +	cmp	%r10, %r8
> > +#  else
> > +	mov	$17, %r10
> > +	sub	%rcx, %r10
> > +	cmp	%r10, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail)
> > +
> > +	pcmpeqb	16(%rsi), %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +	add	$16, %r10
> > +	cmp	%r10, %r8
> > +	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32Bytes)
> > +
> > +	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> > +	movdqu	%xmm1, (%rdi)
> > +
> > +/* If source address alignment != destination address alignment */
> > +	.p2align 4
> > +L(Unalign16Both):
> > +	sub	%rcx, %rdi
> > +# ifdef USE_AS_STRNCPY
> > +	add	%rcx, %r8
> > +# endif
> > +	mov	$16, %rcx
> > +	movdqa	(%rsi, %rcx), %xmm1
> > +	movaps	16(%rsi, %rcx), %xmm2
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	pcmpeqb	%xmm2, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$48, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm3
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +	pcmpeqb	%xmm3, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm4
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm1
> > +	movdqu	%xmm4, (%rdi, %rcx)
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm2
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	pcmpeqb	%xmm2, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm3
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +	pcmpeqb	%xmm3, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	mov	%rsi, %rdx
> > +	lea	16(%rsi, %rcx), %rsi
> > +	and	$-0x40, %rsi
> > +	sub	%rsi, %rdx
> > +	sub	%rdx, %rdi
> > +# ifdef USE_AS_STRNCPY
> > +	lea	128(%r8, %rdx), %r8
> > +# endif
> > +L(Unaligned64Loop):
> > +	movaps	(%rsi), %xmm2
> > +	movaps	%xmm2, %xmm4
> > +	movaps	16(%rsi), %xmm5
> > +	movaps	32(%rsi), %xmm3
> > +	movaps	%xmm3, %xmm6
> > +	movaps	48(%rsi), %xmm7
> > +	pminub	%xmm5, %xmm2
> > +	pminub	%xmm7, %xmm3
> > +	pminub	%xmm2, %xmm3
> > +	pcmpeqb	%xmm0, %xmm3
> > +	pmovmskb %xmm3, %rdx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$64, %r8
> > +	jbe	L(UnalignedLeaveCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(Unaligned64Leave)
> > +
> > +L(Unaligned64Loop_start):
> > +	add	$64, %rdi
> > +	add	$64, %rsi
> > +	movdqu	%xmm4, -64(%rdi)
> > +	movaps	(%rsi), %xmm2
> > +	movdqa	%xmm2, %xmm4
> > +	movdqu	%xmm5, -48(%rdi)
> > +	movaps	16(%rsi), %xmm5
> > +	pminub	%xmm5, %xmm2
> > +	movaps	32(%rsi), %xmm3
> > +	movdqu	%xmm6, -32(%rdi)
> > +	movaps	%xmm3, %xmm6
> > +	movdqu	%xmm7, -16(%rdi)
> > +	movaps	48(%rsi), %xmm7
> > +	pminub	%xmm7, %xmm3
> > +	pminub	%xmm2, %xmm3
> > +	pcmpeqb	%xmm0, %xmm3
> > +	pmovmskb %xmm3, %rdx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$64, %r8
> > +	jbe	L(UnalignedLeaveCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jz	L(Unaligned64Loop_start)
> > +
> > +L(Unaligned64Leave):
> > +	pxor	%xmm1, %xmm1
> > +
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pcmpeqb	%xmm5, %xmm1
> > +	pmovmskb %xmm0, %rdx
> > +	pmovmskb %xmm1, %rcx
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_0)
> > +	test	%rcx, %rcx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_16)
> > +
> > +	pcmpeqb	%xmm6, %xmm0
> > +	pcmpeqb	%xmm7, %xmm1
> > +	pmovmskb %xmm0, %rdx
> > +	pmovmskb %xmm1, %rcx
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_32)
> > +
> > +	bsf	%rcx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	movdqu	%xmm5, 16(%rdi)
> > +	movdqu	%xmm6, 32(%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	48(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm7, 48(%rdi)
> > +	add	$15, %r8
> > +	sub	%rdx, %r8
> > +	lea	49(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$48, %rsi
> > +	add	$48, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +/* If source address alignment == destination address alignment */
> > +
> > +L(SourceStringAlignmentLess32):
> > +	pxor	%xmm0, %xmm0
> > +	movdqu	(%rsi), %xmm1
> > +	movdqu	16(%rsi), %xmm2
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	cmp	$16, %r8
> > +#  else
> > +	cmp	$17, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail1)
> > +
> > +	pcmpeqb	%xmm2, %xmm0
> > +	movdqu	%xmm1, (%rdi)
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	cmp	$32, %r8
> > +#  else
> > +	cmp	$33, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32Bytes1)
> > +
> > +	and	$-16, %rsi
> > +	and	$15, %rcx
> > +	jmp	L(Unalign16Both)
> > +
> > +/*------End of main part with loops---------------------*/
> > +
> > +/* Case1 */
> > +
> > +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> > +	.p2align 4
> > +L(CopyFrom1To16Bytes):
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +	.p2align 4
> > +L(CopyFrom1To16BytesTail):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes1):
> > +	add	$16, %rsi
> > +	add	$16, %rdi
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$16, %r8
> > +# endif
> > +L(CopyFrom1To16BytesTail1):
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes):
> > +	bsf	%rdx, %rdx
> > +	add	%rcx, %rsi
> > +	add	$16, %rdx
> > +	sub	%rcx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_0):
> > +	bsf	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm4, (%rdi)
> > +	add	$63, %r8
> > +	sub	%rdx, %r8
> > +	lea	1(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_16):
> > +	bsf	%rcx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	16(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm5, 16(%rdi)
> > +	add	$47, %r8
> > +	sub	%rdx, %r8
> > +	lea	17(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$16, %rsi
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_32):
> > +	bsf	%rdx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	movdqu	%xmm5, 16(%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	32(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm6, 32(%rdi)
> > +	add	$31, %r8
> > +	sub	%rdx, %r8
> > +	lea	33(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$32, %rsi
> > +	add	$32, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  ifndef USE_AS_STRCAT
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm6):
> > +	movdqu	%xmm6, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm5):
> > +	movdqu	%xmm5, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm4):
> > +	movdqu	%xmm4, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm3):
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm1):
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +#  endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesExit):
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +/* Case2 */
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesCase2):
> > +	add	$16, %r8
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32BytesCase2):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	add	$16, %rdx
> > +	sub	%rcx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +L(CopyFrom1To16BytesTailCase2):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +L(CopyFrom1To16BytesTail1Case2):
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +/* Case2 or Case3,  Case3 */
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesCase2)
> > +L(CopyFrom1To16BytesCase3):
> > +	add	$16, %r8
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32BytesCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32BytesCase2)
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesTailCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTailCase2)
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes1Case2OrCase3):
> > +	add	$16, %rdi
> > +	add	$16, %rsi
> > +	sub	$16, %r8
> > +L(CopyFrom1To16BytesTail1Case2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail1Case2)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +# endif
> > +
> > +/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> > +
> > +	.p2align 4
> > +L(Exit1):
> > +	mov	%dh, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$1, %r8
> > +	lea	1(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit2):
> > +	mov	(%rsi), %dx
> > +	mov	%dx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	1(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$2, %r8
> > +	lea	2(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit3):
> > +	mov	(%rsi), %cx
> > +	mov	%cx, (%rdi)
> > +	mov	%dh, 2(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	2(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$3, %r8
> > +	lea	3(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit4):
> > +	mov	(%rsi), %edx
> > +	mov	%edx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	3(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$4, %r8
> > +	lea	4(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit5):
> > +	mov	(%rsi), %ecx
> > +	mov	%dh, 4(%rdi)
> > +	mov	%ecx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	4(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$5, %r8
> > +	lea	5(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit6):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dx
> > +	mov	%ecx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	5(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$6, %r8
> > +	lea	6(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit7):
> > +	mov	(%rsi), %ecx
> > +	mov	3(%rsi), %edx
> > +	mov	%ecx, (%rdi)
> > +	mov	%edx, 3(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	6(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$7, %r8
> > +	lea	7(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit8):
> > +	mov	(%rsi), %rdx
> > +	mov	%rdx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	7(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$8, %r8
> > +	lea	8(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit9):
> > +	mov	(%rsi), %rcx
> > +	mov	%dh, 8(%rdi)
> > +	mov	%rcx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	8(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$9, %r8
> > +	lea	9(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit10):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dx
> > +	mov	%rcx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	9(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$10, %r8
> > +	lea	10(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit11):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	10(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$11, %r8
> > +	lea	11(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit12):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	11(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$12, %r8
> > +	lea	12(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit13):
> > +	mov	(%rsi), %rcx
> > +	mov	5(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	12(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$13, %r8
> > +	lea	13(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit14):
> > +	mov	(%rsi), %rcx
> > +	mov	6(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	13(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$14, %r8
> > +	lea	14(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit15):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 7(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	14(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$15, %r8
> > +	lea	15(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit16):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	15(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$16, %r8
> > +	lea	16(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit17):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%dh, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	16(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$17, %r8
> > +	lea	17(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit18):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	17(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$18, %r8
> > +	lea	18(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit19):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	18(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$19, %r8
> > +	lea	19(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit20):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	19(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$20, %r8
> > +	lea	20(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit21):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +	mov	%dh, 20(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	20(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$21, %r8
> > +	lea	21(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit22):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	14(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 14(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	21(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$22, %r8
> > +	lea	22(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit23):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	22(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$23, %r8
> > +	lea	23(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit24):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	23(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$24, %r8
> > +	lea	24(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit25):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +	mov	%dh, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	24(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$25, %r8
> > +	lea	25(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit26):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cx, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	25(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$26, %r8
> > +	lea	26(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit27):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	23(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 23(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	26(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$27, %r8
> > +	lea	27(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit28):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	27(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$28, %r8
> > +	lea	28(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit29):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	13(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 13(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	28(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$29, %r8
> > +	lea	29(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit30):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	14(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 14(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	29(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$30, %r8
> > +	lea	30(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit31):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	15(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	30(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$31, %r8
> > +	lea	31(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit32):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	31(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$32, %r8
> > +	lea	32(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +# ifdef USE_AS_STRNCPY
> > +
> > +	.p2align 4
> > +L(StrncpyExit0):
> > +#  ifdef USE_AS_STPCPY
> > +	mov	%rdi, %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, (%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit1):
> > +	mov	(%rsi), %dl
> > +	mov	%dl, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	1(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 1(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit2):
> > +	mov	(%rsi), %dx
> > +	mov	%dx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	2(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 2(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit3):
> > +	mov	(%rsi), %cx
> > +	mov	2(%rsi), %dl
> > +	mov	%cx, (%rdi)
> > +	mov	%dl, 2(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	3(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 3(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit4):
> > +	mov	(%rsi), %edx
> > +	mov	%edx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	4(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 4(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit5):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dl
> > +	mov	%ecx, (%rdi)
> > +	mov	%dl, 4(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	5(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 5(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit6):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dx
> > +	mov	%ecx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	6(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 6(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit7):
> > +	mov	(%rsi), %ecx
> > +	mov	3(%rsi), %edx
> > +	mov	%ecx, (%rdi)
> > +	mov	%edx, 3(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	7(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 7(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit8):
> > +	mov	(%rsi), %rdx
> > +	mov	%rdx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	8(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 8(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit9):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dl
> > +	mov	%rcx, (%rdi)
> > +	mov	%dl, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	9(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 9(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit10):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dx
> > +	mov	%rcx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	10(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 10(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit11):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	11(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 11(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit12):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	12(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 12(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit13):
> > +	mov	(%rsi), %rcx
> > +	mov	5(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	13(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 13(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit14):
> > +	mov	(%rsi), %rcx
> > +	mov	6(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	14(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 14(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit15):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 7(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	15(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 15(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit16):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	16(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 16(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit17):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cl, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	17(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 17(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit18):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	18(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 18(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit19):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	19(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 19(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit20):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	20(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 20(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit21):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	mov	20(%rsi), %dl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +	mov	%dl, 20(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	21(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 21(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit22):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	14(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 14(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	22(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 22(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit23):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	23(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 23(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit24):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	24(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 24(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit25):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cl, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	25(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 25(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit26):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cx, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	26(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 26(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit27):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	23(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 23(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	27(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 27(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit28):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	28(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 28(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit29):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	13(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 13(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	29(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 29(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit30):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	14(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 14(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	30(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 30(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit31):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	15(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	31(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 31(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit32):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	32(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 32(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit33):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	mov	32(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +	mov	%cl, 32(%rdi)
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 33(%rdi)
> > +#  endif
> > +	ret
> > +
> > +#  ifndef USE_AS_STRCAT
> > +
> > +	.p2align 4
> > +L(Fill0):
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill1):
> > +	mov	%dl, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill2):
> > +	mov	%dx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill3):
> > +	mov	%edx, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill4):
> > +	mov	%edx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill5):
> > +	mov	%edx, (%rdi)
> > +	mov	%dl, 4(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill6):
> > +	mov	%edx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill7):
> > +	mov	%rdx, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill8):
> > +	mov	%rdx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill9):
> > +	mov	%rdx, (%rdi)
> > +	mov	%dl, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill10):
> > +	mov	%rdx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill11):
> > +	mov	%rdx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill12):
> > +	mov	%rdx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill13):
> > +	mov	%rdx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill14):
> > +	mov	%rdx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill15):
> > +	movdqu	%xmm0, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill16):
> > +	movdqu	%xmm0, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm2):
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesXmmExit):
> > +	bsf	%rdx, %rdx
> > +	add	$15, %r8
> > +	add	%rcx, %rdi
> > +#   ifdef USE_AS_STPCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#   endif
> > +	sub	%rdx, %r8
> > +	lea	1(%rdi, %rdx), %rdi
> > +
> > +	.p2align 4
> > +L(StrncpyFillTailWithZero):
> > +	pxor	%xmm0, %xmm0
> > +	xor	%rdx, %rdx
> > +	sub	$16, %r8
> > +	jbe	L(StrncpyFillExit)
> > +
> > +	movdqu	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +
> > +	mov	%rdi, %rsi
> > +	and	$0xf, %rsi
> > +	sub	%rsi, %rdi
> > +	add	%rsi, %r8
> > +	sub	$64, %r8
> > +	jb	L(StrncpyFillLess64)
> > +
> > +L(StrncpyFillLoopMovdqa):
> > +	movdqa	%xmm0, (%rdi)
> > +	movdqa	%xmm0, 16(%rdi)
> > +	movdqa	%xmm0, 32(%rdi)
> > +	movdqa	%xmm0, 48(%rdi)
> > +	add	$64, %rdi
> > +	sub	$64, %r8
> > +	jae	L(StrncpyFillLoopMovdqa)
> > +
> > +L(StrncpyFillLess64):
> > +	add	$32, %r8
> > +	jl	L(StrncpyFillLess32)
> > +	movdqa	%xmm0, (%rdi)
> > +	movdqa	%xmm0, 16(%rdi)
> > +	add	$32, %rdi
> > +	sub	$16, %r8
> > +	jl	L(StrncpyFillExit)
> > +	movdqa	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +L(StrncpyFillLess32):
> > +	add	$16, %r8
> > +	jl	L(StrncpyFillExit)
> > +	movdqa	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +L(StrncpyFillExit):
> > +	add	$16, %r8
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +/* end of ifndef USE_AS_STRCAT */
> > +#  endif
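
[Editor's note: a hedged C model of the L(StrncpyFillTailWithZero) logic above, for readers not fluent in the asm. strncpy must zero the rest of the n-byte buffer once the source string ends; the asm does one unaligned 16-byte store, rounds the pointer down to a 16-byte boundary (adding the misalignment back into the count, since those bytes are already zero), runs a 64-byte aligned loop, and finishes via 32-/16-byte stores and L(FillTable). The function name and structure here are illustrative, not glibc's.]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of StrncpyFillTailWithZero: zero r bytes starting at p,
   mirroring the asm's alignment bookkeeping.  */
static void fill_tail(unsigned char *p, size_t r)
{
    if (r <= 16) {
        memset(p, 0, r);               /* small tails go via FillTable */
        return;
    }
    memset(p, 0, 16);                  /* unaligned 16-byte head store */
    unsigned char *q = p + 16;
    size_t mis = (uintptr_t) q & 15;
    q -= mis;                          /* round down; those bytes are zero already */
    r = r - 16 + mis;                  /* overlapping stores are harmless */
    while (r >= 64) {                  /* the 4x movdqa aligned loop */
        memset(q, 0, 64);
        q += 64;
        r -= 64;
    }
    memset(q, 0, r);                   /* 32-/16-byte stores + FillTable in the asm */
}
```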
> > +
> > +	.p2align 4
> > +L(UnalignedLeaveCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(Unaligned64LeaveCase2)
> > +L(Unaligned64LeaveCase3):
> > +	lea	64(%r8), %rcx
> > +	and	$-16, %rcx
> > +	add	$48, %r8
> > +	jl	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm4, (%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm5, 16(%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm6, 32(%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm7, 48(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	64(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 64(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Unaligned64LeaveCase2):
> > +	xor	%rcx, %rcx
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$48, %r8
> > +	jle	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +	pcmpeqb	%xmm5, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	add	$16, %rcx
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +
> > +	pcmpeqb	%xmm6, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm5, 16(%rdi)
> > +	add	$16, %rcx
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +
> > +	pcmpeqb	%xmm7, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm6, 32(%rdi)
> > +	lea	16(%rdi, %rcx), %rdi
> > +	lea	16(%rsi, %rcx), %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(ExitZero):
> > +#  ifndef USE_AS_STRCAT
> > +	mov	%rdi, %rax
> > +#  endif
> > +	ret
> > +
> > +# endif
> > +
> > +# ifndef USE_AS_STRCAT
> > +END (STRCPY)
> > +# else
> > +END (STRCAT)
> > +# endif
> > +	.p2align 4
> > +	.section .rodata
> > +L(ExitTable):
> > +	.int	JMPTBL(L(Exit1), L(ExitTable))
> > +	.int	JMPTBL(L(Exit2), L(ExitTable))
> > +	.int	JMPTBL(L(Exit3), L(ExitTable))
> > +	.int	JMPTBL(L(Exit4), L(ExitTable))
> > +	.int	JMPTBL(L(Exit5), L(ExitTable))
> > +	.int	JMPTBL(L(Exit6), L(ExitTable))
> > +	.int	JMPTBL(L(Exit7), L(ExitTable))
> > +	.int	JMPTBL(L(Exit8), L(ExitTable))
> > +	.int	JMPTBL(L(Exit9), L(ExitTable))
> > +	.int	JMPTBL(L(Exit10), L(ExitTable))
> > +	.int	JMPTBL(L(Exit11), L(ExitTable))
> > +	.int	JMPTBL(L(Exit12), L(ExitTable))
> > +	.int	JMPTBL(L(Exit13), L(ExitTable))
> > +	.int	JMPTBL(L(Exit14), L(ExitTable))
> > +	.int	JMPTBL(L(Exit15), L(ExitTable))
> > +	.int	JMPTBL(L(Exit16), L(ExitTable))
> > +	.int	JMPTBL(L(Exit17), L(ExitTable))
> > +	.int	JMPTBL(L(Exit18), L(ExitTable))
> > +	.int	JMPTBL(L(Exit19), L(ExitTable))
> > +	.int	JMPTBL(L(Exit20), L(ExitTable))
> > +	.int	JMPTBL(L(Exit21), L(ExitTable))
> > +	.int	JMPTBL(L(Exit22), L(ExitTable))
> > +	.int	JMPTBL(L(Exit23), L(ExitTable))
> > +	.int	JMPTBL(L(Exit24), L(ExitTable))
> > +	.int	JMPTBL(L(Exit25), L(ExitTable))
> > +	.int	JMPTBL(L(Exit26), L(ExitTable))
> > +	.int	JMPTBL(L(Exit27), L(ExitTable))
> > +	.int	JMPTBL(L(Exit28), L(ExitTable))
> > +	.int	JMPTBL(L(Exit29), L(ExitTable))
> > +	.int	JMPTBL(L(Exit30), L(ExitTable))
> > +	.int	JMPTBL(L(Exit31), L(ExitTable))
> > +	.int	JMPTBL(L(Exit32), L(ExitTable))
> > +# ifdef USE_AS_STRNCPY
> > +L(ExitStrncpyTable):
> > +	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> > +#  ifndef USE_AS_STRCAT
> > +	.p2align 4
> > +L(FillTable):
> > +	.int	JMPTBL(L(Fill0), L(FillTable))
> > +	.int	JMPTBL(L(Fill1), L(FillTable))
> > +	.int	JMPTBL(L(Fill2), L(FillTable))
> > +	.int	JMPTBL(L(Fill3), L(FillTable))
> > +	.int	JMPTBL(L(Fill4), L(FillTable))
> > +	.int	JMPTBL(L(Fill5), L(FillTable))
> > +	.int	JMPTBL(L(Fill6), L(FillTable))
> > +	.int	JMPTBL(L(Fill7), L(FillTable))
> > +	.int	JMPTBL(L(Fill8), L(FillTable))
> > +	.int	JMPTBL(L(Fill9), L(FillTable))
> > +	.int	JMPTBL(L(Fill10), L(FillTable))
> > +	.int	JMPTBL(L(Fill11), L(FillTable))
> > +	.int	JMPTBL(L(Fill12), L(FillTable))
> > +	.int	JMPTBL(L(Fill13), L(FillTable))
> > +	.int	JMPTBL(L(Fill14), L(FillTable))
> > +	.int	JMPTBL(L(Fill15), L(FillTable))
> > +	.int	JMPTBL(L(Fill16), L(FillTable))
> > +#  endif
> > +# endif
> > +#endif
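
[Editor's note: the L(ExitTable)/L(ExitStrncpyTable)/L(FillTable) blocks above store 4-byte label offsets relative to the table base (the JMPTBL macro) and dispatch with BRANCH_TO_JMPTBL_ENTRY, which keeps the tables small and position-independent. A simplified C model, using a plain function-pointer table indexed by the remaining byte count; names are illustrative, and the relative-offset encoding is elided.]

```c
#include <stddef.h>
#include <string.h>

typedef void (*tail_fn)(char *dst, const char *src);

/* One handler per residual length, like StrncpyExit0..StrncpyExit3;
   tail3 mirrors the asm's overlapping word + byte stores.  */
static void tail0(char *d, const char *s) { (void) d; (void) s; }
static void tail1(char *d, const char *s) { d[0] = s[0]; }
static void tail2(char *d, const char *s) { memcpy(d, s, 2); }
static void tail3(char *d, const char *s) { memcpy(d, s, 2); d[2] = s[2]; }

static const tail_fn tail_table[] = { tail0, tail1, tail2, tail3 };

/* One indirect jump, no per-byte loop -- the point of the jump table.  */
static void copy_tail(char *d, const char *s, size_t n)
{
    tail_table[n](d, s);
}
```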
> > diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> > index 6d87a0b..afbd870 100644
> > --- a/sysdeps/x86_64/multiarch/strncpy.S
> > +++ b/sysdeps/x86_64/multiarch/strncpy.S
> > @@ -1,5 +1,85 @@
> > -/* Multiple versions of strncpy
> > -   All versions must be listed in ifunc-impl-list.c.  */
> > -#define STRCPY strncpy
> > +/* Multiple versions of strncpy
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2009-2015 Free Software Foundation, Inc.
> > +   Contributed by Intel Corporation.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <init-arch.h>
> > +
> >  #define USE_AS_STRNCPY
> > -#include "strcpy.S"
> > +#ifndef STRNCPY
> > +#define STRNCPY strncpy
> > +#endif
> > +
> > +#ifdef USE_AS_STPCPY
> > +#  define STRNCPY_SSSE3		__stpncpy_ssse3
> > +#  define STRNCPY_SSE2		__stpncpy_sse2
> > +#  define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> > +#  define __GI_STRNCPY		__GI_stpncpy
> > +#  define __GI___STRNCPY		__GI___stpncpy
> > +#else
> > +#  define STRNCPY_SSSE3		__strncpy_ssse3
> > +#  define STRNCPY_SSE2		__strncpy_sse2
> > +#  define STRNCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> > +#  define __GI_STRNCPY		__GI_strncpy
> > +#endif
> > +
> > +
> > +/* Define multiple versions only for the definition in libc.  */
> > +#if IS_IN (libc)
> > +	.text
> > +ENTRY(STRNCPY)
> > +	.type	STRNCPY, @gnu_indirect_function
> > +	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> > +	jne	1f
> > +	call	__init_cpu_features
> > +1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
> > +	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> > +	jnz	2f
> > +	leaq	STRNCPY_SSE2(%rip), %rax
> > +	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
> > +	jz	2f
> > +	leaq	STRNCPY_SSSE3(%rip), %rax
> > +2:	ret
> > +END(STRNCPY)
> > +
> > +# undef ENTRY
> > +# define ENTRY(name) \
> > +	.type STRNCPY_SSE2, @function; \
> > +	.align 16; \
> > +	.globl STRNCPY_SSE2; \
> > +	.hidden STRNCPY_SSE2; \
> > +	STRNCPY_SSE2: cfi_startproc; \
> > +	CALL_MCOUNT
> > +# undef END
> > +# define END(name) \
> > +	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
> > +# undef libc_hidden_builtin_def
> > +/* It doesn't make sense to send libc-internal strncpy calls through a PLT.
> > +   The speedup we get from using SSSE3 instructions is likely eaten away
> > +   by the indirect call in the PLT.  */
> > +# define libc_hidden_builtin_def(name) \
> > +	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
> > +# undef libc_hidden_def
> > +# define libc_hidden_def(name) \
> > +	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
> > +#endif
> > +
> > +#ifndef USE_AS_STRNCPY
> > +#include "../strcpy.S"
> > +#endif
> > -- 
> > 1.8.4.rc3

