This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug string/19776] Improve sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S


https://sourceware.org/bugzilla/show_bug.cgi?id=19776

--- Comment #40 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/ifunc has been created
        at  8073bd8f850c1b7b04921a4c921d26bbbe5fbcae (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8073bd8f850c1b7b04921a4c921d26bbbe5fbcae

commit 8073bd8f850c1b7b04921a4c921d26bbbe5fbcae
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c

    Since x86-64 doesn't use the generic memory copy functions, add dummy
    memcopy.h and wordcopy.c to reduce code size.  This reduces the size
    of libc.so by about 1 KB.
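
The two new files can be essentially empty: glibc's sysdeps override mechanism
lets an architecture directory shadow a generic source file, and since the
x86-64 memcpy/memmove assembly never calls the generic word-copy helpers,
compiling nothing in their place is enough.  A sketch of what such dummy files
might contain (the exact contents in the tree may differ):

    /* sysdeps/x86_64/memcopy.h (sketch): shadows the generic memcopy.h.
       x86-64 implements memcpy/memmove entirely in assembly, so the
       generic word-copy machinery declared there is never needed.  */

    /* sysdeps/x86_64/wordcopy.c (sketch): intentionally empty.  The
       _wordcopy_fwd_* and _wordcopy_bwd_* routines the generic file
       defines are not referenced on x86-64, which is where the ~1 KB
       saving in libc.so comes from.  */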

        * sysdeps/x86_64/memcopy.h: New file.
        * sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3608e9eb3d04e22a8341ac6c397e52f531330ac1

commit 3608e9eb3d04e22a8341ac6c397e52f531330ac1
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.

    There is no change in IFUNC selection if the SSE2 and AVX2
    memcpy/memmove weren't used before.  If the SSE2 or AVX2 memcpy/memmove
    was used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced
    REP MOVSB will now be used on processors with ERMS.  The new AVX512
    memcpy/memmove will be used on processors with AVX512 that prefer
    vzeroupper.
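
ERMS here is the Enhanced REP MOVSB/STOSB CPUID feature (CPUID.(EAX=7,ECX=0):
EBX bit 9), which makes a bare "rep movsb" competitive with unrolled vector
loops for medium and large copies.  A minimal standalone C sketch of the idea
behind the *_erms variants and their selection; this is illustrative only, not
the actual glibc IFUNC resolver, which reads its own precomputed cpu-features
data:

    #include <cpuid.h>
    #include <stddef.h>

    /* Report whether the CPU advertises ERMS.  glibc does this detection
       once at startup; a standalone check is shown only for illustration.  */
    static int
    has_erms (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
      return (ebx >> 9) & 1;   /* EBX bit 9 = ERMS.  */
    }

    /* The core of an *_erms memcpy: with ERMS the microcode moves the data
       in large chunks, so this short sequence rivals vector loops.  */
    static void *
    memcpy_erms (void *dst, const void *src, size_t n)
    {
      void *ret = dst;
      __asm__ __volatile__ ("rep movsb"
                            : "+D" (dst), "+S" (src), "+c" (n)
                            :
                            : "memory");
      return ret;
    }

A resolver along these lines would return the rep-movsb variant when
has_erms () is true and fall back to the plain unaligned-vector variant
otherwise.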

    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make the new SSE2 versions the default.

    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.

        [BZ #19776]
        * sysdeps/x86_64/memcpy.S: Make it dummy.
        * sysdeps/x86_64/mempcpy.S: Likewise.
        * sysdeps/x86_64/memmove.S: New file.
        * sysdeps/x86_64/memmove_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
        * sysdeps/x86_64/memmove.c: Removed.
        * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        memcpy-sse2-unaligned, memmove-avx-unaligned,
        memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Replace
        __memmove_chk_avx512_unaligned_2 with
        __memmove_chk_avx512_unaligned.  Remove
        __memmove_chk_avx_unaligned_2.  Replace
        __memmove_chk_sse2_unaligned_2 with
        __memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
        __memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
        with __memmove_avx512_unaligned.  Replace
        __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
        Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
        with __memcpy_chk_avx512_unaligned.  Remove
        __memcpy_chk_avx_unaligned_2.  Replace
        __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
        Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
        Replace __memcpy_avx512_unaligned_2 with
        __memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
        and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
        with __mempcpy_chk_avx512_unaligned.  Remove
        __mempcpy_chk_avx_unaligned_2.  Replace
        __mempcpy_chk_sse2_unaligned_2 with
        __mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
        Replace __mempcpy_avx512_unaligned_2 with
        __mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
        Replace __mempcpy_sse2_unaligned_2 with
        __mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
        * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
        __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
        Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
        if processor has ERMS.  Default to __memcpy_sse2_unaligned.
        (ENTRY): Removed.
        (END): Likewise.
        (ENTRY_CHK): Likewise.
        (libc_hidden_builtin_def): Likewise.
        Don't include ../memcpy.S.
        * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
        __memcpy_chk_avx512_unaligned_erms and
        __memcpy_chk_avx512_unaligned.  Use
        __memcpy_chk_avx_unaligned_erms and
        __memcpy_chk_sse2_unaligned_erms if the processor has ERMS.
        Default to __memcpy_chk_sse2_unaligned.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if
        not in libc.
        * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
        (MEMCPY_SYMBOL): New.
        (MEMPCPY_SYMBOL): Likewise.
        (MEMMOVE_CHK_SYMBOL): Likewise.
        Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
        symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
        __mempcpy symbols.  Change function suffix from unaligned_2 to
        unaligned.  Provide alias for __memcpy_chk in libc.a.  Provide
        alias for memcpy in libc.a and ld.so.
        * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
        __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
        Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
        if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
        (ENTRY): Removed.
        (END): Likewise.
        (ENTRY_CHK): Likewise.
        (libc_hidden_builtin_def): Likewise.
        Don't include ../mempcpy.S.
        (mempcpy): New.  Add a weak alias.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
        __mempcpy_chk_avx512_unaligned_erms and
        __mempcpy_chk_avx512_unaligned.  Use
        __mempcpy_chk_avx_unaligned_erms and
        __mempcpy_chk_sse2_unaligned_erms if the processor has ERMS.
        Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83823b091e18fec9151752a0429a78a0d81d6317

commit 83823b091e18fec9151752a0429a78a0d81d6317
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.

    There is no change in IFUNC selection if the SSE2 and AVX2 memsets
    weren't used before.  If the SSE2 or AVX2 memset was used, the new SSE2
    or AVX2 memset optimized with Enhanced REP STOSB will now be used on
    processors with ERMS.  The new AVX512 memset will be used on processors
    with AVX512 that prefer vzeroupper.
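
The memset story is parallel, using "rep stosb" (Enhanced REP STOSB shares the
ERMS CPUID bit).  A hedged sketch of the core of an *_erms memset variant, not
the actual glibc code:

    #include <stddef.h>

    /* With ERMS the microcode handles alignment and chunking internally,
       which is why the *_erms memset variants can stay this small.  */
    static void *
    memset_erms (void *dst, int c, size_t n)
    {
      void *ret = dst;
      __asm__ __volatile__ ("rep stosb"
                            : "+D" (dst), "+c" (n)
                            : "a" ((unsigned char) c)   /* value goes in AL */
                            : "memory");
      return ret;
    }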

        [BZ #19881]
        * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
        into ...
        * sysdeps/x86_64/memset.S: This.
        (__bzero): Removed.
        (__memset_tail): Likewise.
        (__memset_chk): Likewise.
        (memset): Likewise.
        (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
        defined.
        (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
        * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
        (__memset_zero_constant_len_parameter): Check SHARED instead of
        PIC.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        memset-avx2 and memset-sse2-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Remove __memset_chk_sse2,
        __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
        * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Skip
        if not in libc.
        * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
        (MEMSET_CHK_SYMBOL): New.  Define if not defined.
        (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
        Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
        symbols.
        Properly check USE_MULTIARCH on __memset symbols.
        * sysdeps/x86_64/multiarch/memset.S (memset): Replace
        __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
        and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
        or __memset_avx2_unaligned_erms if processor has ERMS.  Support
        __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
        (memset): Removed.
        (__memset_chk): Likewise.
        (MEMSET_SYMBOL): New.
        (libc_hidden_builtin_def): Replace __memset_sse2 with
        __memset_sse2_unaligned.
        * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
        __memset_chk_sse2 and __memset_chk_avx2 with
        __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
        Use __memset_chk_sse2_unaligned_erms or
        __memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
        __memset_chk_avx512_unaligned_erms and
        __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e39197c4403a082b1c607210211cb643830b8d9c

commit e39197c4403a082b1c607210211cb643830b8d9c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memmove on large data

    memcpy/memmove benchmarks with large data show a regression on Haswell
    machines.  Using non-temporal stores in memmove on large data can
    improve performance significantly.  This patch adds a threshold for
    using non-temporal stores, set to 4 times the shared cache size: when
    the size is above the threshold, non-temporal stores are used.
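
A simplified C sketch of the threshold logic with SSE2 streaming stores; the
threshold value, function names and alignment assumptions below are
illustrative only (glibc computes __x86_shared_non_temporal_threshold from the
detected shared cache size at startup):

    #include <emmintrin.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative value; the real threshold is 4 times the shared cache
       size detected at startup.  */
    static size_t non_temporal_threshold = 4 * (8UL << 20);

    /* Streaming copy: _mm_stream_si128 writes around the cache, so a huge
       copy does not evict the caller's working set.  Simplified: assumes
       dst is 16-byte aligned, n is a multiple of 16 and the buffers do
       not overlap.  */
    static void
    copy_non_temporal (unsigned char *dst, const unsigned char *src, size_t n)
    {
      for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128 ((__m128i *) (dst + i),
                          _mm_loadu_si128 ((const __m128i *) (src + i)));
      /* Order the streaming stores before any later ordinary stores.  */
      _mm_sfence ();
    }

    static void
    copy_large (unsigned char *dst, const unsigned char *src, size_t n)
    {
      if (n >= non_temporal_threshold)
        copy_non_temporal (dst, src, n);   /* above threshold: bypass cache */
      else
        memcpy (dst, src, n);              /* below threshold: cached copy */
    }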

    For sizes below 8 vector register widths, we load all data into
    registers and store them together.  Only forward and backward loops,
    which move 4 vector registers at a time, are used to support
    overlapping addresses.  For the forward loop, we load the last 4
    vector register widths of data and the first vector register width of
    data into vector registers before the loop and store them after the
    loop.  For the backward loop, we load the first 4 vector register
    widths of data and the last vector register width of data into vector
    registers before the loop and store them after the loop.
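
The forward case can be sketched in C with SSE2 intrinsics as follows.  This
mirrors the structure described above (preload the boundary chunks, run a
4-vector loop, store the saved chunks last) for sizes above 8 vector widths
with dst below src; the helper name and the details are illustrative, not the
actual assembly:

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define VEC 16   /* SSE vector width; the real code parameterizes this. */

    /* Forward copy for n > 8 * VEC when copying upward is overlap-safe
       (dst < src).  The glibc version is assembly and also handles the
       backward case as well as the ERMS and non-temporal paths.  */
    static void
    copy_forward_large (unsigned char *dst, const unsigned char *src, size_t n)
    {
      /* Read the boundary chunks before the loop so that stores issued by
         the loop cannot clobber them when the regions overlap closely.  */
      __m128i first = _mm_loadu_si128 ((const __m128i *) src);
      __m128i last0 = _mm_loadu_si128 ((const __m128i *) (src + n - 4 * VEC));
      __m128i last1 = _mm_loadu_si128 ((const __m128i *) (src + n - 3 * VEC));
      __m128i last2 = _mm_loadu_si128 ((const __m128i *) (src + n - 2 * VEC));
      __m128i last3 = _mm_loadu_si128 ((const __m128i *) (src + n - 1 * VEC));

      /* Align the destination for the loop; the bytes skipped here are
         already covered by the saved 'first' vector.  */
      size_t skew = VEC - ((uintptr_t) dst & (VEC - 1));
      unsigned char *d = dst + skew;
      const unsigned char *s = src + skew;
      unsigned char *stop = dst + n - 4 * VEC;

      /* Main loop: move 4 vectors per iteration, loads before stores.  */
      while (d < stop)
        {
          __m128i v0 = _mm_loadu_si128 ((const __m128i *) (s + 0 * VEC));
          __m128i v1 = _mm_loadu_si128 ((const __m128i *) (s + 1 * VEC));
          __m128i v2 = _mm_loadu_si128 ((const __m128i *) (s + 2 * VEC));
          __m128i v3 = _mm_loadu_si128 ((const __m128i *) (s + 3 * VEC));
          _mm_store_si128 ((__m128i *) (d + 0 * VEC), v0);
          _mm_store_si128 ((__m128i *) (d + 1 * VEC), v1);
          _mm_store_si128 ((__m128i *) (d + 2 * VEC), v2);
          _mm_store_si128 ((__m128i *) (d + 3 * VEC), v3);
          d += 4 * VEC;
          s += 4 * VEC;
        }

      /* Store the saved boundary chunks after the loop.  */
      _mm_storeu_si128 ((__m128i *) (dst + n - 4 * VEC), last0);
      _mm_storeu_si128 ((__m128i *) (dst + n - 3 * VEC), last1);
      _mm_storeu_si128 ((__m128i *) (dst + n - 2 * VEC), last2);
      _mm_storeu_si128 ((__m128i *) (dst + n - 1 * VEC), last3);
      _mm_storeu_si128 ((__m128i *) dst, first);
    }

The backward loop is the mirror image: preload the first 4 vector widths and
the last one, copy downward 4 vectors at a time, then store the saved chunks.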

        * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
        New.
        (init_cacheinfo): Set __x86_shared_non_temporal_threshold to
        4 times of shared cache size.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
        (PREFETCHNT): New.
        (VMOVNT): Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
        (PREFETCHNT): Likewise.
        (VMOVNT): Likewise.
        * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
        (PREFETCHNT): Likewise.
        (VMOVNT): Likewise.
        (VMOVU): Changed to movups for smaller code sizes.
        (VMOVA): Changed to movaps for smaller code sizes.
        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
        comments.  Rewrite to use forward and backward loops, which move
        4 vector registers at a time, to support overlapping addresses
        and use non temporal store if size is above the threshold.

-----------------------------------------------------------------------

-- 
You are receiving this mail because:
You are on the CC list for the bug.
