This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug libc/18185] Wrong processor count for L2 cache sharing on Silvermont and Knights Landing


https://sourceware.org/bugzilla/show_bug.cgi?id=18185

--- Comment #5 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, hjl/erms/2.22 has been created
        at  b60dda5f2385aaca873069f9fb28645b82a1b711 (commit)

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=b60dda5f2385aaca873069f9fb28645b82a1b711

commit b60dda5f2385aaca873069f9fb28645b82a1b711
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 27 15:16:22 2016 -0700

    Count number of logical processors sharing L2 cache

    For Intel processors with both L2 and L3 caches, the SMT level type
    should be used to count the number of available logical processors
    sharing the L2 cache.  If there is only an L2 cache, the core level
    type should be used instead.  The number of available logical
    processors sharing the L2 cache should be used for non-inclusive L2
    and L3 caches.

        * sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
        available logical processors with SMT level type sharing L2
        cache for Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ed46697862f2b0c2db726cc4c772e6003914bd72

commit ed46697862f2b0c2db726cc4c772e6003914bd72
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 14:41:14 2016 -0700

    Remove special L2 cache case for Knights Landing

    L2 cache is shared by 2 cores on Knights Landing, which has 4 threads
    per core:

    https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing

    So L2 cache is shared by 8 threads on Knights Landing as reported by
    CPUID.  We should remove special L2 cache case for Knights Landing.

        [BZ #18185]
        * sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads
        sharing L2 cache to 2 for Knights Landing.
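
The thread count "as reported by CPUID" can be illustrated with a minimal
sketch.  It assumes the count comes from CPUID leaf 4, EAX bits 25:14
(maximum number of addressable IDs for logical processors sharing the
cache, minus one), as described in the Intel SDM; the commit itself does
not name the leaf, and the function name below is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: decode the number of logical processors sharing
   a cache from the EAX value returned by CPUID leaf 4 (assumed source
   of the count; not stated in the commit message).  Bits 25:14 hold
   the count minus one.  */
unsigned int
threads_sharing_cache (uint32_t cpuid4_eax)
{
  return ((cpuid4_eax >> 14) & 0xfff) + 1;
}
```

With 2 cores per L2 cache and 4 threads per core, the field would hold 7
and the helper would return 8, so no special case is needed.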

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=07f943915311f6f92e5a031911d32c5e7458bfd5

commit 07f943915311f6f92e5a031911d32c5e7458bfd5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 10:02:36 2016 -0700

    Correct Intel processor level type mask from CPUID

    Intel CPUID with EAX == 11 returns:

    ECX Bits 07 - 00: Level number. Same value in ECX input.
        Bits 15 - 08: Level type.
        ^^^^^^^^^^^^^^^^^^^^^^^^ This is level type.
        Bits 31 - 16: Reserved.

    Intel processor level type mask should be 0xff00, not 0xff0.

        [BZ #20119]
        * sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel
        processor level type mask for CPUID with EAX == 11.
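
The bit layout above can be sketched in C as follows.  The helpers take
the ECX value as an argument rather than executing CPUID, and their
names are hypothetical; this is an illustration of the mask, not the
code in init_cacheinfo.

```c
#include <stdint.h>

/* Bits 15 - 08 of ECX from CPUID with EAX == 11.  */
#define LEVEL_TYPE_MASK 0xff00u

/* Hypothetical helper: level number, ECX bits 07 - 00.  */
unsigned int
cpuid11_level_number (uint32_t ecx)
{
  return ecx & 0xffu;
}

/* Hypothetical helper: level type, ECX bits 15 - 08.  */
unsigned int
cpuid11_level_type (uint32_t ecx)
{
  return (ecx & LEVEL_TYPE_MASK) >> 8;
}
```

A mask of 0xff0 would instead select bits 11 - 04, mixing the upper half
of the level number into the level type.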

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=201aebf739482fbb730d10eb7cf8335629bb4de4

commit 201aebf739482fbb730d10eb7cf8335629bb4de4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 09:09:00 2016 -0700

    Check the HTT bit before counting logical threads

    Skip counting logical threads for Intel processors if the HTT bit is 0
    which indicates there is only a single logical processor.

        * sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
        logical threads if the HTT bit is 0.
        * sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
        (index_cpu_HTT): Likewise.
        (reg_HTT): Likewise.
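
A minimal sketch of the check, assuming the HTT bit is bit 28 of EDX
from CPUID with EAX == 1 (per the Intel SDM; the commit does not spell
out the bit position).  The function names and the counted parameter
are hypothetical stand-ins for the counting done in init_cacheinfo.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the HTT bit in EDX of CPUID leaf 1.  */
#define HTT_BIT (1u << 28)

bool
has_htt (uint32_t cpuid1_edx)
{
  return (cpuid1_edx & HTT_BIT) != 0;
}

/* Return the thread count to use: the counted value when HTT is set,
   otherwise 1, since there is only a single logical processor.  */
unsigned int
logical_threads (uint32_t cpuid1_edx, unsigned int counted)
{
  return has_htt (cpuid1_edx) ? counted : 1;
}
```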

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=dff8bcdab5968ac53e52ef06cabe8d921b429d22

commit dff8bcdab5968ac53e52ef06cabe8d921b429d22
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu May 19 08:49:45 2016 -0700

    Remove alignments on jump targets in memset

    X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
    increases code size but does not necessarily improve performance.
    The memset benchtest data of align vs no align on various Intel and
    AMD processors

    https://sourceware.org/bugzilla/attachment.cgi?id=9277

    shows that aligning jump targets isn't necessary.

        [BZ #20115]
        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
        Remove alignments on jump targets.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aba9d000bf8441d77f0557af360e3aea7525d03e

commit aba9d000bf8441d77f0557af360e3aea7525d03e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 08:29:22 2016 -0700

    Call init_cpu_features only if SHARED is defined

    In a static executable, since init_cpu_features is called early from
    __libc_start_main, there is no need to call it again in dl_platform_init.

        [BZ #20072]
        * sysdeps/i386/dl-machine.h (dl_platform_init): Call
        init_cpu_features only if SHARED is defined.
        * sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=6118b2d23016ec790b99b9331c3d7a45d588134e

commit 6118b2d23016ec790b99b9331c3d7a45d588134e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 13 07:18:25 2016 -0700

    Support non-inclusive caches on Intel processors

        * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support
        non-inclusive caches on Intel processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=8642c9a553d8ce8a3a0496ed11fed5a575d338c5

commit 8642c9a553d8ce8a3a0496ed11fed5a575d338c5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed May 11 05:49:09 2016 -0700

    Remove x86 ifunc-defines.sym and rtld-global-offsets.sym

    Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
    x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
    i686 and x86-64.

        * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
        Remove ifunc-defines.sym.
        * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
        Likewise.
        * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
        * sysdeps/x86/rtld-global-offsets.sym: Likewise.
        * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
        * sysdeps/x86/Makefile (gen-as-const-headers): Remove
        rtld-global-offsets.sym.
        * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
        * sysdeps/x86/cpu-features-offsets.sym: This.
        * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
        instead of <ifunc-defines.h> and <rtld-global-offsets.h>.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3038902f233a5e0028a6424685b410f6c201040f

commit 3038902f233a5e0028a6424685b410f6c201040f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun May 8 08:49:02 2016 -0700

    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86

    Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86.  No code changes on x86
    and x86_64.

        * sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
        instead of <sysdeps/x86_64/cacheinfo.c>.
        * sysdeps/x86_64/cacheinfo.c: Moved to ...
        * sysdeps/x86/cacheinfo.c: Here.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=df2b390bba18903d62c8910e808bfb0dce7f033c

commit df2b390bba18903d62c8910e808bfb0dce7f033c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 15 05:22:53 2016 -0700

    Detect Intel Goldmont and Airmont processors

    Updated from the model numbers of Goldmont and Airmont processors in
    Intel64 And IA-32 Processor Architectures Software Developer's Manual
    Volume 3 Revision 058.

        * sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
        Goldmont and Airmont processors.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=157c57198e893b4882d1feb98de2b0721ee408fc

commit 157c57198e893b4882d1feb98de2b0721ee408fc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 14:01:24 2016 -0700

    X86-64: Add dummy memcopy.h and wordcopy.c

    Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and
    wordcopy.c to reduce code size.  It reduces the size of libc.so by about
    1 KB.

        * sysdeps/x86_64/memcopy.h: New file.
        * sysdeps/x86_64/wordcopy.c: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f817b9d36215ab60d58cc744d22773b4961a2c9b

commit f817b9d36215ab60d58cc744d22773b4961a2c9b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 12:46:57 2016 -0700

    X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

    Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
    we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
    the new ones.

    No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
    before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
    memcpy/memmove optimized with Enhanced REP MOVSB will be used for
    processors with ERMS.  The new AVX512 memcpy/memmove will be used for
    processors with AVX512 which prefer vzeroupper.

    Since the new SSE2 memcpy/memmove are faster than the previous default
    memcpy/memmove used in libc.a and ld.so, we also remove the previous
    default memcpy/memmove and make them the default memcpy/memmove, except
    that non-temporal store isn't used in ld.so.

    Together, it reduces the size of libc.so by about 6 KB and the size of
    ld.so by about 2 KB.

        [BZ #19776]
        * sysdeps/x86_64/memcpy.S: Make it dummy.
        * sysdeps/x86_64/mempcpy.S: Likewise.
        * sysdeps/x86_64/memmove.S: New file.
        * sysdeps/x86_64/memmove_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
        * sysdeps/x86_64/memmove.c: Removed.
        * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        memcpy-sse2-unaligned, memmove-avx-unaligned,
        memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Replace
        __memmove_chk_avx512_unaligned_2 with
        __memmove_chk_avx512_unaligned.  Remove
        __memmove_chk_avx_unaligned_2.  Replace
        __memmove_chk_sse2_unaligned_2 with
        __memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
        __memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
        with __memmove_avx512_unaligned.  Replace
        __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
        Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
        with __memcpy_chk_avx512_unaligned.  Remove
        __memcpy_chk_avx_unaligned_2.  Replace
        __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
        Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
        Replace __memcpy_avx512_unaligned_2 with
        __memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
        and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
        with __mempcpy_chk_avx512_unaligned.  Remove
        __mempcpy_chk_avx_unaligned_2.  Replace
        __mempcpy_chk_sse2_unaligned_2 with
        __mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
        Replace __mempcpy_avx512_unaligned_2 with
        __mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
        Replace __mempcpy_sse2_unaligned_2 with
        __mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
        * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
        __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
        Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
        if processor has ERMS.  Default to __memcpy_sse2_unaligned.
        (ENTRY): Removed.
        (END): Likewise.
        (ENTRY_CHK): Likewise.
        (libc_hidden_builtin_def): Likewise.
        Don't include ../memcpy.S.
        * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
        __memcpy_chk_avx512_unaligned_erms and
        __memcpy_chk_avx512_unaligned.  Use
        __memcpy_chk_avx_unaligned_erms and
        __memcpy_chk_sse2_unaligned_erms if processor has ERMS.
        Default to __memcpy_chk_sse2_unaligned.
        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
        Change function suffix from unaligned_2 to unaligned.
        * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
        __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
        Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
        if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
        (ENTRY): Removed.
        (END): Likewise.
        (ENTRY_CHK): Likewise.
        (libc_hidden_builtin_def): Likewise.
        Don't include ../mempcpy.S.
        (mempcpy): New.  Add a weak alias.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
        __mempcpy_chk_avx512_unaligned_erms and
        __mempcpy_chk_avx512_unaligned.  Use
        __mempcpy_chk_avx_unaligned_erms and
        __mempcpy_chk_sse2_unaligned_erms if processor has ERMS.
        Default to __mempcpy_chk_sse2_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=122600f4b380b00ce0f682039fe59af4bd0edbc0

commit 122600f4b380b00ce0f682039fe59af4bd0edbc0
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:42:30 2016 -0700

    X86-64: Remove the previous SSE2/AVX2 memsets

    Since the new SSE2/AVX2 memsets are faster than the previous ones, we
    can remove the previous SSE2/AVX2 memsets and replace them with the
    new ones.  This reduces the size of libc.so by about 900 bytes.

    No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
    before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
    optimized with Enhanced REP STOSB will be used for processors with
    ERMS.  The new AVX512 memset will be used for processors with AVX512
    which prefer vzeroupper.

        [BZ #19881]
        * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
        into ...
        * sysdeps/x86_64/memset.S: This.
        (__bzero): Removed.
        (__memset_tail): Likewise.
        (__memset_chk): Likewise.
        (memset): Likewise.
        (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
        defined.
        (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
        * sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
        (__memset_zero_constant_len_parameter): Check SHARED instead of
        PIC.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        memset-avx2 and memset-sse2-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Remove __memset_chk_sse2,
        __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
        (__bzero): Enabled.
        * sysdeps/x86_64/multiarch/memset.S (memset): Replace
        __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
        and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
        or __memset_avx2_unaligned_erms if processor has ERMS.  Support
        __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
        (memset): Removed.
        (__memset_chk): Likewise.
        (MEMSET_SYMBOL): New.
        (libc_hidden_builtin_def): Replace __memset_sse2 with
        __memset_sse2_unaligned.
        * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
        __memset_chk_sse2 and __memset_chk_avx2 with
        __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
        Use __memset_chk_sse2_unaligned_erms or
        __memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
        __memset_chk_avx512_unaligned_erms and
        __memset_chk_avx512_unaligned.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0ee4375cef69e00e69ddb1d08fe0d492053208f3

commit 0ee4375cef69e00e69ddb1d08fe0d492053208f3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 17:21:45 2016 -0700

    X86-64: Use non-temporal store in memcpy on large data

    The large memcpy micro benchmark in glibc shows a regression with
    large data on a Haswell machine.  Using non-temporal stores in memcpy
    on large data can improve performance significantly.  This patch adds
    a threshold for non-temporal stores, set to 6 times the shared cache
    size.  When the size is above the threshold, non-temporal stores are
    used, except when destination and source overlap, since the
    destination may be in cache when the source is loaded.

    For sizes below 8 times the vector register width, we load all data
    into registers and store them together.  Only forward and backward
    loops, which move 4 vector registers at a time, are used to support
    overlapping addresses.  For the forward loop, we load the last 4
    vector register widths of data and the first vector register width of
    data into vector registers before the loop and store them after the
    loop.  For the backward loop, we load the first 4 vector register
    widths of data and the last vector register width of data into vector
    registers before the loop and store them after the loop.

        [BZ #19928]
        * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
        New.
        (init_cacheinfo): Set __x86_shared_non_temporal_threshold to
        6 times of shared cache size.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
        (VMOVNT): New.
        * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
        (VMOVNT): Likewise.
        * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
        (VMOVNT): Likewise.
        (VMOVU): Changed to movups for smaller code sizes.
        (VMOVA): Changed to movaps for smaller code sizes.
        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
        comments.
        (PREFETCH): New.
        (PREFETCH_SIZE): Likewise.
        (PREFETCHED_LOAD_SIZE): Likewise.
        (PREFETCH_ONE_SET): Likewise.
        Rewrite to use forward and backward loops, which move 4 vector
        registers at a time, to support overlapping addresses and use
        non temporal store if size is above the threshold and there is
        no overlap between destination and source.
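
The threshold rule described in this commit can be sketched in C.  The
function names are hypothetical, and the overlap test is a plain
address-range comparison chosen for illustration; the assembly in
memmove-vec-unaligned-erms.S may test overlap differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold is 6 times the shared cache size, as in the commit.  */
size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* Illustrative overlap test: true if [dst, dst + n) and [src, src + n)
   share at least one byte.  */
bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Use non-temporal stores only above the threshold and only when
   destination and source do not overlap.  */
bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
                        size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size)
         && !regions_overlap (dst, src, n);
}
```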

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=54667f64fa4074325ee33e487c033c313ce95067

commit 54667f64fa4074325ee33e487c033c313ce95067
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 10:19:16 2016 -0700

    X86-64: Prepare memmove-vec-unaligned-erms.S

    Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the
    default memcpy, mempcpy and memmove.

        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
        (MEMCPY_SYMBOL): New.
        (MEMPCPY_SYMBOL): Likewise.
        (MEMMOVE_CHK_SYMBOL): Likewise.
        Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
        symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
        __mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
        Provide alias for memcpy in libc.a and ld.so.

    (cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=68a0b487e274b3452a1660e4b9fad5df8d8c0284

commit 68a0b487e274b3452a1660e4b9fad5df8d8c0284
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Apr 6 09:10:18 2016 -0700

    X86-64: Prepare memset-vec-unaligned-erms.S

    Prepare memset-vec-unaligned-erms.S to make the SSE2 version the
    default memset.

        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
        (MEMSET_CHK_SYMBOL): New.  Define if not defined.
        (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
        Disabled for now.
        Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
        symbols.  Properly check USE_MULTIARCH on __memset symbols.

    (cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c2d3bdd6aec639fd66fceb3e2c145420c25d409b

commit c2d3bdd6aec639fd66fceb3e2c145420c25d409b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:21:07 2016 -0700

    Force 32-bit displacement in memset-vec-unaligned-erms.S

        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
        32-bit displacement to avoid long nop between instructions.

    (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=070a5e77d66f5520c1bbbc24dc1843a0a1c161ee

commit 070a5e77d66f5520c1bbbc24dc1843a0a1c161ee
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Apr 5 05:19:05 2016 -0700

    Add a comment in memset-sse2-unaligned-erms.S

        * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
        a comment on VMOVU and VMOVA.

    (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7e00bb9720268f142668d22e91dff7c3e6e0c08c

commit 7e00bb9720268f142668d22e91dff7c3e6e0c08c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 14:32:20 2016 -0700

    Don't put SSE2/AVX/AVX512 memmove/memset in ld.so

    Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
    and AVX512 memmove and memset in ld.so.

        * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
        if not in libc.
        * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
        Likewise.

    (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1e57539f5dbdefc96a85021b611863eaa28dd13

commit e1e57539f5dbdefc96a85021b611863eaa28dd13
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Apr 3 12:38:25 2016 -0700

    Fix memmove-vec-unaligned-erms.S

    __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
    and __memmove, since that breaks __memmove_chk.

    Don't check source == destination first since it is less common.

        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
        (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
        with unaligned_erms.
        (__memmove_erms): Skip if source == destination.
        (__memmove_unaligned_erms): Don't check source == destination
        first.

    (cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a13ac6b5ced68aadb7c1546102445f9c57f43231

commit a13ac6b5ced68aadb7c1546102445f9c57f43231
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 08:23:24 2016 -0800

    Use HAS_ARCH_FEATURE with Fast_Rep_String

    HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with
    Fast_Rep_String.

        [BZ #19762]
        * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use
        HAS_ARCH_FEATURE with Fast_Rep_String.
        * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise.
        * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise.
        * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk):
        Likewise.
        * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk):
        Likewise.
        * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise.
        * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk):
        Likewise.
        * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise.
        * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk):
        Likewise.

    (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4ad4d58ed7a444e2d9787113fce132a99b35b417

commit 4ad4d58ed7a444e2d9787113fce132a99b35b417
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 1 15:08:48 2016 -0700

    Remove Fast_Copy_Backward from Intel Core processors

    Intel Core i3, i5 and i7 processors have fast unaligned copy and
    copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
    processors to avoid confusion.

        * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
        bit_arch_Fast_Copy_Backward for Intel Core processors.

    (cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a304f3933c7f8347e49057a7a315cbd571662ff7

commit a304f3933c7f8347e49057a7a315cbd571662ff7
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:05:51 2016 -0700

    Add x86-64 memset with unaligned store and rep stosb

    Implement x86-64 memset with unaligned store and rep stosb.  Support
    16-byte, 32-byte and 64-byte vector register sizes.  A single file
    provides 2 implementations of memset, one with rep stosb and the other
    without rep stosb.  They share the same code when size is between 2
    times the vector register size and REP_STOSB_THRESHOLD, which defaults
    to 2KB.

    Key features:

    1. Use overlapping store to avoid branch.
    2. For size <= 4 times of vector register size, fully unroll the loop.
    3. For size > 4 times of vector register size, store 4 times of vector
    register size at a time.

        [BZ #19881]
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
        memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
        memset-avx512-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
        __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
        __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
        __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
        __memset_sse2_unaligned_erms, __memset_erms,
        __memset_avx2_unaligned, __memset_avx2_unaligned_erms,
        __memset_avx512_unaligned_erms and __memset_avx512_unaligned.
        * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
        file.
        * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
        Likewise.

    (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
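
Key feature 1 of the memset commit above (overlapping store to avoid
branch) can be sketched in portable C.  This is an illustration only: a
16-byte buffer stands in for a vector register, memcpy stands in for an
unaligned vector store, and the function name is hypothetical.  It
covers only sizes from one to two vector widths.

```c
#include <stddef.h>
#include <string.h>

#define VEC_SIZE 16	/* Assumed: SSE2-sized vector register.  */

/* Fill N bytes at DST with C for VEC_SIZE <= N <= 2 * VEC_SIZE.
   Returns DST, or NULL when N is outside the handled range.  */
void *
memset_overlap_sketch (void *dst, int c, size_t n)
{
  unsigned char vec[VEC_SIZE];
  unsigned char *d = dst;

  if (n < VEC_SIZE || n > 2 * VEC_SIZE)
    return NULL;

  /* Broadcast the byte, as a vector register would hold it.  */
  memset (vec, c, VEC_SIZE);
  /* One store covers the first VEC_SIZE bytes.  */
  memcpy (d, vec, VEC_SIZE);
  /* One store covers the last VEC_SIZE bytes; it may overlap the
     first store, so no branch on the exact size is needed.  */
  memcpy (d + n - VEC_SIZE, vec, VEC_SIZE);
  return dst;
}
```

Two stores handle every size in the range; bytes in the overlapping
region are simply written twice with the same value.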

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e

commit 1c5d0b0ba41376c2f0792da4f22cc1f5b2b2688e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 31 10:04:26 2016 -0700

    Add x86-64 memmove with unaligned load/store and rep movsb

    Implement x86-64 memmove with unaligned load/store and rep movsb.
    Support 16-byte, 32-byte and 64-byte vector register sizes.  When
    size <= 8 times the vector register size, there is no check for
    address overlap between source and destination.  Since the overhead
    of the overlap check is small when size > 8 times the vector register
    size, memcpy is an alias of memmove.

    A single file provides 2 implementations of memmove, one with rep movsb
    and the other without rep movsb.  They share the same code when size is
    between 2 times the vector register size and REP_MOVSB_THRESHOLD, which
    is 2KB for the 16-byte vector register size and is scaled up for larger
    vector register sizes.

    Key features:

    1. Use overlapping load and store to avoid branch.
    2. For size <= 8 times of vector register size, load all sources into
    registers and store them together.
    3. If there is no address overlap between source and destination, copy
    from both ends with 4 times of vector register size at a time.
    4. If address of destination > address of source, backward copy 8 times
    of vector register size at a time.
    5. Otherwise, forward copy 8 times of vector register size at a time.
    6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
    by falling back to backward copy 8 times of vector register size at a
    time.
    7. Skip when address of destination == address of source.

        [BZ #19776]
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
        memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
        memmove-avx512-unaligned-erms.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c
        (__libc_ifunc_impl_list): Test
        __memmove_chk_avx512_unaligned_2,
        __memmove_chk_avx512_unaligned_erms,
        __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
        __memmove_chk_sse2_unaligned_2,
        __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
        __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
        __memmove_avx512_unaligned_erms, __memmove_erms,
        __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
        __memcpy_chk_avx512_unaligned_2,
        __memcpy_chk_avx512_unaligned_erms,
        __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
        __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
        __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
        __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
        __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
        __memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
        __mempcpy_chk_avx512_unaligned_erms,
        __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
        __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
        __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
        __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
        __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
        __mempcpy_erms.
        * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
        file.
        * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
        Likewise.
        * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
        Likewise.

    (cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
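
Key features 2, 4, 5 and 7 of the memmove commit above can be sketched
in portable C.  This is an illustration under stated assumptions, not
the assembly implementation: a local buffer stands in for the vector
registers, byte loops stand in for the vector-sized copy loops, and the
function name is hypothetical.  The both-ends copy and rep movsb paths
are not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_SIZE 16	/* Assumed: SSE2-sized vector register.  */

void *
memmove_sketch (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  /* Feature 7: nothing to do when destination == source.  */
  if (d == s || n == 0)
    return dst;

  /* Feature 2: for small sizes, load everything before storing
     anything, so no overlap check is needed.  */
  if (n <= 8 * VEC_SIZE)
    {
      unsigned char regs[8 * VEC_SIZE];
      memcpy (regs, s, n);	/* All loads first.  */
      memcpy (d, regs, n);	/* Then all stores.  */
      return dst;
    }

  if ((uintptr_t) d > (uintptr_t) s)
    {
      /* Feature 4: destination above source, copy backward.  */
      while (n > 0)
	{
	  n--;
	  d[n] = s[n];
	}
    }
  else
    {
      /* Feature 5: otherwise copy forward.  */
      for (size_t i = 0; i < n; i++)
	d[i] = s[i];
    }
  return dst;
}
```

Loading all data before storing is what makes the small-size path safe
for overlapping buffers without a direction check.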

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=e1203f48239fbb9832db6ed3a0d2a008e317aff9

commit e1203f48239fbb9832db6ed3a0d2a008e317aff9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 19:22:59 2016 -0700

    Initial Enhanced REP MOVSB/STOSB (ERMS) support

    The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
    has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
    (ERMS) bit to x86 cpu-features.

        * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
        (index_cpu_ERMS): Likewise.
        (reg_ERMS): Likewise.

    (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
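
A minimal sketch of testing the feature bit, assuming ERMS is reported
in bit 9 of EBX from CPUID with EAX == 7 and ECX == 0 (per the Intel
SDM; the commit only says the bit is in CPUID).  The macro and function
names are hypothetical and are not the bit_cpu_ERMS, index_cpu_ERMS and
reg_ERMS definitions added by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the ERMS bit in EBX of CPUID leaf 7,
   subleaf 0.  */
#define ERMS_BIT (1u << 9)

/* Hypothetical helper: true if the given EBX value reports Enhanced
   REP MOVSB/STOSB support.  */
bool
has_erms (uint32_t cpuid7_ebx)
{
  return (cpuid7_ebx & ERMS_BIT) != 0;
}
```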

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3597d65be2a44f063ef12bb907fdad8567aa3e6a

commit 3597d65be2a44f063ef12bb907fdad8567aa3e6a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:15:59 2016 -0700

    Make __memcpy_avx512_no_vzeroupper an alias

    Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
    __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
    to reduce code size of libc.so.

        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        memcpy-avx512-no-vzeroupper.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
        to ...
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
        (MEMCPY): Don't define.
        (MEMCPY_CHK): Likewise.
        (MEMPCPY): Likewise.
        (MEMPCPY_CHK): Likewise.
        (MEMPCPY_CHK): Renamed to ...
        (__mempcpy_chk_avx512_no_vzeroupper): This.
        (MEMPCPY_CHK): Renamed to ...
        (__mempcpy_chk_avx512_no_vzeroupper): This.
        (MEMCPY_CHK): Renamed to ...
        (__memmove_chk_avx512_no_vzeroupper): This.
        (MEMCPY): Renamed to ...
        (__memmove_avx512_no_vzeroupper): This.
        (__memcpy_avx512_no_vzeroupper): New alias.
        (__memcpy_chk_avx512_no_vzeroupper): Likewise.

    (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=9fbaf0f27a11deb98df79d04adee97aebee78d40

commit 9fbaf0f27a11deb98df79d04adee97aebee78d40
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy

    Implement x86-64 multiarch mempcpy in memcpy to share most of the code.
    This reduces the code size of libc.so.

        [BZ #18858]
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
        mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
        and mempcpy-avx512-no-vzeroupper.
        * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
        New.
        (MEMPCPY): Likewise.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
        (MEMPCPY_CHK): New.
        (MEMPCPY): Likewise.
        * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
        (MEMPCPY): Likewise.
        * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
        (MEMPCPY): Likewise.
        * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
        Likewise.
        * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.

    (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5239cb481eea27650173b9b9af22439afdcbf358

commit 5239cb481eea27650173b9b9af22439afdcbf358
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 04:39:48 2016 -0700

    [x86] Add a feature bit: Fast_Unaligned_Copy

    On AMD processors, memcpy optimized with unaligned SSE load is
    slower than memcpy optimized with aligned SSSE3 while other string
    functions are faster with unaligned SSE load.  A feature bit,
    Fast_Unaligned_Copy, is added to select memcpy optimized with
    unaligned SSE load.

        [BZ #19583]
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set
        Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
        processors.  Set Fast_Copy_Backward for AMD Excavator
        processors.
        * sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
        New.
        (index_arch_Fast_Unaligned_Copy): Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
        Fast_Unaligned_Copy instead of Fast_Unaligned_Load.

    (cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a65b3d13e1754d568782e64a762c2c7fab45a55d

commit a65b3d13e1754d568782e64a762c2c7fab45a55d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 08:36:16 2016 -0700

    Don't set %rcx twice before "rep movsb"

        * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
        Don't set %rcx twice before "rep movsb".

    (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4b6d20366aac66070f1cf50552cf2951991a1e5

commit f4b6d20366aac66070f1cf50552cf2951991a1e5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Mar 22 07:46:56 2016 -0700

    Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors

    Since only Intel processors with AVX2 have fast unaligned load, we
    should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.

    Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
    and call get_common_indeces for other processors.

    Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
    GLRO(dl_x86_cpu_features) in cpu-features.c.

        [BZ #19583]
        * sysdeps/x86/cpu-features.c (get_common_indeces): Remove
        inline.  Check family before setting family, model and
        extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
        bits here.
        (init_cpu_features): Replace HAS_CPU_FEATURE and
        HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
        CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
        for Intel processors with usable AVX2.  Call get_common_indeces
        for other processors with family == NULL.
        * sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
        (CPU_FEATURES_ARCH_P): Likewise.
        (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
        (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.

    (cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ca9c5edeea52dc18f42ebbe29b1af352f5555538

commit ca9c5edeea52dc18f42ebbe29b1af352f5555538
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Nov 30 08:53:37 2015 -0800

    Update family and model detection for AMD CPUs

    AMD CPUs use a similar encoding scheme for extended family and model
    as Intel CPUs, as shown in:

    http://support.amd.com/TechDocs/25481.pdf

    This patch updates get_common_indeces to get family and model for both
    Intel and AMD CPUs when family == 0x0f.

        [BZ #19214]
        * sysdeps/x86/cpu-features.c (get_common_indeces): Add an
        argument to return extended model.  Update family and model
        with extended family and model when family == 0x0f.
        (init_cpu_features): Updated.
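
The decoding described above can be sketched as follows, assuming the
standard CPUID leaf 1 EAX layout (model in bits 7:4, family in bits
11:8, extended model in bits 19:16, extended family in bits 27:20).
The struct and function names are hypothetical, and only the
family == 0x0f case from the commit message is handled.

```c
#include <stdint.h>

/* Hypothetical container for the decoded values.  */
struct cpu_id
{
  unsigned int family;
  unsigned int model;
  unsigned int extended_model;
};

/* Decode family and model from the EAX value of CPUID leaf 1.  */
static struct cpu_id
decode_family_model (uint32_t eax)
{
  struct cpu_id id;

  id.family = (eax >> 8) & 0x0f;
  id.model = (eax >> 4) & 0x0f;
  /* Extended model, already positioned in bits 7:4 of the result.  */
  id.extended_model = (eax >> 12) & 0xf0;

  if (id.family == 0x0f)
    {
      /* Add the extended family and fold in the extended model.  */
      id.family += (eax >> 20) & 0xff;
      id.model += id.extended_model;
    }
  return id;
}
```

For example, an EAX value of 0x00600F12 has base family 0x0f and
extended family 0x06, so the sketch yields family 0x15 and model 0x01.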

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c23cdbac4ea473effbef5c50b1217f95595b3460

commit c23cdbac4ea473effbef5c50b1217f95595b3460
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 10 05:26:46 2016 -0800

    Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.h

    index_* and bit_* macros are used to access cpuid and feature arrays of
    struct cpu_features.  It is very easy to use bits and indices of cpuid
    array on feature array, especially in assembly code.  For example,
    sysdeps/i386/i686/multiarch/bcopy.S has

        HAS_CPU_FEATURE (Fast_Rep_String)

    which should be

        HAS_ARCH_FEATURE (Fast_Rep_String)

    We change index_* and bit_* to index_cpu_*/index_arch_* and
    bit_cpu_*/bit_arch_* so that we can catch such errors at build time.

        [BZ #19762]
        * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
        (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
        * sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
        (bit_arch_*): This for feature array.
        (bit_*): Renamed to ...
        (bit_cpu_*): This for cpu array.
        (index_*): Renamed to ...
        (index_arch_*): This for feature array.
        (index_*): Renamed to ...
        (index_cpu_*): This for cpu array.
        [__ASSEMBLER__] (HAS_FEATURE): Add and use field.
        [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE.
        [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE.
        [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
        bit_##name with index_cpu_##name and bit_cpu_##name.
        [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
        bit_##name with index_arch_##name and bit_arch_##name.

    (cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
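
The build-time check described above relies on token pasting: once each
macro only exists under a cpu or arch prefix, using the wrong accessor
expands to an undefined name.  The following is a simplified model, not
glibc's actual definitions.  The struct layout, the bit values, and the
extra cf parameter are assumptions made for illustration.

```c
/* Simplified stand-in for struct cpu_features: one word per array.  */
struct cpu_features_model
{
  unsigned int cpuid[1];    /* models the cpuid array */
  unsigned int feature[1];  /* models the feature array */
};

/* Illustrative values only; each name exists under one prefix.  */
#define index_cpu_ERMS 0
#define bit_cpu_ERMS (1u << 9)
#define index_arch_Fast_Rep_String 0
#define bit_arch_Fast_Rep_String (1u << 0)

/* The accessor picks the prefix, so HAS_CPU_FEATURE (cf, Fast_Rep_String)
   would expand to the undefined index_cpu_Fast_Rep_String and fail to
   compile.  */
#define HAS_CPU_FEATURE(cf, name) \
  (((cf)->cpuid[index_cpu_##name] & bit_cpu_##name) != 0)
#define HAS_ARCH_FEATURE(cf, name) \
  (((cf)->feature[index_arch_##name] & bit_arch_##name) != 0)
```

With these definitions, HAS_ARCH_FEATURE (cf, Fast_Rep_String) compiles
and tests the feature array, while the mistaken HAS_CPU_FEATURE form
from the bcopy.S example is rejected by the compiler.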

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a49c82956f5a42a2cce22c2e97360de1b32301d

commit 4a49c82956f5a42a2cce22c2e97360de1b32301d
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Mar 3 14:51:40 2016 -0800

    Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS

    We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without
    overriding other bits.

        [BZ #19758]
        * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
        (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
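
A minimal sketch of the difference this fix makes: assigning the bit
would clear every other bit in the same word, whereas OR-ing it in
preserves them.  The bit position and the function name below are
hypothetical.

```c
/* Illustrative bit position, not the value used by glibc.  */
#define BIT_PREFER_MAP_32BIT_EXEC (1u << 16)

/* Turn the bit on without disturbing the other bits in the word.  */
static unsigned int
enable_prefer_map_32bit_exec (unsigned int feature_word)
{
  return feature_word | BIT_PREFER_MAP_32BIT_EXEC;
}
```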

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=18f8c0e3cc9ff7b092f02c9b42874a5439347bbc

commit 18f8c0e3cc9ff7b092f02c9b42874a5439347bbc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 6 16:48:11 2016 -0800

    Group AVX512 functions in .text.avx512 section

        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
        Replace .text with .text.avx512.
        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
        Likewise.

    (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c8e297a186f844ebb7eba7a3bc0343c83615ca9

commit 0c8e297a186f844ebb7eba7a3bc0343c83615ca9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 4 08:37:40 2016 -0800

    x86-64: Fix memcpy IFUNC selection

    Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
    Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing
    selection order is replaced with the following selection order:

    1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
    2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
    3. __memcpy_sse2 if SSSE3 isn't available.
    4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
    5. __memcpy_ssse3

        [BZ #18880]
        * sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
        instead of Slow_BSF, and also check for Fast_Copy_Backward to
        enable __memcpy_ssse3_back.

    (cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
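
The selection order listed above can be modelled in C as a chain of
checks.  The real selector is written in assembly in memcpy.S; the
struct, its field names, and the function below are hypothetical and
only mirror the five steps from the commit message.

```c
#include <stdbool.h>

/* Hypothetical flags standing in for the feature bits named above.  */
struct memcpy_traits
{
  bool avx_fast_unaligned_load;
  bool fast_unaligned_load;
  bool ssse3;
  bool fast_copy_backward;
};

/* Return the name of the implementation the listed order would pick.  */
static const char *
select_memcpy (const struct memcpy_traits *t)
{
  if (t->avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (t->fast_unaligned_load)
    return "__memcpy_sse2_unaligned";
  if (!t->ssse3)
    return "__memcpy_sse2";
  if (t->fast_copy_backward)
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}
```

Because the checks run in order, a processor with both SSSE3 and
Fast_Unaligned_Load set gets __memcpy_sse2_unaligned, never one of the
SSSE3 variants.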

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3c772cb4d9cbe19cd97ad991e3dab43014198c44

commit 3c772cb4d9cbe19cd97ad991e3dab43014198c44
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Jan 16 00:49:45 2016 +0300

    Added memcpy/memmove family optimized with AVX512 for KNL hardware.

    Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
    mempcpy_chk, memmove_chk.
    It shows an average improvement of more than 30% over the AVX versions
    on KNL hardware (performance results in the thread
    <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
        * sysdeps/x86_64/multiarch/memmove.c: Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2

commit 7f20e52f1f8d7709140f4bdf828a6bb0f0f08af2
Author: Andrew Senkevich <andrew.senkevich@intel.com>
Date:   Sat Dec 19 02:47:28 2015 +0300

    Added memset optimized with AVX512 for KNL hardware.

    It shows an improvement of up to 28% over AVX2 memset (performance results
    attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

        * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
        * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
        * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
        * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
        * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
        index_Prefer_No_VZEROUPPER): New.
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
        Prefer_No_VZEROUPPER for Knights Landing.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d530cd5463701a59ed923d53a97d3b534fdfea8a

commit d530cd5463701a59ed923d53a97d3b534fdfea8a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 21 14:44:23 2015 -0700

    Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT

    According to Silvermont software optimization guide, for 64-bit
    applications, branch prediction performance can be negatively impacted
    when the target of a branch is more than 4GB away from the branch.  Add
    the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
    pages with MAP_32BIT first.  NB: MAP_32BIT maps to the lower 2GB, not
    the lower 4GB, of the address space.  Prefer_MAP_32BIT_EXEC reduces the
    bits available for address space layout randomization (ASLR), so it is
    always disabled for SUID programs and can only be enabled by setting
    the environment variable LD_PREFER_MAP_32BIT_EXEC.

    On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.

        [BZ #19367]
        * sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
        * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
        * sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
        * sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
        (index_Prefer_MAP_32BIT_EXEC): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=fe24aedc3530037d7bb614b84d309e6b816686bf

commit fe24aedc3530037d7bb614b84d309e6b816686bf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Dec 15 11:46:54 2015 -0800

    Enable Silvermont optimizations for Knights Landing

    Knights Landing processor is based on Silvermont.  This patch enables
    Silvermont optimizations for Knights Landing.

        * sysdeps/x86/cpu-features.c (init_cpu_features): Enable
        Silvermont optimizations for Knights Landing.

-----------------------------------------------------------------------

-- 
You are receiving this mail because:
You are on the CC list for the bug.
