This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



[PATCH 0/1] ARM: NEON optimized implementation of memcpy.


Hi,

This is an attempt to make an ARM NEON optimized memcpy function for glibc.
It should be well tuned for copying really large blocks. Handling of small
blocks could probably be tweaked a bit, though the current setup seems
rather balanced overall.

A tarball with the test/benchmark program is attached. The inline patch will
follow in the next email.

Feedback is welcome. I will be glad to provide replies with more details
and/or modify the patch to make it better.

The following benchmark was run on a BeagleBoard rev B7 (beagleboard.org)
with an ARM Cortex-A8 core running at 500 MHz, with the NEON errata
workaround enabled (L1NEON bit) and the framebuffer disabled to save
memory bandwidth:

--- Running correctness tests (use '-benchonly' option to skip) ---
all the correctness tests passed

--- Running benchmarks (average case/perfect alignment case) ---

very small data test:
memcpy_neon :  (3 bytes copy) =   78.0 MB/s /   80.2 MB/s
memcpy_arm  :  (3 bytes copy) =   70.7 MB/s /   72.4 MB/s
memcpy_neon :  (4 bytes copy) =   79.2 MB/s /   80.9 MB/s
memcpy_arm  :  (4 bytes copy) =   54.5 MB/s /   59.3 MB/s
memcpy_neon :  (5 bytes copy) =   99.0 MB/s /  101.2 MB/s
memcpy_arm  :  (5 bytes copy) =   61.3 MB/s /   74.1 MB/s
memcpy_neon :  (7 bytes copy) =  112.0 MB/s /  113.9 MB/s
memcpy_arm  :  (7 bytes copy) =   73.9 MB/s /  103.7 MB/s
memcpy_neon :  (8 bytes copy) =  107.0 MB/s /  108.8 MB/s
memcpy_arm  :  (8 bytes copy) =   84.3 MB/s /  107.4 MB/s
memcpy_neon :  (11 bytes copy) =  127.0 MB/s /  128.6 MB/s
memcpy_arm  :  (11 bytes copy) =  100.7 MB/s /  147.7 MB/s
memcpy_neon :  (12 bytes copy) =  121.6 MB/s /  123.0 MB/s
memcpy_arm  :  (12 bytes copy) =  111.2 MB/s /  145.4 MB/s
memcpy_neon :  (15 bytes copy) =  135.4 MB/s /  136.8 MB/s
memcpy_arm  :  (15 bytes copy) =  125.0 MB/s /  181.8 MB/s
memcpy_neon :  (16 bytes copy) =  126.8 MB/s /  220.9 MB/s
memcpy_arm  :  (16 bytes copy) =  134.0 MB/s /  176.8 MB/s
memcpy_neon :  (24 bytes copy) =  190.5 MB/s /  340.8 MB/s
memcpy_arm  :  (24 bytes copy) =  168.1 MB/s /  225.0 MB/s
memcpy_neon :  (31 bytes copy) =  249.5 MB/s /  416.3 MB/s
memcpy_arm  :  (31 bytes copy) =  194.9 MB/s /  270.4 MB/s

L1 cached data:
memcpy_neon :  (4096 bytes copy) = 1846.7 MB/s / 1956.7 MB/s
memcpy_arm  :  (4096 bytes copy) =  831.8 MB/s / 1398.5 MB/s
memcpy_neon :  (6144 bytes copy) = 1890.4 MB/s / 1965.5 MB/s
memcpy_arm  :  (6144 bytes copy) =  837.5 MB/s / 1403.1 MB/s

L2 cached data:
memcpy_neon :  (65536 bytes copy) =  669.8 MB/s /  743.7 MB/s
memcpy_arm  :  (65536 bytes copy) =  576.0 MB/s /  541.3 MB/s
memcpy_neon :  (98304 bytes copy) =  660.3 MB/s /  730.9 MB/s
memcpy_arm  :  (98304 bytes copy) =  562.1 MB/s /  536.4 MB/s

SDRAM:
memcpy_neon :  (2097152 bytes copy) =  368.8 MB/s /  368.3 MB/s
memcpy_arm  :  (2097152 bytes copy) =  214.4 MB/s /  240.3 MB/s
memcpy_neon :  (3145728 bytes copy) =  372.0 MB/s /  371.4 MB/s
memcpy_arm  :  (3145728 bytes copy) =  215.3 MB/s /  244.3 MB/s

(*) 1 MB = 1000000 bytes
(*) 'memcpy_arm' is the implementation for older ARM cores from glibc-ports
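
For readers who want to relate the MB/s figures to raw timings, here is a
minimal sketch of the arithmetic implied by the "1 MB = 1000000 bytes"
footnote. It is not taken from the attached benchmark program; the function
names throughput_mb_per_s and mb_to_mib_per_s are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: throughput in decimal megabytes per second,
 * using 1 MB = 1000000 bytes as stated in the footnote.
 * block_bytes: size of one copy, copies: number of copies timed,
 * seconds: total elapsed wall-clock time for all copies. */
double throughput_mb_per_s(size_t block_bytes, unsigned long copies,
                           double seconds)
{
    assert(seconds > 0.0);
    return (double)block_bytes * (double)copies / seconds / 1000000.0;
}

/* Hypothetical helper: convert a decimal MB/s figure to binary MiB/s
 * (1 MiB = 1048576 bytes), for comparison with tools that report MiB/s. */
double mb_to_mib_per_s(double mb_per_s)
{
    return mb_per_s * 1000000.0 / 1048576.0;
}
```

As an illustration with made-up timings, 1000 copies of a 4096-byte block
completed in 0.002 seconds would be reported as about 2048 MB/s. Because the
unit is decimal, the same figure expressed in MiB/s is roughly 4.6% lower.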

-- 
Best regards,
Siarhei Siamashka

Attachment: memcpy-neon.tar.gz
Description: application/tgz
