This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH][AArch64] Optimized memset
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: 'GNU C Library' <libc-alpha at sourceware dot org>
- Date: Tue, 11 Aug 2015 14:23:48 +0200
- Subject: Re: [PATCH][AArch64] Optimized memset
- Authentication-results: sourceware.org; auth=none
- References: <004c01d0cba1$e15ac5a0$a41050e0$ at com>
On Fri, Jul 31, 2015 at 04:02:12PM +0100, Wilco Dijkstra wrote:
> This is an optimized memset for AArch64. Memset is split into 4 main cases: small sets of up to 16
> bytes, medium of 16..96 bytes which are fully unrolled. Large memsets of more than 96 bytes align
> the destination and use an unrolled loop processing 64 bytes per iteration. Memsets of zero of more
> than 256 use the dc zva instruction, and there are faster versions for the common ZVA sizes 64 or
> 128. STP of Q registers is used to reduce codesize without loss of performance.
> Speedup on test-memset is 1% on Cortex-A57 and 8% on Cortex-A53. On a random test with varying sizes
> and alignment the new version is 50% faster.
> OK for commit?
The strategy for smaller sizes is quite similar to the one on x64. Could you
comment on why you chose this control flow? It isn't clear where full
unrolling should stop; I recall that with some gcc the majority of calls
had size 192, so unrolling up to 256 bytes obviously gave a speedup.
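
For reference, the size-class control flow from the quoted description can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names memset_sketch and store16 are hypothetical, the byte loop for the small case is a placeholder, the exact boundary handling at 16 bytes is a guess, and the dc zva path for large zero fills is left out. It is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 16-byte register store.  */
static void
store16 (unsigned char *p, const unsigned char *v)
{
  memcpy (p, v, 16);
}

/* Illustrative only: thresholds come from the quoted description,
   everything else is an assumption.  */
static void *
memset_sketch (void *dst, int c, size_t n)
{
  unsigned char *d = dst, *end = d + n;
  unsigned char v[16];
  for (int i = 0; i < 16; i++)
    v[i] = (unsigned char) c;

  if (n <= 16)
    {
      /* Small: placeholder byte loop.  */
      for (size_t i = 0; i < n; i++)
        d[i] = v[0];
      return dst;
    }

  if (n <= 96)
    {
      /* Medium: fully unrolled, overlapping stores from both ends.  */
      store16 (d, v);
      store16 (end - 16, v);
      if (n > 32)
        {
          store16 (d + 16, v);
          store16 (end - 32, v);
        }
      if (n > 64)
        {
          store16 (d + 32, v);
          store16 (end - 48, v);
        }
      return dst;
    }

  /* Large: one unaligned head store, then 16-byte aligned stores,
     64 bytes per iteration, then an overlapping 64-byte tail.  */
  store16 (d, v);
  unsigned char *p = (unsigned char *) (((uintptr_t) d + 16) & ~(uintptr_t) 15);
  while (end - p > 64)
    {
      store16 (p, v);
      store16 (p + 16, v);
      store16 (p + 32, v);
      store16 (p + 48, v);
      p += 64;
    }
  store16 (end - 64, v);
  store16 (end - 48, v);
  store16 (end - 32, v);
  store16 (end - 16, v);
  return dst;
}
```

Moving the medium/large threshold (96 here) is the knob the question above is about: a workload dominated by 192-byte calls would never reach the unrolled path with this split.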
I also have some ideas for handling the small case with conditional moves or
masked moves. As aarch64 doesn't have a conditional move, only select,
would it be possible to handle the small case with:
address4 = (size & 4) ? address : stack;
*((int32_t *) address4) = vc;
address2 = (size & 2) ? address + size - 2 : stack;
*((int16_t *) address2) = vc;
address1 = (size & 1) ? address + (size & 4) : stack;
*((char *) address1) = vc;
I haven't tested whether it is an improvement, but it looks likely.
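
A self-contained C version of the snippet above, as a minimal sketch for sizes below 8. The function name small_memset and the local scratch buffer (standing in for "stack") are assumptions made for illustration, and memcpy is used for the stores so that unaligned accesses stay well defined in C. Whether a compiler turns the ternaries into csel rather than branches is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: set size < 8 bytes without branching on size.
   A store whose size bit is clear is redirected to a scratch buffer.  */
static void
small_memset (unsigned char *address, int c, size_t size)
{
  unsigned char scratch[4];                      /* dummy store target */
  uint32_t vc = 0x01010101u * (unsigned char) c; /* byte replicated 4 times */
  uint16_t vc2 = (uint16_t) vc;
  unsigned char vc1 = (unsigned char) vc;

  /* Bit 2 set: bytes 0..3.  */
  unsigned char *address4 = (size & 4) ? address : scratch;
  memcpy (address4, &vc, 4);

  /* Bit 1 set: the last two bytes.  */
  unsigned char *address2 = (size & 2) ? address + size - 2 : scratch;
  memcpy (address2, &vc2, 2);

  /* Bit 0 set: byte 4 if the 4-byte store happened, else byte 0.  */
  unsigned char *address1 = (size & 1) ? address + (size & 4) : scratch;
  memcpy (address1, &vc1, 1);
}
```

For size 7, for example, the three stores cover bytes 0..3, 5..6 and 4; for size 3 they cover bytes 1..2 and 0, with the 4-byte store going to the scratch buffer.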
The real performance impact of this is tricky to judge, as it depends heavily
on what the caller does. The only definitive way is to take programs that use
it (like gcc) and run an overnight test to see whether you get a 1%
improvement in total running time or not.
Here I would also be interested how this will be improved on dryrun