This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH][AArch64] Enable _STRING_ARCH_unaligned
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: 'Andrew Pinski' <pinskia at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 27 Aug 2015 08:49:45 +0200
- Subject: Re: [PATCH][AArch64] Enable _STRING_ARCH_unaligned
- Authentication-results: sourceware.org; auth=none
- References: <000101d0db53$e96233c0$bc269b40$ at com> <CA+=Sn1mEcUHtP+1JkOy+7JU6LvbDrbZDLuSEcD89GQF2OpuKDQ at mail dot gmail dot com> <000301d0db65$5c7648e0$1562daa0$ at com>
On Thu, Aug 20, 2015 at 05:29:18PM +0100, Wilco Dijkstra wrote:
> > Andrew Pinski wrote:
> > On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <firstname.lastname@example.org> wrote:
> > > +
> > > +/* AArch64 implementations support efficient unaligned access. */
> > > +#define _STRING_ARCH_unaligned 1
> > I don't think this is 100% true. On ThunderX, an unaligned store or
> > load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> > load/stores have to be replayed.
> > I think we should also benchmark there to find out if this is a win
> > because I doubt it is a win but I could be proved wrong.
> That's bad indeed, but it would still be better than doing everything
> one byte at a time. Eg. resolv/arpa/nameser.h does:
There are two questions here: first, whether unaligned loads are possible,
and second, whether they are fast. If they are not fast, you end up
emulating them with aligned loads and shifts. Even then you should still
use unaligned loads in the header of a function, because latency matters
there and the chain of operations that builds an unaligned vector has
high latency.
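
To make the two options concrete, here is a minimal sketch, not taken from
glibc. The helper names load32_unaligned and load32_emulated are
hypothetical, and the byte-order check assumes the GCC/Clang
__BYTE_ORDER__ macros. The first helper lets the compiler emit a single
unaligned load where the target supports it; the second builds the same
value from two aligned loads and shifts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 32 bits from any address.  Compilers
   typically turn this memcpy into one load instruction on targets
   that allow unaligned access.  */
static uint32_t load32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: the same read emulated with aligned loads.
   `words` is an aligned array and `byte_off` a byte offset into it.
   When byte_off is not a multiple of 4, words[byte_off / 4 + 1] must
   also be readable.  */
static uint32_t load32_emulated(const uint32_t *words, size_t byte_off)
{
    size_t idx = byte_off / 4;
    unsigned shift = (unsigned)(byte_off % 4) * 8;
    uint32_t first = words[idx];

    if (shift == 0)
        return first;               /* already aligned, one load */

    uint32_t second = words[idx + 1];
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return (first << shift) | (second >> (32 - shift));
#else
    /* Little-endian assumed when the macro is missing.  */
    return (first >> shift) | (second << (32 - shift));
#endif
}
```

The emulated version needs two loads, two shifts and an OR that all depend
on each other, which is the latency chain mentioned above.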
> > Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> > so I can do the benchmarking on ThunderX?
> I don't believe there are.
There is also the matter that future optimizations could rely on this, so
it's better to stay consistent.
> > Also I don't see any benchmark results even for any of the other
> > AARCH64 processors.
> It's obvious it is a big win for most of the uses of _STRING_ARCH_unaligned.
> Eg. consider the encryption code in crypt/md5.c:
> #if !_STRING_ARCH_unaligned
>   if (UNALIGNED_P (buffer))
>     while (len > 64)
>       {
>         __md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
>         buffer = (const char *) buffer + 64;
>         len -= 64;
>       }
> So basically you end up doing an extra memcpy if unaligned access is not
> supported. This means you'll not only do the unaligned loads anyway, but
> you'll also do an extra aligned load and store to the buffer.
> GLIBC's use of _STRING_ARCH_unaligned is quite messy and would benefit from
> a major cleanup; however, it's quite clear that enabling this is a win overall.
Correct, I see several other optimization opportunities that would
benefit from it.
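
For reference, the md5.c pattern quoted above can be sketched in
isolation. This is a simplified illustration under stated assumptions, not
the glibc code: process_block is a hypothetical stand-in for
__md5_process_block that just folds words into a checksum, and
digest_with_copy and digest_direct are made-up names for the two paths.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64

/* Hypothetical stand-in for __md5_process_block: folds the sixteen
   32-bit words of one block into `state`.  It loads through memcpy,
   so it is valid for any alignment of `block`.  */
static uint32_t process_block(const void *block, uint32_t state)
{
    const unsigned char *p = block;
    for (size_t i = 0; i < BLOCK; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);
        state = (state << 5 | state >> 27) ^ w;
    }
    return state;
}

/* Path used when unaligned access is not assumed: every block is first
   copied into an aligned bounce buffer, then processed.  */
static uint32_t digest_with_copy(const void *buffer, size_t len,
                                 uint32_t state)
{
    uint32_t aligned[BLOCK / 4];
    while (len >= BLOCK) {
        state = process_block(memcpy(aligned, buffer, BLOCK), state);
        buffer = (const char *)buffer + BLOCK;
        len -= BLOCK;
    }
    return state;
}

/* Path used when unaligned access is assumed: blocks are processed in
   place, with no intermediate copy.  */
static uint32_t digest_direct(const void *buffer, size_t len,
                              uint32_t state)
{
    const unsigned char *p = buffer;
    size_t full = len & ~(size_t)(BLOCK - 1);
    for (size_t i = 0; i < full; i += BLOCK)
        state = process_block(p + i, state);
    return state;
}
```

Both paths produce the same result; the first one reads every input byte
once for the copy and then again from the bounce buffer, which is the
extra load and store per word described in the quoted text.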