This is the mail archive of the
mailing list for the binutils project.
Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Janne Blomqvist <blomqvist dot janne at gmail dot com>
- Cc: Richard Henderson <rth at twiddle dot net>, Richard Biener <richard dot guenther at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, GCC Development <gcc at gcc dot gnu dot org>, Binutils <binutils at sourceware dot org>, "Girkar, Milind" <milind dot girkar at intel dot com>, "Kreitzer, David L" <david dot l dot kreitzer at intel dot com>
- Date: Thu, 25 Jul 2013 14:47:07 +0200
- Subject: Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
- References: <CAMe9rOrvMxSLj3LcYBs71tVdw6C0vJFKD2HxvnoHc13UamftwA at mail dot gmail dot com> <ddab98c2-bb3b-4d02-b403-e7d5690cfe00 at email dot android dot com> <51F01C0A dot 5050101 at twiddle dot net> <20130724185233 dot GA12562 at domone dot kolej dot mff dot cuni dot cz> <CAO9iq9FS-qiytN5fPNRr3WjZRwsR5q3Me_k4AVkRBnRMwSYsHg at mail dot gmail dot com>
On Thu, Jul 25, 2013 at 03:17:43PM +0300, Janne Blomqvist wrote:
> On Wed, Jul 24, 2013 at 9:52 PM, Ondřej Bílka <email@example.com> wrote:
> > On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> >> On 07/24/2013 05:23 AM, Richard Biener wrote:
> >> > "H.J. Lu" <firstname.lastname@example.org> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> >> >
> >> > AFAIK AVX-512 doubles the number of xmm registers. Can we get them callee-saved, please?
> >> Having them callee saved pre-supposes that one knows the width of the register.
> >> There's room in the instruction set for avx1024. Does anyone believe that is
> >> not going to appear in the next few years?
> > It would be a mistake for Intel to focus on avx1024. You hit diminishing
> > returns, and only a few workloads would benefit from loading 128 bytes at once.
> > The problem with vectorization is that it becomes memory bound, so you will
> > not gain much because performance is dominated by cache throughput.
> > You would get a bigger speedup from more effective pipelining, more
> > fusion...
> ISTR that one of the main reasons "long" vector ISAs did so well on
> some workloads was not that the vector length was big, per se, but
> rather that the scatter/gather instructions these ISAs typically have
> allowed them to extract much more parallelism from the memory
> subsystem. The typical example being sparse matrix style problems, but
> I suppose other types of problems with indirect accesses could benefit
> as well. Deeper OoO buffers would in principle allow the same memory
> level parallelism extraction, but those apparently have quite steep
> power and silicon area cost scaling (O(n**2) or maybe even O(n**3)),
> making really deep buffers impractical.
> And, IIRC scatter/gather instructions are featured as of some
> recent-ish AVX-something version. That being said, maybe current
> cache-based memory subsystems are different enough from the vector
> supercomputers of yore that the above doesn't hold to the same extent
This also depends on how many details Intel got right. One example is
the pmovmskb instruction. It is trivial to implement in silicon and gives
an advantage over other architectures.
When the problem is 'find elements in an array that satisfy some expression',
then without pmovmskb or an equivalent, finding which elements matched is relatively expensive.
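
To illustrate the point about the byte-mask instruction, here is a minimal sketch, assuming SSE2 and a GCC/Clang-style compiler (for `__builtin_ctz`). The helper name `find_byte` is hypothetical and not taken from any library; `_mm_movemask_epi8` is the intrinsic that compiles to pmovmskb.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: index of the first byte equal to `needle`
   in buf[0..len), or len if there is none. */
size_t find_byte(const unsigned char *buf, size_t len, unsigned char needle)
{
    assert(buf != NULL || len == 0);
    const __m128i pattern = _mm_set1_epi8((char)needle);
    size_t i = 0;

    /* Process 16 bytes per iteration while a full chunk remains. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, pattern);
        /* pmovmskb: collect the top bit of each byte into a 16-bit mask. */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask);  /* lowest set bit = first match */
    }
    /* Scalar tail for the remaining (< 16) bytes. */
    for (; i < len; i++)
        if (buf[i] == needle)
            return i;
    return len;
}
```

The whole comparison result is turned into an integer with one instruction, so locating the first match is a single bit scan rather than a per-lane test.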
One problem is that, depending on the profile, you may spend the majority of the time
on small sizes. So you need efficient branches for these sizes
(gcc does not handle that well yet). That in turn
increases icache pressure.
Another problem is that you could often benefit from vector
instructions if you could read or write more memory than the data occupies. Reading can be done
inexpensively by checking whether the access crosses a page boundary; writing is the problem,
so we take a suboptimal path just to write only the data that changed.
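
The page-crossing check for reads can be sketched as follows. This is only an illustration under stated assumptions: the 4 KiB page size and the names `ASSUMED_PAGE_SIZE` and `load_stays_in_page` are mine. Note also that reading past the end of an object is undefined behavior in ISO C even when it cannot fault, so in practice this trick belongs in assembly or carefully controlled low-level code.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical helper: nonzero if reading `width` bytes starting at p
   stays inside the page that contains p, so the load cannot touch an
   unmapped neighboring page. */
int load_stays_in_page(const void *p, unsigned width)
{
    assert(width >= 1 && width <= ASSUMED_PAGE_SIZE);
    uintptr_t offset = (uintptr_t)p & (ASSUMED_PAGE_SIZE - 1);
    return offset <= ASSUMED_PAGE_SIZE - width;
}
```

When the check succeeds, a full-width load is safe at the hardware level and the unwanted bytes can be masked off afterwards; when it fails, the code falls back to a slower path. No equally cheap check exists for stores, since writing extra bytes changes memory that may belong to someone else.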
This could also be solved in hardware if a masked move instruction
were guaranteed to access only the memory locations that changed, and thus avoid
possible race conditions in the unchanged parts.
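
The semantics I have in mind can be modeled in scalar C as below. This is a hypothetical model only (`masked_store16` is an illustrative name, not a real intrinsic), and it does not model atomicity or performance. It only shows which bytes are allowed to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a 16-byte masked store: byte i of src is
   written to dst only if bit i of mask is set. Bytes whose mask bit is
   clear are never written, so a concurrent writer to them is not disturbed. */
void masked_store16(unsigned char *dst, const unsigned char *src, uint16_t mask)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < 16; i++)
        if (mask & (1u << i))
            dst[i] = src[i];
}
```

By contrast, emulating this with a full 16-byte load, blend, and store rewrites the unchanged bytes with their old values, which is where the race with another thread comes from.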
> Janne Blomqvist