This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.
Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
- From: Janne Blomqvist <blomqvist dot janne at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Richard Henderson <rth at twiddle dot net>, Richard Biener <richard dot guenther at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, GCC Development <gcc at gcc dot gnu dot org>, Binutils <binutils at sourceware dot org>, "Girkar, Milind" <milind dot girkar at intel dot com>, "Kreitzer, David L" <david dot l dot kreitzer at intel dot com>
- Date: Thu, 25 Jul 2013 15:17:43 +0300
- Subject: Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
- References: <CAMe9rOrvMxSLj3LcYBs71tVdw6C0vJFKD2HxvnoHc13UamftwA at mail dot gmail dot com> <ddab98c2-bb3b-4d02-b403-e7d5690cfe00 at email dot android dot com> <51F01C0A dot 5050101 at twiddle dot net> <20130724185233 dot GA12562 at domone dot kolej dot mff dot cuni dot cz>
On Wed, Jul 24, 2013 at 9:52 PM, Ondřej Bílka <firstname.lastname@example.org> wrote:
> On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
>> On 07/24/2013 05:23 AM, Richard Biener wrote:
>> > "H.J. Lu" <email@example.com> wrote:
>> >> Hi,
>> >> Here is a patch to extend x86-64 psABI to support AVX-512:
>> > AFAIK AVX-512 doubles the number of xmm registers. Can we get them callee saved please?
>> Having them callee saved presupposes that one knows the width of the register.
>> There's room in the instruction set for avx1024. Does anyone believe that is
>> not going to appear in the next few years?
> It would be a mistake for Intel to focus on AVX-1024. You hit diminishing
> returns, and only a few workloads would utilize loading 128 bytes at once.
> The problem with vectorization is that it becomes memory bound, so you will
> not gain much because performance is dominated by cache throughput.
> You would get bigger speedup from more effective pipelining, more
ISTR that one of the main reasons "long" vector ISAs did so well on
some workloads was not that the vector length was big per se, but
rather that the scatter/gather instructions these ISAs typically have
allowed them to extract much more parallelism from the memory
subsystem. The typical example is sparse matrix style problems, but
I suppose other kinds of problems with indirect accesses could benefit
as well. Deeper OoO buffers would in principle allow the same memory-level
parallelism extraction, but their power and silicon area costs
apparently scale quite steeply (O(n**2) or maybe even O(n**3)),
making really deep buffers impractical.
And, IIRC, scatter/gather instructions are featured as of some
recent-ish AVX-something version. That being said, maybe current
cache-based memory subsystems are different enough from the vector
supercomputers of yore that the above doesn't hold to the same extent.