This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Scheduling x86 dispatch windows

From: "H.J. Lu" <hjl dot tools at gmail dot com>
To: Quentin Neill <quentin dot neill dot gnu at gmail dot com>
Cc: Jeff Law <law at redhat dot com>, binutils at sourceware dot org, gcc at gcc dot gnu dot org
Date: Thu, 10 Jun 2010 15:23:10 -0700
Subject: Re: Scheduling x86 dispatch windows
References: <723761.44710.qm@web33003.mail.mud.yahoo.com> <AANLkTimpR9VuPbpwc4tRFZndbJCf8FMWRrt_P68mgaBb@mail.gmail.com> <AANLkTilw4HSc4oMj56JtzK61oJV2kf2I9CyvH6Ti-GX0@mail.gmail.com> <4C1144F7.1040708@redhat.com> <AANLkTikaGR08gDY4964XVmE4t_WLcza1eiAdnQJEszKC@mail.gmail.com> <AANLkTikwebdip30yrPN6AZabLx9yd8QVCcOoJDESRwbD@mail.gmail.com> <AANLkTik1eMgCXq8J_Vo_mWTdwx0pZo2mnG2dCtG1jsh8@mail.gmail.com>

On Thu, Jun 10, 2010 at 3:09 PM, Quentin Neill
<quentin.neill.gnu@gmail.com> wrote:
> On Thu, Jun 10, 2010 at 4:08 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Thu, Jun 10, 2010 at 1:59 PM, Quentin Neill
>> <quentin.neill.gnu@gmail.com> wrote:
>>> On Thu, Jun 10, 2010 at 3:03 PM, Jeff Law <law@redhat.com> wrote:
>>>> On 06/10/10 13:52, H.J. Lu wrote:
>>>>> On Thu, Jun 10, 2010 at 11:05 AM, Quentin Neill
>>>>> <quentin.neill.gnu@gmail.com> ?wrote:
>>>>>> Cross-posting Reza's call for feedback to the binutils list since it
>>>>>> is relevant - s ee the last few paragraphs regarding how to
>>>>>> "solve the alignment problem".
>>>>>>
>>>>>> Original thread: http://gcc.gnu.org/ml/gcc/2010-06/threads.html#00402
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 12:20 PM, reza yazdani<yazdani_reza@yahoo.com>
>>>>>> ?wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are in the process of adding a feature to GCC to take advantage
>>>>>>> of a new hardware feature in the latest AMD micro processor. This
>>>>>>> feature requires a certain mix, ordering and alignments in
>>>>>>> instruction sequences to obtain the expected hardware performance.
>>>>>>>
>>>>>>> I am asking the community to review this high level implementation
>>>>>>> design and give me direction or advice.
>>>>>>>
>>>>>>> The new hardware issues two windows of the size N bytes of
>>>>>>> instructions in every cycle. It goes into accelerate mode if the
>>>>>>> windows have the right combination of instructions or alignments. Our
>>>>>>> goal is to maximize the IPC by proper instruction scheduling and
>>>>>>> alignments.
>>>>>>>
>>>>>>> Here is a summary of the most important requirements:
>>>>>>>
>>>>>>> a) Maximum of N instructions per window.
>>>>>>> b) An instruction may cross the first window.
>>>>>>> c) Each window can have maximum of x memory loads and y memory
>>>>>>> ? ?stores .
>>>>>>> d) The total number of immediate constants in the instructions
>>>>>>> ? ?of a window should not exceed k.
>>>>>>> e) The first window must be aligned on 16 byte boundary.
>>>>>>> f) A Window set terminates when a branch exists in a window.
>>>>>>> g) The number of allowed prefixes varies for instructions.
>>>>>>> h) A window set needs to be padded by prefixes in instructions
>>>>>>> ? ?or terminated by nops to ensure adherence to the rules.
>>>>>>>
>>>>>>> We have the following implementation plan for GCC:
>>>>>>>
>>>>>>> 1) Modify the Haifa scheduler to make the desired arrangement of
>>>>>>> ? ?instructions for the two dispatch windows. The scheduler is called
>>>>>>> ? ?once before and once after register allocation as usual. In both
>>>>>>> ? ?cases it performs dispatch scheduling along with its normal job of
>>>>>>> ? ?instruction scheduling.
>>>>>>>
>>>>>>> The advantage of doing it before register allocation is avoiding
>>>>>>> extra dependencies caused by register allocation which may become
>>>>>>> an obstacle to movement of instructions. ?The advantage of doing
>>>>>>> it after register allocation is a consideration for spilling code
>>>>>>> which may be generated by the register allocator.
>>>>>>>
>>>>>>> The algorithm we use is:
>>>>>>>
>>>>>>> a) Considering the current dispatch window set, choose the first
>>>>>>> ? ?instruction from ready queue that does not violate dispatch rules.
>>>>>>> b) When an instruction is selected and scheduled, inform the
>>>>>>> ? ?dispatcher code about the instruction. This step keeps track of the
>>>>>>> ? ?instruction content of windows for future evaluation. It also manages
>>>>>>> ? ?the window set by closing and opening new virtual dispatch windows.
>>>>>>>
>>>>>>> 2) Insertion of alignment code.
>>>>>>>
>>>>>>> In x86 alignment is done by inserting prefixes or by generating
>>>>>>> nops. As the object code is generated by the assembler in GCC, some
>>>>>>> information such as sizes of branches are unknown until assembly or
>>>>>>> link time. To do alignments related to dispatch correctly in GCC,
>>>>>>> we need to iteratively compute prefixes and branch sizes until
>>>>>>> its convergence. This pass currently does not exist in GCC, but it
>>>>>>> exists in the assembler.
>>>>>>>
>>>>>>> There are two possible approaches to solve alignment problem.
>>>>>>>
>>>>>>> a) ?Let the assembler performs the alignments and padding needed
>>>>>>> ? ? to adhere with the new machine dispatching rules and avoid an extra
>>>>>>> ? ? pass in GCC.
>>>>>>> b) ?Add a new pass to mimic what assembler does before generating
>>>>>>> ? ? the assembly listing in GCC and insert the required alignments.
>>>>>>>
>>>>>>> I appreciate your comments on the proposed implementation procedure
>>>>>>> and the choices a or b above.
>>>>>>>
>>>>>
>>>>> I don't this should be done in assembler. Assembler should just assemble
>>>>> the assembly input.
>>>>
>>>> That adds quite a bit of complication to the compiler though -- getting the
>>>> instruction lengths right (and thus proper packing & alignment) can be
>>>> extremely difficult. ?I did some experiments with this on a target with
>>>> *fixed* instruction lengths a while back and even though the port tried hard
>>>> to get lengths right, it would routinely miss something. ?Ultimately I
>>>> decided that it forcing the compiler to know instruction lengths with a very
>>>> high degree of accuracy wasn't a sane thing to do. ? ?Dealing with variable
>>>> instruction lengths just adds yet another complexity to the situation. ?Then
>>>> add the complication of needing to add specific prefixes or nops and it just
>>>> gets downright ugly.
>>>>
>>>> I'd probably approach this by having the compiler emit a directive which
>>>> states what the desired alignment at a particular point should be, then
>>>> allow the assembler to select the best method to get the desired alignment.
>>>
>>> Jeff,
>>>
>>> This is exactly part of our binutils side of the proposal, which I'll
>>> outline now
>>>
>>> 1. Allow multiple prefixes for ADDR and DS (and possibly others)
>>> a) multiple prefixes are benign in certain modes and are thus chosen for padding
>>> b) although ".byte" works, the "ds" and "addr" prefix mnemonics are
>>> more explicit (and they don't trigger a call to
>>> md_flush_pending_output)
>>>
>>> 2. Add new pseudo-op to delineate alignment boundaries. ?This is
>>> needed to signal any dispatch engine (below) to pad. ?Here are my top
>>> two candidates, any feedback is appreciated:
>>> a) ".flush" new psuedo op plumbed directly to "md_flush_pending_output()"
>>> b) ".padalign" which calla a new "md_pad_align()"
>>>
>>> 3. Add dispatch optimization infrastructure which
>>> a) is guarded by -mtune flag (and possibly other -f style flags)
>>> b) tracks assembled instruction attributes and their fragments
>>> c) can pad (insert benign prefixes) into previously assembled fragments
>>> d) maintains dispatch engine state (according to some subset of Reza's rules)
>>>
>>> Discussion:
>>>
>>> The flags in 3a) should guard against these changes affecting current behavior.
>>>
>>> The assembly tracking in 3b) is for bookkeeping only; the padding in
>>> 3c) would only occur when a compiler uses the pseudo-op in 2) or when
>>> the dispatch engine in 3d) signals.
>>>
>>> For compilers that know exactly how to pad for the new processor, the
>>> ability to
>>> pad explicitly using 1), 2), and .align/.balign/.p2align should be enough.
>>>
>>> For assembly programs and/or compilers that don't choose to do any
>>> dispatch optimization, it's anticipated that the engine in 3d) would
>>> be useful for optimizing for -mtune=bdver1
>>>
>>> I'll post patches for these soon.
>>
>> Can you do it with directives only?
>
> In theory, if the compiler knows all sizes and offsets, yes (given
> some way to add multiple prefixes).
>
> However in practice, no.
>
> To get ?GCC to know all would require replicating most assembler
> functionality in ?GCC, including parsing, assembling, and sizing
> (parts of output_insn() and its child output_*() functions). ?We
> considered exposing one-line assembly as a library but you have to
> provide (or reuse) the segment/frchain/fragment context, and I don't
> think introducing a GCC->binutils dependency (other than runtime)
> would be easy to introduce into the community.
>
> This wouldn't cover the assembly language case either.
>
> And remember, even if you have all the directives (and the
> programmer/compiler knows all), the assembler must remember potential
> padding locations until the decision (and knowledge about how) to pad
> arrives.
>

x86 assembler isn't an optimizing assembler. -mtune only does
instruction selection.  What you are proposing sounds like an optimizing
assembler to me. Are we going to support scheduling, macro, ...?


-- 
H.J.

Follow-Ups:
- Re: Scheduling x86 dispatch windows
  - From: Quentin Neill

References:
- Re: Scheduling x86 dispatch windows
  - From: Quentin Neill
- Re: Scheduling x86 dispatch windows
  - From: H.J. Lu
- Re: Scheduling x86 dispatch windows
  - From: Jeff Law
- Re: Scheduling x86 dispatch windows
  - From: Quentin Neill
- Re: Scheduling x86 dispatch windows
  - From: H.J. Lu
- Re: Scheduling x86 dispatch windows
  - From: Quentin Neill

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]