Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, January 15, 2011

Dynamic Instruction Rewriting

http://semipublic.comp-arch.net/wiki/Dynamic_instruction_rewriting

[[Dynamic instruction rewriting]] is a microarchitecture optimization that I, Andy Glew, came up with while I was a student at the University of Wisconsin, 1996-2000.
Undoubtedly others have had similar ideas:
I know for sure after (TBD, publication)
and suspect before.
I will describe my version here (TBD add and compare other versions).


    I often refer to this optimization as simply [[flattening]], since that is what it does to the dataflow graph. [[Dynamic instruction rewriting]] was one of several optimizations I grouped as [[instruction refinement]] for the PhD I never completed.



[[Dynamic instruction rewriting]] does what it says - it rewrites instructions,
in particular, flattening the dataflow graph of a computation,
in hardware.
It does this much as a compiler would do, but goes beyond a compiler's capabilities in that it performs these optimizations across basic blocks, procedures,
infrequently taken branches, and other boundaries that might impede a compiler-based optimization.
(In this, I also include [["compiler-like optimizations" performed inside the instruction cache]].)

My flavour of [[dynamic instruction rewriting]] is based on letting instructions flow around a microarchitecture much as values might.

It can be performed in several places in the pipeline,
in particular in the   instruction renaming pipestage,
and/or in actual execution.
Also possibly as a [[post-retirement optimization]].

The quintessential example of [[dynamic instruction rewriting]] is this:
imagine an instruction sequence that increments a register.
(Actually, we will change the registers for didactic purposes, as might happen with simple loop unrolling.)
   ...
   r2 := r1 + 1
   ...
   r1 := r2 + 1
   ...
   r2 := r1 + 1
   ...

Let us discuss the renamer pipestage optimization.

The renamer or map pipestage has a table indexed by logical register number.
In a simple in-order scoreboarded machine, the entries of this table may be a single bit indicating whether a register value is ready or not;
in a P6 or [[HaRRM]] style machine, the entries of this table may also contain physical register numbers that the logical registers are mapped to, which are scheduled later;
in a [[future file]] or [[active register file]], the actual values may also be in this table.
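As an illustrative sketch (the class and field names below are my own, not from any particular design), these three styles can be pictured as increasingly rich table entries:

```python
# Illustrative sketch of per-logical-register rename-table entries in
# the three machine styles described above. Field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoreboardEntry:      # simple in-order scoreboarded machine
    ready: bool             # a single bit: is the register value ready?

@dataclass
class MapEntry:             # P6 or HaRRM style renamer
    ready: bool
    preg: int               # physical register this logical register maps to

@dataclass
class FutureFileEntry:      # future file / active register file
    ready: bool
    preg: int
    value: Optional[int]    # the actual value, once it is available

entry = FutureFileEntry(ready=False, preg=33, value=None)
print(entry.preg)           # 33
```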

In renamer pipestage [[dynamic instruction rewriting]], the table also includes the instruction - the actual instruction - that writes the result.
Or at least a subset thereof (the subset of instructions that can be optimized in this manner).

Considering the above code sequence, let us show the state of a P6 or [[HaRRM]] style register renamer, mapping logical to physical register numbers.
The actual physical register numbers do not matter, and could be random - whatever preg is free next.
      
   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11
     r2 -> pr33
   ...
   r1 := r2 + 1
     r1 -> pr44
     r2 -> pr33
   ...
   r2 := r1 + 1
     r1 -> pr44
     r2 -> pr55
   ...

Now let us augment the table with the actual instruction producing each result - but without yet performing the optimization:


   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr55   pr55 := pr44 + 1

Now let us imagine that the renamer logic is enabled to substitute the input "value" of a register - the instruction - into its output expression:

   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr55   pr55 := pr44 + 1

When the [[dynamic instruction rewriting]] logic sees a value of the form
   (pr11 + 1) + 1
it performs the associative and commutative transformations, producing
   => pr11 + 1 + 1
   => pr11 + 2

and overall

   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1 == pr11 + 2
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1 == pr11+2
     r2 -> pr55   pr55 := pr44 + 1 == pr11+3

and so on.
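The renamer walkthrough above can be modeled in a few lines of toy Python (purely illustrative; the class and its flattened (base, offset) producer representation are my invention, not any shipping design):

```python
# Toy model of a renamer-stage dynamic instruction rewriter. For each
# logical register the map holds a physical register plus the producing
# instruction, already flattened to the form "base_preg + constant".
# When a new increment is renamed, we substitute the source's producer
# and fold the constants, flattening the dataflow graph.

class Renamer:
    def __init__(self):
        self.next_preg = 0
        # logical reg -> (preg, producer); producer is (base_preg, offset)
        # meaning "preg := base_preg + offset", or None for a live-in value.
        self.map = {}

    def fresh_preg(self):
        self.next_preg += 1
        return "pr%d" % self.next_preg

    def live_in(self, lreg):
        self.map[lreg] = (self.fresh_preg(), None)

    def rename_add_imm(self, dest, src, imm):
        """Rename 'dest := src + imm', rewriting it against src's producer."""
        src_preg, src_producer = self.map[src]
        if src_producer is None:
            base, offset = src_preg, imm      # nothing to flatten
        else:
            base, k = src_producer
            offset = k + imm                  # fold: (base + k) + imm
        dest_preg = self.fresh_preg()
        self.map[dest] = (dest_preg, (base, offset))
        return "%s := %s + %d" % (dest_preg, base, offset)

r = Renamer()
r.live_in("r1")                          # r1 -> pr1, a live-in value
print(r.rename_add_imm("r2", "r1", 1))   # pr2 := pr1 + 1
print(r.rename_add_imm("r1", "r2", 1))   # pr3 := pr1 + 2  (flattened)
print(r.rename_add_imm("r2", "r1", 1))   # pr4 := pr1 + 3  (flattened)
```

Because each destination's producer is stored already flattened, every increment in the chain renames to depth 1 off the live-in pr1, exactly as in the tables above.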


Actually, nothing in the above description depends on the [[dynamic instruction rewriting]] logic being in the renamer pipestage.

If it is in the renamer pipestage, and there is no additional delay for the optimization, we would never see a multi-stage transformation
  pr55 := pr44+1 == (pr33+1)+1 == (pr11+2)+1 == pr11+3
i.e. if the [[dynamic instruction rewriting]] logic at the  renamer pipestage were fast enough, only the optimized version would propagate.


However, if the [[dynamic instruction rewriting]] logic were in the out-of-order execution system, we might see such multistage  transformations.
(Also, if renamer pipestage logic had a delay.)

Indeed, we might place unrewritten instructions into the out-of-order scheduler,
and trigger a rewrite when one of the operands arrives but another has not,
without waiting for the second or other operands to arrive.
I.e. this [[dynamic instruction rewriting]] optimization can take advantage of [[operand inter-arrival time skew]] -
indeed, of [[dynamic operand inter-arrival time skew]], when the operand arrival times cannot be estimated in advance.

(Another  class of optimizations can be performed when you can estimate [[operand inter-arrival time skew]].)

Consider
    ...
    r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r4 := r1 + r2
    r5 := r4 + r3
    ...

Let all of these operations be  placed  into an out-of-order scheduler.
(The renamer for this is trivial.)

Now let us assume that the r3 load completes early, r3:=C

    ...
    r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r4 := r1 + r2
    r5 := r4 + (r3=C)
    ...

Now let us assume that the r2 load completes early, r2:=B

    ...
    r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    COMPLETE B <= r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r4 := r1 + (r2=B)
    r5 := r4 + (r3=C)
    ...

Then let us "fire" the partially complete instruction
    r4 := r1 + (r2=B)
producing


    ...
    r1 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    COMPLETE B <= r2 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache  miss, unpredictably randomized arrival time
    r4 := r1 + (r2=B)
    r5 := r4 + (r3=C) == (r4=(r1+(r2=B))) + (r3=C) == r1 + (B+C)
    ...

where B+C is a constant that can be evaluated, and written into the scheduler entry for r5 :=.

Therefore, when r1 completes, both
   r4 := r1 + B
and
   r5 := r1 + (B+C)
can be evaluated at dataflow graph depth 1 - rather than depth 2 if not rewritten.
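This skew-driven rewriting can also be sketched as a toy scheduler model (illustrative Python of my own; the entry format and function names are hypothetical, and I use B=20, C=30 as example values):

```python
# Toy model of rewriting inside an out-of-order scheduler, exploiting
# operand inter-arrival time skew. Each entry dest := a + b is kept in
# flattened form: a list of operands still pending, plus a constant
# folding together the operands that have already arrived.

entries = {
    "r4": {"pending": ["r1", "r2"], "const": 0},   # r4 := r1 + r2
    "r5": {"pending": ["r4", "r3"], "const": 0},   # r5 := r4 + r3
}

def complete(reg, value):
    """An operand arrives: fold its value into every waiting entry."""
    for e in entries.values():
        if reg in e["pending"]:
            e["pending"].remove(reg)
            e["const"] += value
    rewrite()

def rewrite():
    """Fire a partially complete producer into its consumer: if an entry
    waits only on another entry that itself waits on one register,
    substitute that register and fold in the producer's constant."""
    for e in entries.values():
        if len(e["pending"]) == 1 and e["pending"][0] in entries:
            prod = entries[e["pending"][0]]
            if len(prod["pending"]) == 1:
                e["pending"] = list(prod["pending"])
                e["const"] += prod["const"]

complete("r3", 30)    # C = 30 arrives: r5 := r4 + 30
complete("r2", 20)    # B = 20 arrives: r4 := r1 + 20, and the rewrite
                      # flattens r5 to r5 := r1 + 50 == r1 + (B+C)
print(entries["r4"])  # {'pending': ['r1'], 'const': 20}
print(entries["r5"])  # {'pending': ['r1'], 'const': 50}
```

When r1 finally arrives, both r4 and r5 are a single add away from it - the depth-1 result described above.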