[[Dynamic instruction rewriting]] is a microarchitecture optimization that I, Andy Glew, came up with while I was a student at the University of Wisconsin, 1996-2000.
Undoubtedly others have had similar ideas:
I know for sure after (TBD, publication)
and suspect before.
I will describe my version here (TBD add and compare other versions).
- I often refer to this optimization as simply [[flattening]], since that is what it does to the dataflow graph.
[[Dynamic instruction rewriting]] was one of several optimizations I grouped as [[instruction refinement]] for the PhD I never completed.
[[Dynamic instruction rewriting]] does what it says - it rewrites instructions,
in particular, flattening the dataflow graph of a computation,
in hardware.
It does this much as a compiler would do, but goes beyond a compiler's capabilities in that it performs these optimizations across basic blocks, procedures,
infrequently taken branches, and other boundaries that might impede a compiler based optimization.
(In this, I also include [["compiler-like optimizations" performed inside the instruction cache]].)
My flavour of [[dynamic instruction rewriting]] is based on letting instructions flow around a microarchitecture much as values might.
It can be performed in several places in the pipeline,
in particular in the instruction renaming pipestage,
and/or in actual execution.
Also possibly as a [[post-retirement optimization]].
The quintessential example of [[dynamic instruction rewriting]] is this:
imagine an instruction sequence that increments a register.
(Actually, we will change the registers for didactic purposes, as might happen with simple loop unrolling.)
...
r2 := r1 + 1
...
r1 := r2 + 1
...
r2 := r1 + 1
...
Let us discuss the renamer pipestage optimization.
The renamer or map pipestage has a table indexed by logical register number.
In a simple in-order scoreboarded machine, the entries of this table may be single bits indicating whether a register value is ready or not;
in a P6 or [[HaRRM]] style machine, the entries of this table may also contain physical register numbers that the logical registers are mapped to, which are scheduled later;
in a [[future file]] or [[active register file]], the actual values may also be in this table.
In renamer pipestage [[dynamic instruction rewriting]], the table also includes the instruction, the actual instruction, that writes the result.
Or at least a subset thereof (the subset that can be optimized in this manner).
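As a minimal sketch of the rename-table entry described above, from simplest to richest; the field names are assumptions for illustration, not any real machine's:

```python
# Sketch of what a rename-table entry can hold, per the three machine styles
# above, plus the producing instruction needed for dynamic instruction rewriting.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MapEntry:
    ready: bool = False               # scoreboard: single ready bit
    preg: Optional[int] = None        # P6 / HaRRM: physical register number
    value: Optional[int] = None       # future file / active register file: the value
    producer: Optional[Tuple] = None  # dynamic instruction rewriting: the producing
                                      # instruction (or a rewritable subset thereof)

# e.g. after renaming "r2 := r1 + 1" with r1 -> pr11, r2 -> pr33:
entry = MapEntry(ready=False, preg=33, producer=("add_imm", 11, 1))
```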
Considering the above code sequence, let us show the contents of a P6 or [[HaRRM]] style register renamer, mapping logical to physical register numbers.
The actual physical register numbers do not matter, and could be random - whatever preg is free next.
...
r1 -> pr11
r2 -> pr22
r2 := r1 + 1
r1 -> pr11
r2 -> pr33
...
r1 := r2 + 1
r1 -> pr44
r2 -> pr33
...
r2 := r1 + 1
r1 -> pr44
r2 -> pr55
...
Now let us augment the table with the actual instruction producing the result - but without performing the optimization:
...
r1 -> pr11
r2 -> pr22
r2 := r1 + 1
r1 -> pr11
r2 -> pr33 pr33 := pr11 + 1
...
r1 := r2 + 1
r1 -> pr44 pr44 := pr33 + 1
r2 -> pr33 pr33 := pr11 + 1
...
r2 := r1 + 1
r1 -> pr44 pr44 := pr33 + 1
r2 -> pr55 pr55 := pr44 + 1
Now let us imagine that the renamer logic is enabled to substitute the input "value" of a register - the instruction - into its output expression:
...
r1 -> pr11
r2 -> pr22
r2 := r1 + 1
r1 -> pr11
r2 -> pr33 pr33 := pr11 + 1
...
r1 := r2 + 1
r1 -> pr44 pr44 := pr33 + 1 == (pr11 + 1) + 1
r2 -> pr33 pr33 := pr11 + 1
...
r2 := r1 + 1
r1 -> pr44 pr44 := pr33 + 1
r2 -> pr55 pr55 := pr44 + 1
When the [[dynamic instruction rewriting]] logic sees a value of the form
(pr11 + 1) + 1
it performs the associative and commutative transformations, producing
=> pr11 + 1 + 1
=> pr11 + 2
and overall
...
r1 -> pr11
r2 -> pr22
r2 := r1 + 1
r1 -> pr11
r2 -> pr33 pr33 := pr11 + 1
...
r1 := r2 + 1
r1 -> pr44 pr44 := pr33 + 1 == (pr11 + 1) + 1 == pr11 + 2
r2 -> pr33 pr33 := pr11 + 1
...
r2 := r1 + 1
r1 -> pr44 pr44 := pr33 + 1 == pr11+2
r2 -> pr55 pr55 := pr44 + 1 == pr11+3
and so on.
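The renamer-stage rewrite above can be sketched as follows. This is a toy model, not an RTL description: the table maps each logical register to a (preg, producer) pair, where producer records "base preg + immediate" when the producing instruction is of that rewritable form, and the rewrite folds the two immediates. Function and variable names are assumptions for illustration.

```python
# Toy model of renamer-stage dynamic instruction rewriting for add-immediate chains.
def rename_add_imm(table, freelist, dst, src, imm):
    """Rename 'dst := src + imm'; return the rewritten uop (new_preg, src_preg, imm)."""
    preg, producer = table[src]
    if producer is not None:             # source was itself produced by base + k
        base, k = producer
        preg, imm = base, k + imm        # (base + k) + imm == base + (k + imm)
    new_preg = freelist.pop(0)
    table[dst] = (new_preg, (preg, imm)) # record the producer for later folding
    return (new_preg, preg, imm)         # emitted uop: new_preg := preg + imm

table = {1: (11, None), 2: (22, None)}   # r1 -> pr11, r2 -> pr22
free = [33, 44, 55]
u1 = rename_add_imm(table, free, 2, 1, 1)  # r2 := r1 + 1  ->  pr33 := pr11 + 1
u2 = rename_add_imm(table, free, 1, 2, 1)  # r1 := r2 + 1  ->  pr44 := pr11 + 2
u3 = rename_add_imm(table, free, 2, 1, 1)  # r2 := r1 + 1  ->  pr55 := pr11 + 3
```

Note that u2 and u3 come out already flattened (pr11 + 2, pr11 + 3), matching the worked example above: with zero-delay rewriting in the renamer, only the optimized form ever enters the map.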
Actually, nothing in the above description depends on the [[dynamic instruction rewriting]] logic being in the renamer pipestage.
If it is in the renamer pipestage, and there is no additional delay for the optimization, we would never see a multi-stage transformation such as
pr55 := pr44 + 1 == (pr33 + 1) + 1 == ((pr11 + 1) + 1) + 1 == pr11 + 3
i.e. if the [[dynamic instruction rewriting]] logic at the renamer pipestage were fast enough, only the optimized version would propagate.
However, if the [[dynamic instruction rewriting]] logic were in the out-of-order execution system, we might see such multistage transformations.
(Also, if renamer pipestage logic had a delay.)
Indeed, we might place unrewritten instructions into the out-of-order scheduler,
and trigger a rewrite when one of the operands arrives
but another has not, without waiting for the remaining operands to arrive.
I.e. this [[dynamic instruction rewriting optimization]] can take advantage of [[operand inter-arrival time skew]],
indeed, of [[dynamic operand inter-arrival time skew]], when the operand arrival times cannot be estimated in advance.
(Another class of optimizations can be performed when you can estimate [[operand inter-arrival time skew]].)
Consider
...
r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
r4 := r1 + r2
r5 := r4 + r3
...
Let all of these operations be placed into an out-of-order scheduler.
(The renamer for this is trivial.)
Now let us assume that the r3 load completes early, r3:=C
...
r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
r4 := r1 + r2
r5 := r4 + (r3=C)
...
Now let us assume that the r2 load completes early, r2:=B
...
r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
COMPLETE B <= r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
r4 := r1 + (r2=B)
r5 := r4 + (r3=C)
...
Then let us "fire" the partially complete instruction
r4 := r1 + (r2=B)
producing
...
r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
COMPLETE B <= r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
r4 := r1 + (r2=B)
r5 := r4 + (r3=C) == (r1 + (r2=B)) + (r3=C) == r1 + (B + C)
...
where B+C is a constant that can be evaluated, and written into the scheduler entry for r5:=.
Therefore, when r1 completes, both
r4 := r1 + B
and
r5 := r1 + (B+C)
can be evaluated at dataflow graph depth 1 - rather than the depth of two if not rewritten.
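The scheduler-side rewrite exploiting [[operand inter-arrival time skew]] can be sketched as follows. Each pending add is represented as "still-missing source + accumulated constant"; when a producer has folded down to that form, a consumer whose other operand has also arrived can re-associate and fold the constants before the final operand shows up. This is a toy model with assumed names, not a scheduler design.

```python
# Toy model of rewriting a partially complete add chain in the scheduler.
def try_rewrite(consumer, producer):
    """Fold producer 'x := src + k1' into consumer 'y := x + k2',
    giving 'y := src + (k1 + k2)' by associativity."""
    src, k1 = producer
    _, k2 = consumer
    return (src, k1 + k2)

B, C = 7, 5               # values delivered by the two completed loads
r4 = ("r1", B)            # r4 := r1 + r2, after r2's value B arrived
r5 = ("r4", C)            # r5 := r4 + r3, after r3's value C arrived
r5 = try_rewrite(r5, r4)  # rewritten in place: r5 := r1 + (B + C)
# When r1 finally arrives, both r4 and r5 complete at dataflow depth 1,
# rather than r5 waiting one extra level behind r4.
```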