Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, March 02, 2011

Dynamic dead code elimination and hardware futures

[[Category:BS]]
http://semipublic.comp-arch.net/wiki/Dynamic_dead_code_elimination_and_hardware_futures
= [[Dynamic dead code elimination]] =

It has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution.

There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of [[hardware futures]].

; [[Semi-static optimization]]

Dynamic dead code elimination may be done as a [[semi-static optimization]] in an [[optimized trace cache]].
Basically, one applies classic compiler algorithms to the code to be cached,
augmented by reachability predicates.

Or, another classic way of doing hardware optimization is to examine the post-retirement instruction stream, eliminate dead code, and install that in the [[optimized trace cache]].

; [[Rename time optimization]]

Dynamic dead code elimination may also be done in the [[renamer]] pipestage. The basic idea is, instead of emitting uops as the are decoded,one assembles blocks of uops in the renamer.
Essentially, large expressions, potentially with multiple inputs and outputs.
One does not emit such a block of uops from the renamer to execution until a usage of the block is detected.

Typically, a store that may be visible to another processor constitutes a usage that will cause uops to be emitted.
(Although certain stores can be elided, e.g. push/pop/push [[stack optimizations]] and [[temporary variable optimizations]].)

Similarly, the evaluation of a branch condition may cause uops to be emitted. But one only need emit the subset of the buffered uops necessary to evaluate the branch.

Exceptions and other situations whether the entire program state may cause uops to be emitted.

And overflow of the renamer buffer.

If the output of an expression buffered in the renamer is overwritten, that part of the expression can be eliminated as dead code.

; [[Dataflow optimization]]

[[Dynamic dead code elimination]] can also be done in the OOO dataflow scheduling part of the machine.
Essentially, one assembles the dataflow graph, but one does not necessarily start execution as soon as inputs are ready.

The renamer adds an additional [[WAW]] arc to the dataflow graph, for a new uop that overwrites an old uop's output.
If that old uop has no [[RAW]] dependencies, then it can "fire", cancelling itself,
and signalling that cancellation to its own input dependencies in turn.

This is essentially [[reverse dataflow execution]].
It is straightforward, although it does require hardware that is not normally present.
It is easiest to accomplish with a [[bit matrix]] scheduler, where the [[input bitvector]] becomes the [[output bitvector]] of a cancelled uop.

It is not at all clear how useful this form of dataflow dead code elimination is:
it is buildable, but how much does it help?
It is probably most useful if it can be remembered in an [[optimized trace cache]].



= [[Hardware futures]] =

[[Hardware futures]] are the dual of [[dynamic dead code elimination]].
Instead of eliminating code that does not need to be executed,
it delays execution and evaluation of code until it is known to be needed.
This delay naturally accomplishes [[dynamic dead code elimination]].
It also provides increased scope for [[hardware CSE]],
and other optimizations.
But it potentially increases latency.

As for [[dynamic dead code elimination]],
it is fairly easy to see how to do this at the renamer:
have the renamer accumulate expressions.
Only release an expression to be evaluated when a usage that demands evaluation is detected.

We can go further, and avoid the buffering at the renamer becoming an issue by emitting such buffered instructions for a hardware future
into a sort of side car storage, rather like that of Akkary's [[CFP]].
When evaluation is demanded, we issue a token to cause insertion of those instructions.

Issue: are the buffered instructions pre or post rename? It is easiest to buffer them post rename,
using dataflow single assignment registers and memory locations.

Also as above, this can be deferred into the OOO dataflow scheduler.
But, again, that is using a fairly critical resource.

To make hardware futures less dynamic, more [[semi-static]] at the [[optimized trace cache]]
involves dividing the future into two parts:
the first part captures any live inputs,
that must be captured lest they change;
the second part does the actual evaluation on demand.
Typically memory load values must be live inputs, as well as registers,
and the future is constrained from doing stores that may be visible to other threads.

I.e. the hardware future is constrained to be a chunk of code that has no internal memory references.
Or at least only sharply limited internal memory references.

It is unclear how useful such a future would be. It would have to be saving considerable register based computation
- complicated instructions such as transcendentals - or be a complicated calculation on a small amount of memory.

Such hardware futures might be much more useful if we were released from the constraints of the [[Von Neumann memory model]],
e.g. if we had classic dataflow style [[single assignment memory]].
But in 2011 that does not seem likely any time soon.

Using Redundancies to Find Errors

Cleaning out old journals from my bookshelf.

Came across

Yichen Xie, Dawson Engler, "Using Redundancies to Find Errors," IEEE Transactions on Software Engineering, pp. 915-928, October, 2003

(actually, a slightly different version in Software Engineering Notes / SIGSOFT 2002/FSE 10.

Very interesting paper: used analysis tools that detected dead code, redundant conditionals, idempotent operations such as x&x to detect bugs.

Causes me to wonder about applying similar techniques to look for hardware/RTL/HDL bugs, e.g. in Verilog or VHDL.

For that matter, it caused me to think about similar hardware techniques. For example, it has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution. There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of hardware futures.

Now, such dynamic dead code detection *might* help locate bugs - but it suffers because it is dynamic, bound to the current path of execution. Whereas Xie's work looks at reachable code, including paths that may not be currently taken but which might use the values that are dead on the current path.

--

Done in xgcc, so code should be publicly available.

I wish I had had such tools in my last project.