The content of this blog is my personal opinion only. Although I am an employee - currently of Imagination Technologies's MIPS group, in the past of other companies such as Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, January 31, 2011



In the era of Willamette I suffered a moral dilemma - or, rather, a dilemma of morale.
I had evangelized [[OOO execution]] with P6.
But I had been introduced to runahead.
And runahead execution seemed to be asymptotically as good as OOO execution,
as memory latencies got longer - i.e. as we entered the era of [[MLP (memory level parallelism)]], which I was then starting to evangelize.

OOO execution is ... well, I'll assume you know what OOO is, or can read about it [[OOO|elsewhere]].

= What Runahead Is =

[[Runahead]] consists essentially of
* executing normally
* when an unpredictable, long-latency event, such as a cache miss, occurs:
** taking a checkpoint
** marking the result invalid
* continuing to execute past the checkpoint in a speculative manner:
** saving speculative values into registers
** saving speculative values into a store buffer, or similar [[memory speculation datastructure]]
* when the cache miss returns:
** discarding all speculative state, registers and memory
** restoring from the checkpoint.
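The steps above can be sketched in a few lines of Python. This is a toy model, not a hardware description: the class, method names, and the dict-based register file and store buffer are all illustrative.

```python
# Toy sketch of the runahead control flow described above.
# All names here are illustrative, not any real simulator's API.

INVALID = object()  # poison value marking the result of a missing load


class RunaheadCore:
    def __init__(self):
        self.regs = {}           # architectural register file
        self.store_buffer = {}   # speculative memory values: addr -> value
        self.checkpoint = None
        self.runahead = False

    def on_cache_miss(self, dest_reg):
        if not self.runahead:
            self.checkpoint = dict(self.regs)  # take a checkpoint
            self.runahead = True
        self.regs[dest_reg] = INVALID          # mark the result invalid

    def store(self, addr, value):
        if self.runahead:
            # speculative stores stay buffered, never reach memory
            self.store_buffer[addr] = value
        else:
            ...  # non-speculative path: write to memory (not modeled here)

    def on_miss_return(self):
        # discard all speculative state, registers and memory,
        # and restore from the checkpoint
        self.regs = self.checkpoint
        self.store_buffer.clear()
        self.runahead = False
```

Everything executed between `on_cache_miss` and `on_miss_return` is pure prefetching: its architectural effects are thrown away, and only the cache/TLB side effects survive.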

= Runahead and OOO are asymptotically equal =

Runahead is a comparatively simple action.
And, in some sense, it gets nearly all of the benefit of OOO.
I.e. as memory latencies increase to infinity and execution latencies become infinitesimal by comparison,
runahead execution will approach OOO execution in performance.

Normally, barring funny effects, runahead will be lower performance than OOO for the same size of instruction window.
But runahead is much cheaper in hardware, and can potentially runahead much farther than OOO for the same hardware cost.
Both runahead and OOO are limited by the [[memory speculation datastructure]];
but runahead need build no large physical register file, etc. - it needs only a number of RF (register file) checkpoints.

= [[log-based]] =

Confounded by the OOO/runahead dilemma, I looked for something better than runahead in some asymptotic sense.
Hence [[log-based]] execution - specifically [[log-based verify re-execution]].

Although subsequently I applied [[log-based]] to [[speculative multithreading]],
at first I thought about it only wrt the same [[single sequencer microarchitecture]]
as OOO or runahead:

Imagine that you are executing in a runahead manner.
However, instead of re-executing all of the instructions after a cache miss returns and a checkpoint is restored,
you (1) record all of the results of runahead execution in a log,
and (2) execute instructions out of the log.

This raises several questions:
Why would this be any faster than runahead?
Why would it be any "larger" than OOO or as large as runahead?
Why would it be as cheap as runahead?

Let's answer them out of order.

;Why would this be as cheap as runahead?

Well, if you had a dedicated log [[hardware datastructure]], it would not be.
However, the log is accessed sequentially during [[verify re-execution]],
and sequential access patterns can be easily prefetched.
Therefore, you could put the log in slow memory - potentially in main memory.
You might need some fast hardware log to "start off" the verify re-execution,
sufficient to tolerate the latency to main memory.
Equivalently, you might manage cache replacement to ensure that the oldest entries in the log are not displaced from a cache of main memory,
as they normally would be by LRU.
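The sizing argument for that fast "start off" log is simple latency-bandwidth arithmetic. A minimal sketch, with purely illustrative numbers (the latency and replay rate below are assumptions, not measurements):

```python
# Back-of-the-envelope sizing for the fast hardware log that starts off
# verify re-execution while prefetches of the in-memory log are in flight.

def head_buffer_entries(mem_latency_cycles, replay_entries_per_cycle):
    """Entries the fast log must hold: enough work to keep replay busy
    for one full round trip to main memory."""
    return mem_latency_cycles * replay_entries_per_cycle

# Hypothetical example: 300-cycle memory latency, replaying 4 log
# entries per cycle, requires a 1200-entry fast head buffer.
print(head_buffer_entries(300, 4))
```

Everything past that head buffer can stream from slow memory, since sequential prefetch hides the latency of the rest of the log.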

;Why would it be any "larger" than OOO or as large as runahead?

As explained above, the log need not be a small hardware datastructure.  It could potentially scale in size with main memory.

;Why would this be any faster than runahead?

Two reasons.

First, [[verify re-execution]] is inherently more parallel than normal execution.

I like to explain this using repeated increments of a register as an example:

 r9 := load(cache miss)
 INC r9
 INC r9
 INC r9

If INC is unit latency, then runahead re-execution would take N cycles to re-execute N such INCrements of a register.

However, in [[log-based verify re-execution]],
the example code above would be rewritten in the log as

 r9 := load(cache miss)
 assert(r9 = 1042)
 INC r9 // assert(r9=1042); r9 := 1043
 INC r9 // assert(r9=1043); r9 := 1044
 INC r9 // assert(r9=1044); r9 := 1045
 ...
 INC r9 // assert(r9=1041+N); r9 := 1042+N

i.e. the instructions rewritten for storage in the log,
and for [[verify re-execution]] out of the log, consist of nothing except moves of a constant to a register.
Rather than N cycles, these would take N/IF cycles, where IF is the instruction width.
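The INC example above can be sketched as a toy log builder and replayer. This is a minimal illustration, assuming a simplified instruction form where each entry reads and writes one register; the function names and the tuple format are my own, not from any real design.

```python
# Toy sketch of log-based verify re-execution for the INC example above.
# Each log entry is (reg, value_read, value_written): an assert on the
# input value plus a constant move of the output value.

def build_log(trace):
    """trace: list of (reg, input_value, output_value) from runahead
    execution. The log stores only asserts and constant moves."""
    return [(reg, vin, vout) for reg, vin, vout in trace]

def verify_replay(log, regs):
    """Replay the log. Every entry is independent of every other:
    asserts check logged constants, moves write logged constants,
    so in hardware they could all proceed in parallel."""
    for reg, vin, vout in log:
        if regs.get(reg) != vin:
            return False   # assertion failed: fall back to normal re-execution
        regs[reg] = vout   # otherwise, just a constant-to-register move
    return True
```

The key property is that no entry's input comes from another entry's output at replay time - both sides are logged constants - which is why replay can run at the full instruction width rather than being serialized by the INC dependence chain.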

Furthermore, there are many optimizations that can be applied to the log:
removing redundant asserts and redundant overwrites, etc.
In general, for a block of register-based code,
[[log-based verify re-execution]] has one assert for every live-in,
and one constant-to-register move for every live-out,
assuming the assertions are met.
(And if the assertions are not met, fall back to normal (re-)execution.)
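That live-in/live-out collapse can be sketched as a simple pass over a block's log entries. Again a hypothetical illustration, assuming each entry is a (register, value read, value written) triple; the helper name is mine:

```python
# Toy sketch of optimizing a block's log down to one assert per live-in
# register and one constant move per live-out register.

def summarize_block(trace):
    """trace: list of (reg, value_read, value_written) entries for one block.
    A register read before any write in the block is a live-in (assert);
    the last value written to each register is its live-out (move)."""
    live_in, live_out, defined = {}, {}, set()
    for reg, vin, vout in trace:
        if reg not in defined:
            live_in[reg] = vin    # first read precedes any definition: assert it
        defined.add(reg)
        live_out[reg] = vout      # last write wins: the only move we must keep
    return live_in, live_out
```

For the INC chain above, N log entries collapse to a single assert(r9=1042) and a single move of the final value, so the optimized log's cost is proportional to the block's live values, not its instruction count.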

The second reason for log-based verify re-execution being faster is this:

Normal runahead re-execution is limited by cache associativity.
If N independent memory references missed the cache and were prefetched during the runahead speculative execution epoch,
they are normally all still present when re-execution begins -
except when they happen to collide in the limited-associativity cache and thrash each other out,
in which case another runahead/checkpoint/re-execution episode must begin.

Log-based verify re-execution suffers this not at all.
If a cache missing load is thrashed out of the cache,
its value is still present in the log,
and dependent instructions can still be executed - or, rather, verify re-executed.
You will need to pay the memory latency to fetch it and verify, but that can be overlapped.

= Conclusion =

Log-based verify re-execution, or replay, for runahead
is asymptotically faster than ordinary runahead in two senses:
non-cache miss execution latency,
and cache thrashing.

Ultimately, the fashion or fad of runahead petered out.
Runahead did not replace OOO, although it caused much heartburn.

Runahead did not replace OOO because, although runahead is asymptotically as good as OOO and cheaper, we aren't at the asymptote.
Execution unit latencies are not zero, and cache miss latencies are not infinite.
OOO handles both.

Also, runahead may consume more power than OOO.  This is somewhat debatable:
OOO's large hardware datastructures may waste static power, e.g. leakage,
whereas runahead's repeated re-execution consumes extra dynamic power.

Probably the biggest reason why runahead did not supplant OOO in the late 1990s and early 2000s is this:
runahead is only about as good as OOO.
It doesn't do anything new.
OOO was an already established and well understood technology by the time runahead appeared on the scene.
Although runahead has putative benefits, they were not enough to warrant a switch-over.

Be this as it may, log-based verify re-execution is a technique that rose out of the [[Runahead versus OOO]] debate.
It is better than runahead in some asymptotic senses.  And it has application to other areas, such as [[SpMT]].

= Hybrids =

Of course it is possible to create hybrids of OOO, runahead, and log-based execution.

For example, one need not record all instruction results in the log.  One can re-execute that which is fast,
and only record in the log that which is slow.

Similarly, an OOO machine need not record all speculative results in the instruction window.

Onur Mutlu and others have proposed hybrids that fill an OOO instruction window, and then use runahead.

1 comment:

juanrga said...

I am interested in alternatives to OOO and did read many works on the subject including runahead with or without reuse, iCFP, multipass pipelining,...

I would like to know more details about log-based replay, but the wiki link gives "MediaWiki 1.22 internal error".