Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, January 31, 2011

Slow :-(

Minor bitching: writing in the comp-arch wiki is so slow.

Hard to get big chunks of time.

Hard to write big chunks when I don't have big chunks of time.

But at least my present employer allows me to write this, and does not forbid me, as my past employer did.

I wish that I could average 1 new page a day.  Instead, I am averaging less than half a page a day, and probably less than 1 substantial page a week.

OOO_versus_Runahead_versus_Log-based_replay_for_runahead

http://semipublic.comp-arch.net/wiki/OOO_versus_Runahead_versus_Log-based_replay_for_runahead

In the era of Willamette I suffered a moral dilemma - or, rather, a dilemma of morale.
I had evangelized [[OOO execution]] with P6.
But I had been introduced to runahead.
And runahead execution seemed to be asymptotically as good as OOO execution,
as memory latencies got longer - i.e. as we entered the era of [[MLP (memory level parallelism)]], which I was then starting to evangelize.

OOO execution is ... well, I'll assume you know what OOO is, or can read about it [[OOO|elsewhere]].

= What Runahead Is =

[[Runahead]] consists essentially of (see the sketch after this list):
* executing normally
* when an unpredictable, long-latency event such as a cache miss occurs
** take a checkpoint
** mark the result invalid
* continue executing past the checkpoint in a speculative manner
** saving speculative values into registers
** saving speculative values into a store buffer, or similar [[memory speculation datastructure]]
* when the cache miss returns
** discard all speculative state, registers and memory
** restore from the checkpoint.
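
Here is a toy sketch, in C++, of the control flow just described. Everything in it is invented for illustration - the two-instruction ISA, the flat memory, the single forced miss - and a real implementation would buffer speculative stores in the [[memory speculation datastructure]] and run ahead until the miss actually returns, rather than to the end of the program:

 // Toy runahead loop. The ISA, memory model, and all names are invented
 // for this sketch; stores and the store buffer are omitted.
 #include <array>
 #include <cstdio>
 #include <vector>
 
 struct Insn { enum { LOAD, ADD1 } op; int dst, src, addr; };
 
 struct State {
     std::array<long, 8> reg{};
     std::array<bool, 8> poison{};   // "result invalid" marks during runahead
 };
 
 int main() {
     std::vector<Insn> prog = {
         {Insn::LOAD, 1, 0, 100},    // r1 := load [100]  (pretend this misses)
         {Insn::ADD1, 2, 1, 0},      // r2 := r1 + 1      (poisoned in runahead)
         {Insn::LOAD, 3, 0, 200},    // r3 := load [200]  (independent: prefetched)
     };
     std::array<long, 256> mem{};
     mem[100] = 42; mem[200] = 7;
 
     State arch;                     // committed (architectural) state
     for (size_t pc = 0; pc < prog.size(); ++pc) {
         const Insn& i = prog[pc];
         if (i.op == Insn::LOAD && pc == 0) {        // unpredictable long-latency event
             State spec = arch;      // take a checkpoint (arch itself serves as it)
             spec.poison[i.dst] = true;              // mark the result invalid
             for (size_t j = pc + 1; j < prog.size(); ++j) {  // run ahead speculatively
                 const Insn& k = prog[j];
                 if (k.op == Insn::LOAD) {
                     std::printf("runahead: prefetch [%d]\n", k.addr);
                     spec.reg[k.dst] = mem[k.addr];
                 } else if (!(spec.poison[k.dst] = spec.poison[k.src])) {
                     spec.reg[k.dst] = spec.reg[k.src] + 1;  // poison propagates
                 }
             }
             // Miss returns: discard ALL speculative state (spec simply dies),
             // restore from the checkpoint, and re-execute normally.
             arch.reg[i.dst] = mem[i.addr];
         } else if (i.op == Insn::LOAD) {
             arch.reg[i.dst] = mem[i.addr];          // now a hit, thanks to prefetch
         } else {
             arch.reg[i.dst] = arch.reg[i.src] + 1;
         }
     }
     std::printf("r2=%ld r3=%ld\n", arch.reg[2], arch.reg[3]);  // r2=43 r3=7
 }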

= Runahead and OOO are asymptotically equal =

Runahead is a comparatively simple action.
And, in some sense, it gets nearly all of the benefit of OOO.
I.e. as memory latencies increase to infinity and execution latencies become infinitesimal by comparison,
runahead execution will approach OOO execution in performance.

Normally, barring funny effects, runahead will always be lower performance than OOO,
for the same size of instruction window.
But runahead is much cheaper in hardware, and can potentially run ahead much farther than OOO for the same hardware cost.
Both runahead and OOO are limited by the [[memory speculation datastructure]];
but runahead need build no large physical register file, etc. - it needs only a number of register-file checkpoints.
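
A back-of-the-envelope illustration, with invented numbers (a 300-cycle miss, a 128-instruction window, a 4-wide machine - assumptions, not measurements):

 OOO:      cycles per miss epoch ~= 300 + 128/4     = 332
 runahead: cycles per miss epoch ~= 300 + 2*(128/4) = 364
           (runahead executes the window twice: once speculatively, once for real)

As the miss latency grows without bound, the extra 32 cycles of re-execution become noise, and the ratio of the two approaches 1.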

= [[log-based]] =

Confounded by OOO/run-ahead, I looked for something better than runahead in some asymptotic sense.
Hence [[log-based]] execution - specifically [[log-based verify re-execution]].

Although subsequently I applied [[log-based]] to [[speculative multithreading]],
at first I thought about it only wrt the same [[single sequencer microarchitecture]]
as OOO or runahead:

Imagine that you are executing in a runahead manner.
However, instead of re-executing all of the instructions after a cache miss returns and a checkpoint is restored,
instead you (1) record all of the results of runahead execution in a log,
and (2) you execute instructions out of the log.

This raises several questions:
Why would this be any faster than runahead?
Why would it be any "larger" than OOO or as large as runahead?
Why would it be as cheap as runahead?

Let's answer them out of order.

;Why would this be as cheap as runahead?

Well, if you had a dedicated log [[hardware datastructure]], it would not be.
However, the log is accessed sequentially during [[verify re-execution]].
Sequential access patterns can be easily prefetched.
Therefore, you could put the log in slow memory - potentially in main memory.
You might need some fast hardware log to "start off" the verify re-execution,
sufficient to tolerate the latency to main memory.
Equivalently, you might manage cache replacement to ensure that the oldest entries in the log are not displaced from a cache of main memory,
as they normally would be under LRU.
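
To make the sequential-prefetch point concrete, here is a minimal C++ sketch of verify re-execution walking a log that lives in ordinary slow memory, with a software prefetch standing in for a trivial hardware stream prefetcher. __builtin_prefetch is a real GCC/Clang builtin; the log format, the 16-register file, and the prefetch distance are assumptions of the sketch:

 // Walk the log strictly sequentially, prefetching a fixed distance ahead
 // so that main-memory latency is hidden behind the sequential consumer.
 #include <cstdint>
 #include <cstdio>
 #include <vector>
 
 struct LogEntry { int dstReg; int64_t value; };    // "move constant to register"
 
 void replay(const std::vector<LogEntry>& log, int64_t regs[16]) {
     const size_t kAhead = 64;                      // prefetch distance: a guess
     for (size_t i = 0; i < log.size(); ++i) {
         if (i + kAhead < log.size())
             __builtin_prefetch(&log[i + kAhead]);  // fetch well ahead of use
         regs[log[i].dstReg] = log[i].value;        // one verify re-execution step
     }
 }
 
 int main() {
     std::vector<LogEntry> log;
     for (int64_t n = 0; n < 1000; ++n) log.push_back({9, 1042 + n});
     int64_t regs[16] = {};
     replay(log, regs);
     std::printf("r9 = %lld\n", (long long)regs[9]); // 2041
 }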

;Why would it be any "larger" than OOO or as large as runahead?

As explained above, the log need not be a small, fixed-size hardware datastructure.  It could potentially scale in size with main memory.

;Why would this be any faster than runahead?

Two reasons.

First, [[verify re-execution]] is inherently more parallel than normal execution.

I like to explain this using repeated incrementing of a register as an example:

 r9 := load(cache miss)
 INC r9
 INC r9
 ...
 INC r9

If INC is unit latency, then runahead re-execution would take N cycles to re-execute N such INCrements of a register.

However, in [[log-based verify re-execution]],
the example code above would be rewritten in the log as

 r9 := load(cache miss)
 assert(r9 = 1042)
 INC r9 // assert(r9=1042); r9:= 1043
 INC r9 // assert(r9=1043); r9:= 1044
 INC r9 // assert(r9=1044); r9:= 1045
 ...
 INC r9 // assert(r9=1042+N-1); r9:= 1042+N

i.e. the instructions rewritten for storage in the log,
and for [[verify re-execution]] out of the log, consist of nothing except moves of a constant to a register.
Rather than N cycles, these would take N/IF cycles, where IF is the machine's instruction issue width.

Furthermore, there are many optimizations that can be applied to the log:
removing redundant asserts and redundant overwrites, etc.
In general, for a block of register based code,
[[log-based verify re-execution]] has one assert for every live-in,
and one constant-to-register move for every live-out,
assuming the assertions are met.
(If the assertions are not met, fall back to normal (re-)execution.)
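
A minimal C++ sketch of that optimized form - one assert per live-in, one constant move per live-out, falling back to ordinary re-execution on a mismatch. The types and the 16-register file are assumptions of the sketch:

 #include <cstdint>
 #include <cstdio>
 #include <vector>
 
 struct LiveIn  { int reg; int64_t expected; };  // one assert per live-in
 struct LiveOut { int reg; int64_t value; };     // one constant move per live-out
 
 struct BlockLog { std::vector<LiveIn> ins; std::vector<LiveOut> outs; };
 
 // Returns true if the logged results were safely reused; false means an
 // assertion failed and the block must be re-executed the ordinary way.
 bool verify_replay(const BlockLog& log, int64_t regs[16]) {
     for (const LiveIn& a : log.ins)
         if (regs[a.reg] != a.expected) return false;  // live-in changed
     for (const LiveOut& m : log.outs)    // the moves are all independent,
         regs[m.reg] = m.value;           // so IF of them can retire per cycle
     return true;
 }
 
 int main() {
     // The N-INCrement example: one assert (r9 == 1042), one move (r9 := 1042+N).
     BlockLog log = {{{9, 1042}}, {{9, 1042 + 100}}};
     int64_t regs[16] = {};
     regs[9] = 1042;                      // the load's (verified) value
     std::printf("reused: %d, r9 = %lld\n",
                 verify_replay(log, regs), (long long)regs[9]);
 }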

The second reason for log-based verify re-execution being faster is this:

Normal runahead re-execution is limited by cache associativity.
If you have N independent memory references that missed the cache and were prefetched by the runahead speculative execution epoch,
they are normally all prefetched and available on re-execution
- EXCEPT when they happen to collide in the cache, e.g. because of limited associativity, thrashing each other out.
In which case another runahead/checkpoint/re-execution episode would begin.

Log-based verify re-execution suffers this not at all.
If a cache missing load is thrashed out of the cache,
its value is still present in the log,
and dependent instructions can still be executed - or, rather, verify re-executed.
You will need to pay the memory latency to fetch it and verify, but that can be overlapped.


= Conclusion =

Log-based verify re-execution, or replay, for runahead
is asymptotically faster than ordinary runahead in two senses:
non-cache miss execution latency,
and cache thrashing.

Ultimately, the fashion or fad of runahead petered out.
Runahead did not replace OOO, although it caused much heartburn.

Runahead did not replace OOO because, although asymptotically better than OOO, we aren't at the asymptote.
Execution unit latencies are not zero, and cache miss latencies are not infinite.
OOO handles both.

Also, runahead may consume more power than OOO.  This is somewhat debatable:
OOO's hardware datastructures may waste power, e.g. leakage,
whereas runahead consumes dynamic power re-executing instructions.

Probably the biggest reason why runahead did not supplant OOO in the late 1990s and early 2000s is this:
runahead is only as good as OOO.
It doesn't do anything new.
OOO was an already established and well understood technology by the time runahead appeared on the scene.
Although runahead has putative benefits, they were not enough to warrant a switch-over.

Be this as it may, log-based verify re-execution is a technique that rose out of the [[Runahead versus OOO]] debate.
It is better than runahead in some asymptotic senses.  And it has application to other areas, such as [[SpMT]].

= Hybrids =

Of course it is possible to create hybrids of OOO, runahead, and log-based.

For example, one need not record all instruction results in the log.  One can re-execute that which is fast,
and only record in the log that which is slow.

Similarly, an OOO machine need not record all speculative results in the instruction window.

Onur Mutlu and others have proposed hybrids that fill an OOO instruction window, and then use runahead.

Saturday, January 29, 2011

Cygwin X multidisplay bug

Today I figured out a problem that has been plaguing me on Cygwin for a year or two, since things broke in an update.  (I probably figured it out before, but neglected to record or remember it.)

When I start up XWin -multiwindow, using windows manipulated by MS Windows rather than a desktop in an MS Window window (try saying that fast), the initial xterm was always invisible.

Eventually I realized that if I maximized it from the MS Windows task bar, I could see it.

Apparently, this window was being created at location (0,0) in a bounding box for my multi-display configuration.  (That's multiple physical displays, not multiple X displays.)  However, location (0,0) was invisible - since my multi-display configuration has a big tall portrait-mode display in the middle for reading PDFs, with flat landscape displays on either side for webpages, the portrait display rising above the landscape displays.

By the way, it is actually +1+1 as the default location, not +0+0.  Minor difference.

I need to play around and see if I can find a guaranteed visible place.

Removing the geometry spec seems to work okay for me.

Friday, January 28, 2011

Why_full-speed_denormal_handling_makes_more_sense_for_GPUs_than_CPUs

http://semipublic.comp-arch.net/wiki/Why_full-speed_denormal_handling_makes_more_sense_for_CPUs_than_CPUs

To a collector of computer architecture trivia like me,
it was quite remarkable
that the Nvidia Fermi GPU
provides full-speed support for denormal operands and results,
whereas Intel and AMD CPUs, at the time of writing, do not.

By the way, terminology: [[full-speed support for denormals]] is not quite the same as [[hardware support for denormals]].

The first question is whether you provide hardware support,
or whether it is necessary to [[trap to microcode or software]], to provide the denormal support.

The second question is, if you do provide hardware support for denormals and do not need to trap to microcode or software,
whether that hardware support runs at full speed.
Even if denormals are handled in hardware, there may be a performance cost:
it may cost latency, e.g. 5 cycles for an FADD rather than 4 cycles;
or it may cost bandwidth.
The latency and bandwidth impacts are related but may be decoupled:
e.g. it is possible that a scheduler can arrange so that throughput is not reduced even if latency is increased,
so long as there is sufficient [[ILP]].

By [[full-speed denorms]] we mean mainly that throughput or bandwidth is not affected.
E.g. you may arrange to add to the latency of all FP ops,
to avoid [[writeback port collisions for operations with different latencies]].
Since GPUs are throughput machines, this is usually a reasonable tradeoff;
on some GPUs even integer arithmetic is stretched out to 40 cycles.

So, why would a GPU like Nvidia Fermi provide full-speed denorms,
whereas x86 CPUs from Intel and AMD do not?

Let's skip the possibility of a marketing bullet.

Remember [[GPU-style SIMD or SIMT coherent threading]]?

If a GPU takes a trap, or otherwise takes a hiccup, to handle denorms, then not only is the current thread impacted:
all 16 or 64 threads in the [[wavefront or warp]] are impacted.
And if only one of the [[spatial threads]] in the [[warp]] is taking the trap,
the efficiency of the machine decreases by at least 16-64X for that period.
To say nothing of the possibility that the denorm handling code is also not well suited to a GPU.

I.e. the relative cost of [[denorm handling via trapping]] is much higher on a GPU than on a CPU.
Even denorm handling in hardware, in a manner that impacts latency or throughput,
is relatively more expensive and/or harder to deal with.

Hence: there are good technical reasons to do [[full-speed denorm handling]] in GPUs.
These reasons are not quite so compelling for CPUs
- although I predict that ultimately CPUs will be compelled to follow.

--


    Anecdote: on one of my few trips to Pixar, I talked to Bruce Perens about denorm handling. Apparently Pixar would be happily rendering a movie at high speed, and then Bam! they would have a scene like a night sky, black, full of denorms, and performance would fall off a cliff. We talked about flush-to-zero modes, which Intel x86 CPUs did not have at that time. I suggested biasing the numbers, e.g. adding 1 or similar, to prevent getting into denorm range. Now that x86 has flush-to-zero the need for such [[kluge]]s is greatly reduced. But, denorms exist for a reason. Flush-to-zero can introduce artifacts, and always introduces [[FUD (Fear, Uncertainty, and Doubt)]]. [[Full-speed support for denorms]] may just be the coming thing, a place where GPUs lead the way for CPUs.
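
For concreteness, here is a small C++ demonstration of the denormal cliff and of the flush-to-zero escape hatch on x86. The _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE intrinsics are real SSE controls; the size of the slowdown is machine-dependent, and on hardware with [[full-speed support for denormals]] the two timings would be the same:

 // Times a loop of multiplies whose operand is a denormal. On most x86
 // CPUs (at the time of writing) the first call is dramatically slower;
 // with FTZ/DAZ set, the denormal input is treated as zero and the
 // assist cost disappears. x86-only: assumes SSE float math (the
 // default on x86-64).
 #include <chrono>
 #include <cstdio>
 #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3 header)
 #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
 
 static double time_muls(float seed) {
     volatile float x = seed;                    // volatile: keep the loop honest
     auto t0 = std::chrono::steady_clock::now();
     for (int i = 0; i < 10000000; ++i) x = x * 1.0f;  // denormal operand each time
     auto t1 = std::chrono::steady_clock::now();
     return std::chrono::duration<double>(t1 - t0).count();
 }
 
 int main() {
     const float denorm = 1e-41f;                // below FLT_MIN: a denormal
     std::printf("denormals honored: %.3f s\n", time_muls(denorm));
     _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // results flush to zero
     _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // inputs treated as zero
     std::printf("flush-to-zero on:  %.3f s\n", time_muls(denorm));
 }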


Thursday, January 27, 2011

ISO mediawiki macro

I'm still using mediawiki for my comp-arch wiki.

I want a "macro", something like

   {{acronym_and_term SMT "simultaneous multithreading"}}

that will create pages

   SMT
   simultaneous multithreading
   SMT (simultaneous multithreading)

and redirect them all to
   simultaneous multithreading (SMT)
  

Come to think of it, a primitive
    {{create_redirect_to_current_page Redirect_From}}
would be nice.

I imagine that it would first see if the Redirect_From page already exists.  If not, it would create
it, redirecting to the current page.  If so, it would probably not do anything - which might be a minor lossage if
the Redirect_From page doesn't link to the current page, but which allows things like disambiguation pages.



Perfunctory attempts to find this on mediawiki fail; I'm not sure where to go to ask (and I confess to not liking the wikimedia community).

Processor redundancy: FRC/TMR/QMR RAS

http://semipublic.comp-arch.net/wiki/Processor_redundancy

---

[[FRC]] - [[failure redundant computation]].  A fairly generic term.
However, at Intel prior to 1990 or so it often referred to [[master-checker]] pairs
- microprocessors wired together at the pins,
one chip driving the pins,
the other comparing what it would drive, were it not configured in FRC checker mode,
to what is actually being driven.
If a difference is detected, an error is asserted.

[[TMR]] - [[three module redundancy]] - three processors, voting to choose the outcome.
The loser may be deactivated, failing down to [[FRC]].

[[QMR]] - [[quad module redundancy]] - usually 2 [[FRC]] pairs.  NOT a voting scheme.
One [[master-checker]] pair is designated active, and its outputs are actually used.
The other [[master-checker]] pair is designated inactive.
If the active pair exhibits a difference,
it is failed, and the other pair continues the computation.

The inactive pair follows the computation so that its state will be "hot".
It probably makes sense to compare the inactive pair's results to the active pair's results,
although if there is such a difference between the pairs
but not within the pairs, it is not clear which can be trusted.
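
A toy, single-time-step rendering of the three schemes in C++. Real implementations compare pin-by-pin every cycle; the failover policy shown is just the one described above, and everything else is invented for the sketch:

 #include <cstdint>
 #include <cstdio>
 #include <optional>
 #include <utility>
 
 using Output = uint64_t;   // whatever the processor drives on its pins
 
 // FRC master-checker pair: the checker compares what it would have driven.
 struct FrcPair {
     Output master, checker;
     bool frcerr() const { return master != checker; }  // difference detected
 };
 
 // TMR: three modules, majority vote. (The loser could then be deactivated,
 // failing down to an FRC pair.)
 inline Output tmr_vote(Output a, Output b, Output c) {
     return (a == b || a == c) ? a : b;  // if a matches neither, b == c wins
 }
 
 // QMR: two FRC pairs, no voting. The active pair's outputs are used unless
 // it disagrees internally, in which case the hot standby takes over.
 struct Qmr {
     FrcPair active, standby;
     std::optional<Output> step() {
         if (!active.frcerr()) return active.master;
         if (!standby.frcerr()) {                // fail over to the other pair
             std::swap(active, standby);
             return active.master;
         }
         return std::nullopt;                    // both pairs differ: give up
     }
 };
 
 int main() {
     std::printf("tmr vote = %llu\n", (unsigned long long)tmr_vote(7, 7, 9));
     Qmr q{{5, 6}, {5, 5}};                      // active pair disagrees
     auto out = q.step();                        // standby pair takes over
     std::printf("qmr out = %llu\n", out ? (unsigned long long)*out : 0ULL);
 }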

---

[[TMR]] and other voting schemes require somewhat challenging external voting logic.

[[FRC]] [[master-checker]] pairs require much less external logic: most logic is within the CPU chip.

[[QMR]] is built out of [[FRC]] [[master-checker]] pairs.
It requires no voting logic.
The comparison logic is within the chip, as in [[FRC]] pairs.
You might imagine needing external logic to select which pair's outputs should be used;
however, this may not be necessary
if you trust the internal logic of an FRC pair to disable its outputs.
I.e. if asserting FRCERR from a checker can reliably disable the master's outputs, then no external logic may be needed.

However, such multiple-drivers-per-signal configurations are now deprecated (circa 2010),
so external muxes may be necessary.

---

Above we have been talking about FRC/TMR/QMR between chips.
However, it can be applied to any logic block, potentially within the same chip
(although then whole-chip failures might corrupt both copies).

Similarly, we have talked about doing FRC/TMR/QMR RAS for processors,
but these techniques can be applied to non-processor logic.

Sunday, January 16, 2011

public_comp-arch_wiki_shut_down_because_of_attack_corruption

The title pretty much says it all:
the public comp-arch wiki was shut down because of an attack
that caused corruption
- specifically, pages redirecting to what can only be assumed to be malware-infested websites.

See [[WikiAdminLog: Description of attack on wiki.public.comp-arch.net, November 2011]].

My hope has long been to have a wiki site for computer architecture discussions.
I set up a semipublic area, writeable by me and readable by the world,
and a public area, read/write by the world
(with whatever security mediawiki provides, e.g. Captchas).

The public site has been attacked twice.
Can't really say that the security was broken,
just that the attackers or spammers took advantage of the openness of the wiki.

Shutting it down.
Jan 16, 2011.


http://wiki.public.comp-arch.net/ - public wiki - now shut down

semipublic wiki still okay:
http://comp-arch.net
http://semipublic.comp-arch.net
https://www.semipublic.comp-arch.net/wiki/index.php?title=Main_Page


Description of attack
https://www.semipublic.comp-arch.net/wiki/WikiAdminLog:_Description_of_attack_on_wiki.public.comp-arch.net,_November_2011

This page
https://www.semipublic.comp-arch.net/wiki/WikiAdmin:_public_comp-arch_wiki_shut_down_because_of_attack_corruption

Saturday, January 15, 2011

Dynamic Instruction Rewriting

http://semipublic.comp-arch.net/wiki/Dynamic_instruction_rewriting

[[Dynamic instruction rewriting]] is a microarchitecture optimization that I, Andy Glew, came up with while I was a student at the University of Wisconsin, 1996-2000.
Undoubtedly others have had similar ideas:
I know for sure after (TBD, publication)
and suspect before.
I will describe my version here (TBD add and compare other versions).


    I often refer to this optimization as simply [[flattening]], since that is what it does to the dataflow graph. [[Dynamic instruction rewriting]] was one of several optimizations I grouped as [[instruction refinement]] for the PhD I never completed.



[[Dynamic instruction rewriting]] does what it says - it rewrites instructions,
in particular, flattening the dataflow graph of a computation,
in hardware.
It does this much as a compiler would do, but goes beyond a compiler's capabilities in that it performs these optimizations across basic blocks, procedures,
infrequently taken branches, and other boundaries that might impede a compiler-based optimization.
(In this, I also include [["compiler-like optimizations" performed inside the instruction cache]].)

My flavour of [[dynamic instruction rewriting]] is based on letting instructions flow around a microarchitecture much as values might.

It can be performed in several places in the pipeline,
in particular in the instruction renaming pipestage,
and/or in actual execution.
Also possibly as a [[post-retirement optimization]].

The quintessential example of [[dynamic instruction rewriting]] is this:
imagine an instruction sequence that increments a register.
(Actually, we will change the registers for didactic purposes, as might happen with simple loop unrolling.)
   ...
   r2 := r1 + 1
   ...
   r1 := r2 + 1
   ...
   r2 := r1 + 1
   ...

Let us discuss the renamer pipestage optimization.

The renamer or map pipestage has a table indexed by logical register number.
In a simple in-order scoreboarded machine, the entries of this table may be a single bit indicating whether a register value is ready or not;
in a P6 or [[HaRRM]] style machine, the entries of this table may also contain the physical register numbers that the logical registers are mapped to, with the values themselves delivered later;
in a [[future file]] or [[active register file]], the actual values may also be in this table.

In renamer pipestage [[dynamic instruction rewriting]], the table also includes the instruction - the actual instruction - that writes the result,
or at least a subset thereof (the subset that can be optimized in this manner).

Considering the above code sequence, let us show the state of a P6 or [[HaRRM]] style register renamer, mapping logical to physical register numbers.
The actual physical register numbers do not matter, and could be random - whatever preg is free next.
      
   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11
     r2 -> pr33
   ...
   r1 := r2 + 1
     r1 -> pr44
     r2 -> pr33
   ...
   r2 := r1 + 1
     r1 -> pr44
     r2 -> pr55
   ...

Now let us augment the table with the actual instruction producing the result - but without performing the optimization:


   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr55   pr55 := pr44 + 1

Now let us imagine that the renamer logic is enabled to substitute the input "value" of a register - the instruction - into its output expression:

   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1
     r2 -> pr55   pr55 := pr44 + 1

When the [[dynamic instruction rewriting]] logic sees a value of the form
   (pr11 + 1) + 1
it performs the associative and commutative transformations, producing
   => pr11 + 1 + 1
   => pr11 + 2

and overall

   ...
     r1 -> pr11
     r2 -> pr22
   r2 := r1 + 1
     r1 -> pr11  
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r1 := r2 + 1
     r1 -> pr44   pr44 := pr33 + 1 == (pr11 + 1) + 1 == pr11 + 2
     r2 -> pr33   pr33 := pr11 + 1
   ...
   r2 := r1 + 1
     r1 -> pr44   pr44 := pr33 + 1 == pr11+2
     r2 -> pr55   pr55 := pr44 + 1 == pr11+3

and so on.
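
Here is a minimal C++ sketch of the renamer-table version of this optimization, restricted to the single foldable shape "physical register plus constant". The table layout and all names are invented; a real renamer would also handle free-list management, branch recovery, and values that fall outside the foldable form:

 // Renamer that records, next to each logical->physical mapping, the
 // producing expression in the restricted form basePreg + offset.
 #include <cstdio>
 
 struct Mapping {
     int preg;       // physical register holding the value
     int basePreg;   // producing expression: basePreg + offset
     long offset;    //   (basePreg == preg, offset == 0 means "opaque value")
 };
 
 struct Renamer {
     Mapping map[16];       // indexed by logical register number
     int nextPreg = 100;
 
     void opaque(int lreg) {                 // e.g. the destination of a load
         int p = nextPreg++;
         map[lreg] = {p, p, 0};
     }
     // Rename "lDst := lSrc + imm", flattening chains of add-constant.
     void rename_add_imm(int lDst, int lSrc, long imm) {
         const Mapping& s = map[lSrc];
         int p = nextPreg++;
         map[lDst] = {p, s.basePreg, s.offset + imm};    // the rewrite
         std::printf("pr%d := pr%d + %ld\n", p, s.basePreg, s.offset + imm);
     }
 };
 
 int main() {
     Renamer rn;
     rn.opaque(1);                // r1 -> pr100          (like pr11 above)
     rn.rename_add_imm(2, 1, 1);  // r2 := r1 + 1  =>  pr101 := pr100 + 1
     rn.rename_add_imm(1, 2, 1);  // r1 := r2 + 1  =>  pr102 := pr100 + 2
     rn.rename_add_imm(2, 1, 1);  // r2 := r1 + 1  =>  pr103 := pr100 + 3
 }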


Actually, nothing in the above description depends on the [[dynamic instruction rewriting]] logic being in the renamer pipestage.

If it is in the renamer pipestage, and there is no additional delay for the optimization, we would never see a multi-stage transformation
  pr55:=pr44+1 == (pr33+1)+1 == ((pr11+2)+1) == pr11+3
i.e. if the [[dynamic instruction rewriting]] logic at the  renamer pipestage were fast enough, only the optimized version would propagate.


However, if the [[dynamic instruction rewriting]] logic were in the out-of-order execution system, we might see such multistage  transformations.
(Also, if renamer pipestage logic had a delay.)

Indeed, we might place unrewritten instructions into the out-of-order scheduler,
and trigger a rewrite when one of the operands arrives but not the other,
without waiting for the second or remaining operands to arrive.
I.e. this [[dynamic instruction rewriting optimization]] can take advantage of [[operand inter-arrival time skew]],
indeed, of [[dynamic operand inter-arrival time skew]], when the operand arrival times cannot be estimated in advance.

(Another class of optimizations can be performed when you can estimate [[operand inter-arrival time skew]].)

Consider
    ...
    r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r4 := r1 + r2
    r5 := r4 + r3
    ...

Let all of these operations be placed into an out-of-order scheduler.
(The renamer for this is trivial.)

Now let us assume that the r3 load completes early, r3:=C

    ...
    r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r4 := r1 + r2
    r5 := r4 + (r3=C)
    ...

Now let us assume that the r2 load completes early, r2:=B

    ...
    r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
    COMPLETE B <= r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r4 := r1 + (r2=B)
    r5 := r4 + (r3=C)
    ...

Then let us "fire" the partially complete instruction
    r4 := r1 + (r2=B)
producing


    ...
    r1 := load ( ... ) // cache miss, unpredictably randomized arrival time
    COMPLETE B <= r2 := load ( ... ) // cache miss, unpredictably randomized arrival time
    COMPLETE C <= r3 := load ( ... ) // cache miss, unpredictably randomized arrival time
    r4 := r1 + (r2=B)
    r5 := r4 + (r3=C) == (r4=(r1+(r2=B))) + (r3=C) == r1 + (B+C)
    ...

where B+C is a constant that can be evaluated, and written into the scheduler entry for r5:=.

Therefore, when r1 completes, both
   r4 := r1 + B
and
   r5 := r1 + (B+C)
can be evaluated at dataflow-graph depth 1 - rather than depth 2 if not rewritten.
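
A sketch of the scheduler-resident version, in C++: when a producer add has exactly one operand arrived (a constant) and its consumer add also holds an arrived constant, the consumer is rewritten to depend on the grandparent with the constants folded. The entry format, the single supported shape (dst := srcA + srcB), and the O(n^2) scan are all simplifications of this sketch:

 #include <cstdio>
 #include <optional>
 
 struct Entry {                 // one scheduler entry: dst := srcA + srcB
     int dst, srcA, srcB;
     std::optional<long> valA, valB;
 };
 
 // Called when "tag" completes with "value". After delivering the value,
 // fold constants through chains of adds (only the srcA-link shape shown).
 void wakeup(Entry ents[], int n, int tag, long value) {
     for (int i = 0; i < n; ++i) {
         if (ents[i].srcA == tag) ents[i].valA = value;
         if (ents[i].srcB == tag) ents[i].valB = value;
     }
     for (int i = 0; i < n; ++i) {
         Entry& c = ents[i];                          // candidate consumer
         for (int j = 0; j < n; ++j) {
             Entry& p = ents[j];                      // candidate producer
             if (c.srcA == p.dst && c.valB && p.valB && !p.valA) {
                 c.srcA = p.srcA;                     // depend on the grandparent...
                 c.valB = *c.valB + *p.valB;          // ...with folded constant B+C
             }
         }
     }
 }
 
 int main() {
     Entry ents[2] = {
         {4, 1, 2, {}, {}},   // r4 := r1 + r2
         {5, 4, 3, {}, {}},   // r5 := r4 + r3
     };
     wakeup(ents, 2, 3, 30);  // r3 = C = 30 arrives
     wakeup(ents, 2, 2, 12);  // r2 = B = 12 arrives: triggers the rewrite
     // Both entries now wait only on r1, and fire at dataflow depth 1.
     std::printf("r5 := r%d + %ld\n", ents[1].srcA, (long)*ents[1].valB);
 }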