The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, August 04, 2010


Attempting to describe some microarchitectures in text-only email to a friend inspires me to post this. It would be much better to have lots of drawings illustrating precisely what I am talking about; but this textual notation is fairly compact, can be understood by some people, and in some ways, because it underspecifies certain configurations, allows us to talk without getting bogged down in multiple different ways of interconnecting the blocks. I.e. sometimes drawing is TOO specific.

= Basic Pipelines =

Let's start out with in-order pipelines
and out-of-order pipelines

A typical in-order pipeline might look like
 InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
with pipestage names
* BP = branch prediction
* I$ = instruction cache
* RD = register read
* D = decode
* X = execute
* L1$ = L1 data cache
* WB = writeback

A prototypical out-of-order pipeline might look like
 OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
with additional pipestages
* RR = register rename
* S = schedule
* IW = instruction window - ROB, etc.

Where are the PRF reads?

On a P6-style read-before-schedule & capture machine:
 OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW

On a HaRRM (1987) or Wmt-style read-after-schedule machine:
 OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW

I am not going to go into details about this, or the ROB/PRF, because that's not my topic of discussion here.

Actually, the execution units may be superscalar, or they may share the same single port.
I could contrive ways of drawing this, but again it's not my real point.
Similarly, the L1$ is often accessed in parallel with the first few cycles of X.
I might depict this as
  X || L1$
or just
  X L1$
rather than
  X -> L1$

In fact, eliding the arrows overall makes things less cluttered
A typical in-order pipeline might look like
 InO = BP I$ D RD X L1$ WB
 OoO = BP I$ D RR S X RD L1$ IW

and it allows us to throw other stuff on, such as the L2$
 InO = BP I$ D RD X L1$ WB       L2$
 OoO = BP I$ D RR S X RD L1$ IW    L2$

we can stretch the notation to indicate store queues, etc; it is nice to use the DEC/AMD term LSU, even though Intel tends to split them up

= SMT =

SMT, symmetric multithreading, like Intel Hyperthreading, runs multiple threads on the same out-of-order pipeline.
 SMT2(OoO) = SMT2(BP I$ D RR S X RD L1$ IW)    L2$
Not indicated: the fact that I$ is shared, but renamer is replicated, etc.

We might also have SMT in-order machines, like Intel's first generation Atom:
 SMT4(InO) = SMT4(BP I$ D RD X L1$ WB)       L2$

By comparison, we can emphasize the single threaded nature of the orignal pipelines:
 ST(OoO) = ST(BP I$ D RR S X RD L1$ IW)    L2$
 ST(InO) = ST(BP I$ D RD X L1$ WB)       L2$

= Multicluster =

The original multicluster microarchitectures that I am familiar created wider superscalar machines by replicating the execution units.  They were multicluster in that they did not have uniform bypassing: some might bypass within the cluster fast, but impose an extra cycle bypassing to other clusters; some might not permit bypassing at all, except perhaps through special inter-cluster instructions.

We can attempt to draw this in ascii art as

 MC(OoO) = BP I$ D RR S X RD L1$ IW    L2$

 MC(InO) = BP I$ D RD X L1$ WB         L2$

but I will instead draw it as

 MC(OoO) = BP I$ D RR S MC2(X) RD L1$ IW    L2$
 MC(InO) = BP I$ D RD MC2(X) L1$ WB         L2$
The question always arises as to how much is replicated in the clusters:
 MC(S X)
 MC(S X RD L1$)
This last will be revisited in MCMT.

Execution units can be shared between clusters, like the FPU in AMD Bulldozer.
It's a pain to depict this in ascii art, so I won't bother here.

Should the renamer be inside the clusters?  Depends. It's easier to maintain single thread semantics if the renamer is outside, and similar the LSU:

 MC(OoO) = BP I$ D RR MC2(S X RD L1$) LSU IW    L2$

But since the S X LSU unit is as critical as the S X L1$ loop, it is natural to put it inside. But then coordinating the two clusters becomes a pain.  You might, e.g., copy all stores to both clusters' LSU.
I don't know how to draw that in ascii art.

 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU) IW    L2$

You start down the slippery slope of complexity if you have an L1 LSU iside the cluster for speedd,
and an L2 LSU outside the cluster to coordinate things.
Note that the LSU2 need not have any data.
 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU1) LSU2 IW    L2$

We'll assume that IW is outside the cluster for single thread

= MCMT =

Once you have multiple clusters, you can run multiple threads on them.
One thread per cluster is natural. It permits the nice simplification of not having to do bypasses between clusters.

 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$) IW    L2$

With separate threads, an LSU inside the cluster is natural
 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$ LSU) IW    L2$
but the stuff talked about above may apply.

For separate threads, it might be natural to put IW inside the cluster.  But IW, instruction window management such as ROB, is not in the critical loop.  If you put IW outside, you might be able to share non-uniformly, e.g. giving simple threads 1/4 of the IW, and complex 3/4.
I call one way of doing this a [[segmented sequential instruction window]].

= MCMT and SMT =

You could make each cluster itself SMT:


although now you start driving the ifetch prety hard.

= Speeding Up Single Threads =

SMT and MCMT are ways of supporting multiple threads. It would be nice if they could speed up single threads.

MC is used for single threads.  However, it is a challenge to do scheduling for truly decoupled MC microarchitectures.  One additional cycle of bypassing can be afforded, but it gets harder the more decoupled the clusters are.

My old, old, name for using multiple threads to speed up a single thread was IMT, implicit multithreading.

SpMT, speculative multithreading, is the form of IMT I am most familiar with,
There are other forms of IMT - more fine grained than SpMT, with executes contiguous sections of the instruction such as loop bodies and regions separated by CALLs and RETurns.

IMT and SpMT need to run on a multithreaded substrate.
Above we have described two multithreaded substrates
with the obvious addition of
Any of which can be InO or OoO. MCMT can be built with SMT inside the clusters.
And MP, multiprocessor, separate cores, can be built out of single threaded cores, multithreaded cores, etc.
 MP =

You can therefore run, or at least think about running, IMT or SpMT on any of these multithreaded substrates:

Overall, SpMT(MP(ST)) is hard.  It is hard to overcome the inter-core latencies when forking threads.

MP(SMT) is doable - indeed, it is Haitham Akkary's DMT.
However, SMT is often limited to only 2 threads, which really does not give you much opportunity to
speed things up with SpMT. And the speculative threads tend to interfere with the non-speculative thread. First, do no harm.
::SpMT(SMT2) - low potential

I motivated MCMT as a way of building a multithreaded microarchitecture for SpMT, that might be able to support more threads.  Alternately, I motivated MCMT by itself, without SpMT, as something intermediate between SMT and M.

SpMT(MCMT(ST)) is okay, but there is still probably overhead to go between clusters, felt on forking.
By the way, SpMT(MCMT(ST)) probably has lots of other stuff added to preserve sequential semantics.
Won't discuss that here, although I will mention that I am found of log based execution.

SpMT(MCMT(SMT2)) is a potential sweet spot.  I think the normal operating mode would be with one thread per cluster, but the SMT2 allows a "run thread" to be used to fork off the non-speculative thread, or a less speculative thread.  The runt thread could cycle steal transferring state from cluster to cluster, and thus minimize slowing down the non-speculative thread.

SpMT(MP) is hard, as mentioned above.

But SpMT(MP(SMT2)) is another sweet spot - with the runt thread trick again played to help forking.

So, we have two attractive configurations:
with the inner SMT2 used to speed up forking.

SpMT(MCMT(SMT2)) is probably smaller,
but  SpMT(MP(SMT2))
makes better use of the many, many, cores we have nowadays.

Combining both in
is imaginable,
but it is more complex than doing just
Not worth doing unless there is some big win.
Or unless you want to fill a product line hole with MCMT(SMT2)
- e.g. something larger than a single core ST or SMT2(St), but smaller than MP2.

= Multilevel Schedulers, etc. =

TBD: S2 S1 schedulers, etc.

TBD: batch scheduling across clusters versus fine grain de-interleaved scheduling.

= See =

Wiki at