The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 27, 2011



There used to be a big debate in computer architecture about whether
one should have [[Unified versus Split Integera and Floating Point]].
Both ISA or macroarchitecurally, in the programmer exposed register
file. Or microarchitecturally, in the implementation.

E.g. the [[Intel P6 microarchitecture]] implemented the [[Intel
x86/x87 ISA]], with separate integer and floating point. But the
original P6 microarchitecture implemented this is a unified datapath,
in which both integer and floating point use the same scheduler, and
occupy the same [[PRF]] in the form of the unified [[ROB]], each entry
of which can hold a full 80+ bit x87 FP value. But they are split at
the [[RRF]].

Over time, Intel added [[SIMD packed vector extensions]]. [[MMX]]
shared the x86 registers, both architecturally and
microarchitecturally (in most implementations). But wider SIMD, such
as SSE with 128 bit [XMM]] regisyyers, and AVX with 256 bit [[YMM]]
registers, introduced yet another split architectureal register file.
Microarchitecturally, they may have kept the P6 unified structure, but
the wastage of storing an 8, 16, 32, or even 64 bit scalar in a 128 or
256 bit wide PRF entry becomes more and more of a concern.

AMD, by contrast, has typically had separate integer and FP/SIMD
clusters, with separate schedulers and PRFs and ...

Intel's 2011 processor, Sandybridge, has split integer and SIMD PRFs,
a unified scheduler, but split scalar and SIMD clusters.

It is becoming increasingly obvious that any desire to split datapaths
should be based not on type, not on integer versis floating point, but
should be based on width, intrger or scalar versus SIMD packed
vectors. Narrow data types, 8, 16, 32, 64 bits, versus wide packed
data types, 128, 256, possibly even 512 bits wide. Latency versus

Currently it is pretty standard to separate

* integer, scalar, 8, 16, 32, and 64 bit data. Addresses and branches. Latency sensitive

* SIMD packed vector, both integer and FP. 128, 256, and possibly 512 bit data. Bandwidth.

The only real question is whether one should support scalar FP on a
datapath optimized for low latency, and interaction with branches and
addresses. Or whether scalar FP should only be handled as a single
lane in a vector register, as in SSE, in a throughput manner.

Now, there will always be workloads where FP scalar latency will
matter. For that matter, there will always be workloads where vector
latency matters. Nevertheless, this division into narrow latency and
wide throughput datapaths seems to be a trend.

It will always be desirable to be able to access the bits of an FP
number, whether scalar or vector, in an ad-hoc manner. It will always
be desirable to branch and form addresses based on such accesses. But
these desires will not always outweigh the separation into narrow
latency and wide throughput clusters.

Power concerns now are paramount. Split datapaths naturally save
power; but unified datapaths may be able to save power in an even
finer grained manner. Much depends on the granularity possible with
[[clock gating]] (which can be quite fine grained, e.g. to bytes of a
datapath) and [[power gating]] (which in 2011 is quite coarse grained,
unlikely to be done within the width of a unified datapath). For that
matter, separate narrow/latency and wide/throughput clusters would
probably be easier to clock at different frequencies, with different
coltages - although as of 2011 this does not seem yet to be done.

I can make no hard and fast conclusion: although there is a trend
towards split narrow/latency and wide/throughput datapaths, there will
still be occasional winning attempts to unifiy. Much depends on the
evolution of power, frequency, and voltage management, in terms of

Why non-inclusive/non-exclusive caches?

Sheesh! I can't usually take all the time this article needed to write in one sitting.



[[Non-inclusive/non-exclusive (NI/NE)]] caches are used in
[[multilevel cache hierarchies]] that are, duh, neither [[inclusive]]
nor [[exclusive]].

I.e. there is no enforcement of either the [[cache inclusion
property]] nor the [[cache exclusion property]]. A cache line in an
inner cache may or may not be in an outer cache. You can't be sure.
If necessary, you must check.

Because there is no [[cache inclusion]], both the inner and outer
caches must be checked, i.e. snooped or probed, when responding to
requests from other processors.

= Historical Background - Accidental Inclusion in the Original [[P6]] =

Probably the most widely known processor that has an [[NI/NE]] cache
is the [[Intel P6 family]].

The original P6 had an external L2 cache, on a [[back-side bus]]. In
an ascii diagram (TBD, I need drawing in this wiki):

[[L2$]] <---> P([[I$,L1$]]) <---> [[FSB]]

In more detail

[[L2$]] <---------> [[SQ]] <---> [[FSB]]
| | |
v v v
[[I$]] [[P]] [[L1$]]

Most of the symbols used in the diagrams above are standard; as for
those that are not:

* [[SQ (super queue)]] - on the [[P6 Pentium Pro]] die, coordinated interaction with the [[Front side bus (FSB)]], the back side buss and the off-chip [[L2]], and [[DCU]], mainly interacting with the [[fill buffers (FB)]].

* [[FB (fill buffers)]] - a staging area under the control of the [[data cache unit (DCU)]] of the P6 cpu core, talking to the [[bus cluster]] and hence the [[SQ]], [[L2$]], and the outisde world via the [[FSB]], and the caches of the [[processor core]], the [[IFU]] and [[DCU]] arrays. Also talked directly to the processor core, to minimize latency.

The original P6 had a completely external [[L2$]]. Not just the
[[cache data array]]s, but also the [[cache tag array]]s.

Now, there were several reasons why the original P6 had [[NI/NE]] or
[[accidental inclusion]]:

== Backside L2$ ==

Because the L2$ was on a backside bus, it was not necessarily faster
to access than the L1 caches, the [[IFU]] and [[DCU]]. In fact, you
can easily imagine circumstances where it might have been slower to
[[snoop]] or [[probe]].

    Note; having the L2$ tags on-chip might have made it possible to probe the L2$ faster than the L1$.

== [[Deterministic Snoop Latency]] ==

One of the design goals of the original P6 was to have [[deterministic
snoop latency]] - to be able to respond within a relatively small
fixed number of cycles to an snoop from remote processors on the

[[Cache inclusion]]] would get in the way of this. You would first
have to prove the L2$, and then, if a hit, probe the L1$. This
naturally leans towards a [[variable snoop latency]], and slows the
snoop down.

In the long run I think that [[deterministic snoop latency]] must be
considered a worthy goal, but a dead-end. I think even the original
P6 gave up on it, providing it in most cases, but also provided a
snoop stall as a way of handling cases where it could not be provided.
Nevertheless, the goal of [[dererministic snoop latency]] influenced
the design.

== L2$ relatively large compared to the L1$ ==

The external, off-chip but inside package, L2$ for the original P6
Pentium Pro was 512Kib, 1Mib, with some versions as small as 256KiB.

The DCU or L1 data cache was die dieted over the life of the project
from 64KiB in our dreams to 16 or 8KiB.

Thus, the L2 was typically 32x bigger than the L1 data cache,
and similarly wrt the instruction cache.

Although Intel has experience with [[exclusive cache]]s from earlier
processors, with a 32:1 or 16:1 size ratio, there is little point in
using an exclusive cache to try to increase cache hit rate. The
increase in the hit rate from having circa 3-66% more data is small.
Most of the time the data in the L1$ is included in the L2$ buyy

This property Wen-Hann Wang called [[accidental inclusion]]. Wen-Hann
Wang was the leader of the P6 cache protocol design, and the inventor
(in his PhD) of [[cache inclusion]].

== Simplifying Design ==

In truth I must admit that we started P6 expecting to build an
[[inclusive cache]] hierarchy. However, Wen-Hann's performance data
showed that [[accidental inclusion]] because of the relatively large
L2$ removed most of the performance incentive for inclusion. And
moving to [[NI/NE]] considerably simplified the cache protocol,
removing many boundary conditions.

For example, how do you handle cases where a cache line is being made
dirty in the L1 exactly as it is being evicted from the L2? ... It is
not impossible to handle these cases. But [[accidental inclusion]]
made it just plain easier to leave them decoupled.

    By the way, at the time on P6 we did not really have a term for the cache inclusion properties of the P5 L2, or lack thereof. Wen-Hann called it [[accidental inclusion]], but that term has stuck only in Glew's memory. The term [[non-inclusive/non-exclusive (NI/NE)]] was created years later.

== [[Silent eviction of clean data]] ==

It is, of course, always necessary to perform a [[writeback]] when
[[evicting dirty data]].

When [[evicting clean data]], however, one can silently drop the clean
line. It is not always necessary to inform the outer cache.

Now, this isn't 100% linked to [[cache inclusion]] or [[NI/NE]]. One
can perform [[silent eviction of clean data]] in an [[inclusive
cache]]. However, doing so means that any [[inclusion tracking]] that
the outer cacheis performing, so that it knows which cachelines in the
outer cache are inside which inner cache(s), may be inaccurate.
I.e. [[possibly inaccurate inclusion tracking ]].

[[Accurate inclusion tracking]] is a [[slippery slope]]. It is not
necessary, but it has several nice properties. Worse, some people
implicitly assume [[accurate inclusion tracking]] in any [[inclsuive
cache]] design.

[[Accidental inclusion]] or [[NI/NE]] avoids this slippery slope
altogether. Which not only simplifies design, bit also avioids the
sort of incorrect assumptions that often arise on such a [[slippery slope]].

== Allocation in L1$ without L2$ interaction ==

[[Accidental inclusion]] enabled or made easier certain cache protocol
optimizations, in particular [[eliminating unnecesary RFOs]] for
full-line cache writes, without requiring L2 involvement.
Especially as used in [[P6 fast strings]].

This probably was not a design goal for most of the people involved,
but it was something I (Andy Glew) was happy to take advantage of.

The basic observation is that a sizeable fraction of all cache lines
are completely overwritten, without any of the old data being read.
Therefore, reading the old data via an [[RFO (read for ownership)]]
bus request is wasted bandwidth.

In operations such as [[block fill]], e.g. zeroing a page of memory,
50% of the data bus bandwidth is wasted by such unnecessary RFOs. In
[[block copy]], 33%, (Assuming all miss the cache - not that bad an
assumption for an 8 or 16Hob L1$, but also not unreasonable even on
many larger caches, for many [[block memory operations]].)

Even in non-block memory code, unnecessary RFOs occupy a significant
fraction of data bus bandwidth. I have measured speedups of 10-20% on
many simulated workloads, and even on a few [[FIB-edited chips]].

    The biggest challenge with [[eliminating unnecessary RFOs]] is maintaining [[memory ordering]], in particular [[store ordering]]. It is easy to do on a [[weakly ordered]] system. I have invented several techniques for maintaining memory ordering, some of which are [[NDA]], but some of which can be discussed here. When I get around to writing them up. In my [[copious free time]].

The original P6 eliminated unnecessary RFOs using the [[deferred
invalidation cache protocol]] for [[fast strings]] [[block memory
operations]]. The bandwidth benefit was clear; unfortunately, the
overhead of microcode, in the absence of microcode branch prediction,
made fast strings not always the best approach. Since the original P6
[[fast strings]] have been hit and miss: when somebody in microcode
gets around to tuning them they can be a win, but if untuned they may
be a loss compared to the best hand rolled assembly code.

P6's [[WC memory type]] also eliminated unnecessary RFOs, but was not
allocated in the cache. Similarly the [[SSE streaming stores]] added
in later processors.

The [[deferred invalidate cache protocol]] continued in use all the
way to Intel's Nehalem processor, the last of the P6 family. Nehalem,
however, added an inclusive [[LLC]] with snoop filtering, killing the
deferred invalidate cache protocol.

So much for history. The point remains: not having an inclusive cache
enables certain cache protocol optimizations, such as [[eliminating
unnecessary RFOs]], that are much harderto buildin an inclusive cache

== Optionally Removing the L2$ ==

The [[NI/NE]] cache structure makes it easier to optionally disable or
remove the [[L2$]]. This option has been explored several times.

Now, it is possible to do this with an inclusive cache structure by
"faking out" the absent L2$, always returning "Yes, the cache line is
included in all possible internal caches", such as the instruction and
data and other caches.

But it is very easy, in an inclusive cache hierarchy, to slip dfown
the slippery slope and assume that the L2$ is present. And then to
take advantage of the L2$ being present in a way that breaks when the
L2$ is absent. [[NI/NE]] seeems to avoid this.

== Enabling [Funky Caches]] and Decoupling Associativity ==

In an [[inclusive cache]] system, one tends to want the associativity
of the outer, inclusive, cache to be greater than or equal to the sum
of the associativities of the inner caches it "covers".

E.g. if you have a 16 way associative outer cache that is inclusive of
8 inner caches, all of 4 way associativity, and if the usual indexing
of the cache using bitfields is performed (which usually implies a
1:many relationship between inner sets and outer sets), then it is
possible that for some set So of the outer cache mapping to sets Si of
each of the inner caches, each inner cache may want to have a distimct
line in each of the 4 ways. Implying that the inner caches may "want"
to have 8*4 distinct lines. Which exceeds the 16 way associativity of
the outer cache.

[[NI/NE]] caches do not have this constraint on associativity.

Furthermore, when playing the [[Funky Cache Rag]], exploring
structures such as [[victim caches]] or [[prefetch caches]] or
specialized caches for different data types, [[cache inclusion]] once
again gets in the way. Often these [[Funky Caches]] are fully
assiociative. They do a great job of reducing glass jaws related to
cache associativity - e.g. a 16 or 32 entry victim cache makes a 4 way
cache look like it has a much higher effective associativity. But a
32 entry fully associative victim or prefetch cache may only be able
to use 16 entries, if all entries map to the same way in the outer
cache. [[Cache inclusion]] prevents some of the worst [[glass jaws]]
that these caches can fix, from being fixed.

Now, these interactions are not necessarily large, at least in average
effect, although they can be a fertiole source of [[glass jaws]].

However, they do couple the design of the outer cache and the inner
cache. They mean, for example, that the outer cache design team
cannot simply change associativity without checking with the inner
cache design team, which may be making assumptions based on the
earlier outer cache design team [[POR]] for associativity. It means
that the outer and inner caches are more tightly coupled. Which can
lead to more efficient designs. But which can also make it slower and
harder to innovate.

== Backwards Invalidation Traffic ==

There are two main methods to maintain [[cache inclusion]]:
[[backwards invalidation]], and the considerablly less practical
[[outer cache victim identification by inner cache]]. I will ignore
the latter.

[[Backwards invalidation]]s are simple and straightforward. But they
also consume power, affect performance on an outer to inner cache
interconnect that may be saturated. And, in any case, are an
incremental source of complexity.

[[NI/NE]] cache hierarchies eliminate [[backwards invalidation]]s.

Now, in the 2011 Intel processors that motivated this article,
Sandybridge, with an inclusive snoop filtering [[LLC]] but a [[NI/NE]]
[[MLC]], backwwards invalidations are still required for LLC
evictions. But not for MLC evictions.

Let's talk some scenarios:

In workloads that use mainly read-only data, [[NI/NE]] allows [[silent eviction of dirty data]], but [[cacheinclusion]] requires BIs.

In workloads that use a lot of shared data, there may be multiple BIs for every inclusive cache eviction.

Etc. etc.

It is by no means certain that power saved by not [[probe filtering]] is not outweighed by the power saved by eliminating BIs.
It all depends on the type of multiproceessing in the system.
If the coherency traffic from beyond the outer cache is small relative to the cache thrashing inside the system,
e.g. if you have 16 cores crunching on large data serts, but no other multicore CPU chips and relatively small I/O,
BI traffic may outweigh snoop filttering.
But if you are building a very large shared memory system with many other CPUs and much cache coherent I/O traffic,
then inclusion may win.

= Are [[NI/NE]] caches still important? =

Obviously, since [[NI/NE]] caches are still in use in the Intel
Sandybridge processors being released at the time of writing in 2011,
they are still important in a practical sense? But is this just
[[legacy computer architecture]] dating back to the original P6? If
they were desiged from scratch, would [[NI/NE]] still be built

Maybe... lacking access to full simulation results, it is hard to say.

However, many of the advantages of [[NI/NE]] relate to making design
easier, decoupling inner and outer cache design. This is probably one
source of the longevity of the [[P6 microarchitecture]], which is now
coming to an end with Sandybridge.

Certainly, the old advantage of [[eliminating unnecessary RFOs]] no
longer stands. But that was always one of the most minor advantagesof

Much depends on the system. [[Probe filtering]] [[coherency traffic]]
makes a big difference in large multiprocessor systems with many outer
caches. It also helps with large amounts of high bandwidth cache
coherent I/O, as sometimes arises with ill-tuned [[discrete GPU]].
(Most discrete GPUs are tuned to use [[non-cache coherent I/O]], if
accessing [[UMA]] or processor memory at all.)

But as multicore systems get more cores,
and GPUs are largely integrated,
there may be less bandwidth on the other side of the [[outermost cache]].

== Where does cache inclusion win? ==

=== Snoop or probe filtering ===

One big win for cache inclusion is [[snoop filtering]].

However, such [[probe filtering]] does not necessarily require that
the cache data be inclusive. A data-less [[probe filter]] or
[[ownership cache]], a structure that allows ownership to be tracked,
and which permits lines to be exclsuively owned even though in main
memory may provide the benefits of [[probe filtering]] without the
costs of cache inclusion on a relatively small data cache. At the
cost of a new [[hardware data structure]], the [[ownership cache]].

Basically, the ownership cache maintains the cache inclusion property,
not necessarily the data cache. Since the ownership cache maintains
only a few state bits per cache line, the ownership cache can covedr
much more memory than an ordinary data cache. It is therefore less
liable to the issues of cache associativity.

=== Power management ===

Probably the most important win for [[cache inclusion]] is wrt power
management - really just an elaboration of [[snoop filtering]].

If you have [[cache inclusion]] you can power down all of the inner
caches, putting them into a low power mode that maintains state, but
which does not snoop.

Whereas if you have a [[NI/NE]] cache, you either must keep snooping
the inner caches, or you must flush the inner caches when going into
such a [[no-snoop cache mode]].

Now, you always needto be able to do such cache flushes. But they
take time, and if they are not required you can enter tye low power
mode more quickly, and more often, and thereby save more power.

Note that an inclusive LLC in combination with a NI/NE [[MLC]] and [[FLCs]]
gives the best of both worlds in this regard.
Note also that a datalerss [[snoop filter]] also works.

= Conclusion =

Many people, taught in school about [[inclusive caches]], are
surprised to learn how common and important [[exclusive caches]] and
[[non-inclusive/non-exclusive caches]] are, or at least have been, in

It is by no means clear that any approiach is always best. All 3
approaches have pros and cons, and win in some situations.

Cache replacement - tweaks and variations

{{Terminology Term}}

[[Cache replacement]] or [[cache victim selection]] is the method one uses to choose which cache line will be overwritten or replaced, i.e. victimized, when a new cache line is pulled in.

Common algorithms include
* [[Random replacement]]
* [[LRU (least recently used)]]
** approximations to [[LRU]:
*** [[pseudo-LRU]] or [[tree LRU]]
*** [[the clock algorithm]]

= Minor Tweaks =

== When is the cache [[LRU information]] or [[cache usage information]] updated? ==

E.g. do you update it speculatively, for instructions that may be on a [[branch misprediction wrong path]], thereby corrupting the LRU?

Should you update at retirement? (Probably not, but just in case.)
:By the way, here is a reason to update at retirement: so that you get more [[deterministic behavior]], which eases validation. However, this only works if you do not do [[speculative cache misses]].

Perhaps you should only update it for slightly speculative instructions:
* e.g. do not update for highly speculative [[SpMT]] threads?
* e.g. update only for isntructions that have not retired, but for which all earlier branches have been resolved. (Such instructions are stil speculative - there may be a page fault or exception or interrupt - but they7 are only [[slightly speculative]]).

== What requests update the LRU information? ==

Should all requests, reads and writes, update the LRU information equally?

Many have proposed [[non-temporal hint bits]] in instructions, to say "ignore this access".

It is an open issue whether prefetches, whether initiated by [[prefetch instruction]]s or by [[hardware prefetchers]],
should update the LRU.

Multilevel caches:
* update the LRU on all accesses
* on misses
* on [[LRU leak-through]] from the inner cache
* only on capacity evictions from the inner cache (suggested in comp.arch by EricP on 2/25/2011)

By the way, the idea of updating the cache usage information only on capacity evictions from the inner cache
exposes several issues:

    Updating the LRU information and advancing the LRU pointer are two different issues. Invalidation traffic may result in several coherency cache misses between capacity misses. It would be bad to keep choosing the same victim. Oftentimes coherency misses do not need a victim chosen: they fill into the empty or stale or non-present line left behind by the invalidation. (May not happen in all systems.) Dirty writebacks naturally notify the outer cache of a capacity replacement. However, replacing a clean line may not naturally require such notiification: i.e. we may have [[silent replacement of clean cache lines]]. This may require [[LRU leakthrough]] so that the outrr cache can track.

== When is the victim chosen? ==

Should you choose the victim at the time the cache is missed,
or at the time the data to be placed in the cache line has returned?

It may not matter on an in-order machine. However, on an out-of-order machine, chooising the victim early may raise isssues, such as what should happen if too many cache misses to the same set occur
- i.e. if the victim itself needs to be victimized before the first victim's replacement has arrived.

One advantage of choosing the victim early is that you may be able to send the data for the [[cache line fill]]
returning directly to where it belongs in the cache - you may not have to stage it through a [[fill buffer]],
and you may be able to avoid fairly expensive [[fill buffer forwarding]] logic.
I.e. you may be able to have [[data-less fill buffers]].

    Half baked Idea: choose the victim early. But also allocate a fill buffer. At [[fill time]], determine if the early victim choice is still accurate. If not, write to the fill buffer. (I call this a half baked idea because it doesn't really solve the problem of wanting to avoid data fill buffers. Elaboration: choose the victim early. Allocate a [[address and control fill buffer]], but do not allocate a [[data fill buffer]]. If the victim is thrashed, allocate a [[data fill; buffer]] (which may simply be a [[spill buffer]] allocated circularly. At fill time, choose.

== Biasing the victim choice ==

* Prefer invalid lines, rather than replacing valid lines.
* Prefer clean lines to dirty lines
:: Thereby avoiding [[dirty writeback]] traffic.

* in multilevel caches, prefer lines that are in none of, or the fewest, inner caches

* prefer to replace lines containing data to lines containing instructions
** or vice versa - although oftentimes [[I$]] misses are more expensive than [[D$]] misses

* extend the above to more data types - integer versus FP (FP is often accessed in a cache thrashing manner)

= Better cache replacement algorithms =

There has been much work on trying to adjust these algorithms.

Ideally, we know that [[Belady's algorithm]] is optimal for many assumptions.
It amounts to "replacing the cache line that will be used furthest in the future".
It must be adjusted when there are non-uniform costs.

    There have been attempts to [[approximate Belady cache replacement using speculation and lookahead]]. E.g. do not replace a line that is used in an instruction window. This works better when you have a really large instruction window, such as might be provided by [[SpMT]]. It also may not be necessary if the [[LRU bits]] or [[cache usage information]] is updated speculatively. Others have attempted to unroll memory access pattern predictors, to get a list of predicted accesses, against which a Belady query can be made

See [[Victim Choice for Multilevel and Shared Caches]] for a discussion of issues in multilevel cache victim selection.
One of the main issues is that in a [[multilevel cache hierarchy]] the LRU bits of the outer caches are not adjusted
by accesses to the inner caches, so choosing a victim based purely on LRU bits updated only by accesses sent to the outer cache
is often not good.
Many proposals [[leak-through LRU]] information, to allow the outer cache to track the inner LRU.

It is well known that certain access patterns are not well suited to LRU cache replacement.
For example, circularly accessing N lines, in a cache of M lines, M < N, is better suited by MRU cache replacement than LRU.
Many have proposed to exploit this, e.g. by [[non-temporal]] hint bits attached to instructions,
or by predictors that attempt to identify such non-temporal cache access patterns.

Reading List for Computer Architecture (seed)



I am often asked for a reading list for computer architecture.

Unfortunately, there aren't many good books in computer architecture.

= [[Hennessy and Patterson]] =

[[Hennessy and Patterson]] is the current must-read, since its first publication
but that is not really a comprehensive book or survey of the field,
as a text book suitable for classes, espousing a quantitative design philisophy.

In fact, one of the reasons I am trying to create the comp.arch wiki is as an antidote to Hennessy and Patterson.
I want to collect a survey or grab bag of many techniques not mentioned in [[H&P]],
or potentially deprecated by them.
Many students raised on H&P, particularly the early versions, developed a common attituide
- they were [[RISC bigot]], and deprecated any advanced microarchitecture such as [[OOO]],
since they felt sure that the [[simple 5-stage pipeline]] was the be-all and end-all.
It is not fair to ascribe these viewpoints to Professors Hennessy and Patterson
- their views are much more nuanced
- but nevertheless this viewpoint was common.

= Random Grab-bag =

This list was assembled from a recent email to an undergraduate (2011).
TBD: collect past recommended reading lists.

Hennessy and Patterson is required reading. However, I don't much like any of the textbooks - which is why I am working on my wiki, sort of as an antidote.

You pretty much have to read the conferences and journals:
* [[ISCA]]
* [[ASPLOS]]
* [[Micro]]
* [[HPCA]]
* [[Hotchips]]
* [[ISSCC]] - usually lower level than computer architecture, but oftentimes the only information you can get about industry chips

I suggest reading every issue of [[Microprocessor Report]] you can, although I warn you that there was a period when MPR was owned by EE Times when it was not very good. Back in the 80s and 90s, MPR was the bible, by installments; and since The Linley Group bought it back it has become much better than it was when EE Times owned it.

* the old [[Microprocessor Forum]] conference

As for reading the conferenves and journals, a good place to start would be the "Best of" issues:

* 25 Years of ISCA, Selected Papers, ed Guri Sohi

* 20 years of ACM SIGPLAN conferencvve on PLDA, 1979 - 1999, ed McKinley.

Other books on my shelf:

* Mead Conway VLSI - old, but classic. Read it.

* Laura Prince, Semiconductor Memories -old; I wish there was a better book on DRAM. Probably better to read papers.

* Bell and Newell,
* Bell Sieworik and Newell,
:: Computer Structures
::- this was the classsic when I was in school. Very old now.

* Mike Flynn, Computer Architecture
* Blauuw and Brooks, Computer Architecture

* Pfister - In search of clusters.

Many many instruction set manuals

* the IBM 360 Principles of Operation

* Hacker's Delight

* Inside the AS/400

Organick's books on computer hardware
* TBD: list

* The Soul of a New Machine

Use a library and used book store. I bought, and still buy, a lot of 2nd hand books.

* read manuals on http://bitsavers.org

= Best of Lists and Collections =

* papers that won the ACM SIGARCH and IEEE-CS TCCA ISCA Influential Paper Award, http://www.sigarch.org/influential_paper.html

= Bibliography Surfing =

Many students, especially graduate students assemble personal bibliographies and reference lists of papers they encountered in their research.
The bravest students may annotate the references, with discussions of value.
Many such lists can be found on the web.
Use them shamelessly for inspiration - but beware that your opinions may not agree with the bibliographers./

Unfortunately, more experienced professionals, professors, etc.
have usually learned to only provide positive recommendations,
and to never make public anti-recommendations
about authors who may eventually be recommending funding for the reviewer.

Still more experienced professionals may recommend papers that they dislike,
because not mentioning them would be an obvious dis.

Monday, February 21, 2011

Pattern History Invariant Branch Prediction

    Taking a vacation day - actually, a so-called "floating holiday" on today, USA President's Day - so that I could be with my daughter, who is off school, and so that I could pick up my wife at the airport.

{{Terminology Term}}
[[Category:Branch Prediction]]

= Terminology =

Many branch predictors have a property that I call [[Pattern History Invariant Branch Prediction]].
By this I mean that the cannot change their prediction for a branch, e.g. from taken to not-taken, or vice versa, without having an intervening misprediction.

    Okay, I admit: "Pattern History Invariant" is a bad name. It is pompous, and not very intuitive. I welcome proposal for a different name. Something like "Branch predictor that can't change it's mind without a misprediction"?
      == Bad Joke == Oooh, how about "Can't change its mind without a fight"? I'm tempted to call it a Scottish, Irish, or a "My Wife" branch predictor - but I'll get in trouble for any and all of these, and only mention them because I am humor impaired.

= Why does this matter? =

== [[Timing Invariant Branch Prediction]] ==

Pattern history invariance can be a useful property, because it makes it easier to build
[[Timing Invariant Branch Prediction]].
Which has several good properties, such as making val

In particular, it means that delaying updating the pattern tables used to make predictions until retirement will NOT result in any more mispredictions than would otherwise occur
- any such mispredictions would already be on the wrong path, younger than a misprediction that would occur no matter what.

Delaying the pattern table update from, e.g. branch execution to retirement has the possible benefit that [[wrong path]] execution will not corrupt the branch predictor.

However, it may be pointed out that it is possible that there will be a performance loss because of other benefits of execution on the multiply speculative wrong path that speculatively updating a branch pattern history table may bring.
But to get this, you must have several nested effects:
* there must be a first branch misprediction pending
* a younger branch may update the predictor tables
* this may result in a misprediction underneath the first branch misprediction
* which may itself update the predictor tables
** possibly eliminating a misprediction when execution resumes after the first branch misprediction is repaired
** possibly producing instruction prefetch and other [[good wrong path effects]]

I.e. with [[Pattern History Invariant Branch Prediction]] there must be a misprediction to result in a change of prediction.
However, that misprediction may itself occur, and be repaired, speculatively.
And that misprediction may have other [[good wrong path effects]].

== [[BTB Unrolling]] ==

[[Pattern History Invariant Branch Prediction]]
also enables [[BTB Unrolling]]:
give a current (IP,history) tuple
and a separate pattern table used to make a prediction
one can "unroll" the predictor to produce a trace
of multiple, N, branch (from,to) addresses.
And such a trace unrolling is guaranteed to be as accurate as if the predictor were accessed one branch at a time.

= Examples of Pattern History Invariance =

= Examples of non-Pattern History Invariance =

I was tempted to say
    The most powerful example of a non-pattern history invariant branch predictor is a [[loop predictor]]. E.g. it may predict that the loop closing branch is taken 99 times in a row, and then not-taken the 100th time. Even more advanced loop predictors, such as predicting that if the last few loops were executed were 67,66,65, times, then the next will be executed 64 times, are even less pattern history invariant.
But this is not true: such a loop predictor will always make the same prediction, if the current loop count
is considered to be part of the history that used, along with the branch in question, to index
the pattern tables.

It is necessary to work ahrder to contrive branch predictors that are not pattern history invariant.

TBD: I believe that certain McFarling-style branch predictors may break the pattern history invariance property.
TBD: get the details right.

In particular, in a multi-predictor system such as McFarling,
if you update the predictors and choosers that are NOT being used as well as the predictors and choosers that are being used,
you can "mask" a misprediction in a second predictor,
if a first predictor was being used.
If the choice of which predictor should be used can then change in the pattern table so that
when the same (IP,history) pair is used at a later time the second predictor wins...

= See Also =

* [[Branch predictor state update]]
* [[Timing Invariant Branch Prediction]]

Timing Invariant Branch Prediction

{{Terminology Term}}
[[Category:Branch Prediction]]

By [[Timing Invariant Branch Prediction]] I mean branch prediction that is independent of timing.

E.g. a branch predictor design that, if you slowed down the clock or changed the pipeline, e.g. by inserting idle pipestages,
would not vary.

Such invariance is very convenient
(1) for validation
(2) for separation of concerns - it allows you to change the pipeline without worrying about its effect on branch prediction accuracy, etc.

However, many microarchitectures do not adhere to this design principle.

Tweaks and adjustments to the branch prediction microarchitecture may be necessary to attain [[Timing Invariant Branch Prediction]].
For example
* Per-branch history often leads to timing dependent branch prediction, which can be remedied by [[path history]]
* Updating the branch history tables can lead to timing dependent branch prediction

= Per-branch history often leads to timing dependent branch prediction =

For example, if you are using a [[per-branch history]] such as a [[TNT branch history]],
in many pipelines there are several clocks between instruction fetch,
i.e. delivering an instruction pointer to the instruction cache and branch predictor,
and instruction decode.
This means that an instruction fetch/branch prediction time you do not know where the branch instructions are;
you only know where the branch instructions are at decode time.

Therefore, unless you stall,
branch prediction may be using a [[per-branch history]] that is several cycles out of date,
and which may miss several branches
from what it would be ideally using if branches could be identified instantaneously.
Moreover, how many branches are missing may depend on timing, such as [[I$]] misses, pipeline stalls, etc.

== Path history enables timing independent branch prediction ==

If it is a problem to have timing dependent branch prediction
caused by per-branch history, this can be assuaged by [[path history]] branch prediction.

Instruction fetch does not necessarily know where the branches are. However, it necessarily does know the sequence of instruction fetch addresses.
If it is possible to create a path history suitable for use in a branch predictor,
e.g. by XORing the instruction fetch pointers,
then this [[path history]] is accurate and timing invariant.
Since XOR hashing is fast, this probably can be acheived.

However, XORing all fetch IPs may not be the best [[path history]] to use in branch prediction.
Creating a hash, probably XOR based, of [[branch from/to addresses]],
suffices to describe the path - although it losses information about branches between instruction fetch blocks.
Hashing such a [[branch from/to addresses]] [[path history]] with the current instruction fetch pointer
is about as good as you can do at instruction fetch,
without identifying individual instructions.

== Combining ... ==

[[Path history]] based branch predictors are usually reported as being more accurate than [[per-branch history TNT]],
so the pipeline adjustments above may help performance as well as providing [[Timing Invariant Branch Prediction]].

However, if they do not, you can obtain a hybrid that provides many of the benefits of [[timing invariant branch prediction]]
along with the possible improved accuracy of [[per-branch history]]:
* use [[path history]] at instruction fetch
* use [[per-branch history]] at the decoder, in a form of [[late-pipestage branch prediction]]

This gives you timing invariance,
but it also gives the [[decoder branch predictor]] the chance to make corrections to the earlier branch prediction.

= Branch History Update Time =

Q: when should the prediction tables, the [[pattern history table (PHT)]], also sometimes called the [[branch histogram]] or [[branch history table (BHT)]],
be updated?

At execution time, or at retirement time.

Updating at retirement time enables [[Timing Invariant Branch Prediction]]

Updating at execution time may cause only minor issues on an in-order machine.
On an out-of-order machine, however, updates may be done out of order.
In either case they may be done speculatively,
for branches that will not actually be retired because of earlier misspeculations.

Furthermore there arises the question of what history or [[stew]] is used to update the pattern history table.
If every branch at execution carries its history or stew with it, no problem.
But if a big complicated history is maintained only at retirement,
some processor designs have updated the pattern table for branches at execution
with a history corresponding to a position in the instruction stream several cycles before the branch.
Not necessarily a consistent number of cycles, either.

Updating the prediction tables a tretirement time seems to avoid these issues.

TBD: performance cost

TBD: latencies of table update - immaterial if [[pattern history invariant branch prediction]].

= Conclusion =

Is [[Timing Invariant Branch Prediction]] an absolutely vital design goal?

Not necessarily - if performance is increased by timing variant branch prediction, so be it.

However, [[Timing Invariant Branch Prediction]] is definitely a nice thing to have:
* it makes validation much easier
* it makes the design more robust, less fragile, less likely to break if you have to add a pipestage or stall late in the design cycle.
* it is usually associated with higher performance rather than lower performance branch prediction algorithms
* it can usually be achieved by a hybrid predictor design.

It is my experience that most timing dependent branch predictors
happened by accident, rather than by design:
* naive designers building a [[per-branch history]] out of a textbook
* naive extension of in-order designs to out-of-order exacerbating unnecessary timing dependence

[[Timing Invariant Branch Prediction]] is not necessarily a must-have,
but it is always a good thing to keep in mind when designing
your branch predictor and your microarchitecture/pipeline.
It is a pity to lose its benefits due to ignorance rather than deliberation.

Friday, February 18, 2011

CSE (Common Subexpressions) in HW andSW


CSE is an optimization that recognizes common sub-expressions, and saves them and reuses them rather than recalculating them.


var1 := (a*b) + c
var2 := (a*b) + d;

can be rewritten as

tmp := a*b
var1 := tmp + c
var2 := tmp + d

This is obvious, right?

It is less obvious when you have, e.g. complex addressing modes such as (base+index*scale+offset).
Should you CSE part of address computation?

Sometimes it is cheaper to recalculate than it is to store, allocating a register that may spill other registers.

[[Memoization]] is a form of hardware CSE, where an expensive execution recognizes recently performed calculations and avoids reperforming them.

[[1/x inverse instructions]] provide greater opportunity for CSE than do ordinary [[divide instructions]].

[[Micro-optimization_primitive_instructions]] provide more opportunity for [[CSE]]ing.

[[Hardware common-sub-expression elimination]] has been proposed - by me (Andy Glew), in my abandoned Ph.D. on [[instruction refinement]],
if by no one elsse. In fact [[nearly all Dragon-book compiler optimizations can be doone in hardware]].

[[Instruction reuse]] is a hardware technique that extenddsCSE dynamically over moderate to large distances.

Micro-optimization primitive instructions


By [[micro-optimization primitive instructions]] I mean instructions that expose the internal steps and partial results of what are otherwise considered to be primitive instructions.

One of the best academic and published examples is in the Dally paper referenced at the bottom:
[[Micro-optimization of floating-point operations]].
However, other examples have been found in actual instruction sets, particularly before transistor counts increased to the point where fully pipelined floating point [[FADD]] and [[FMUL]] instructions were common.

= Examples of Floating-Point Micro-Optimization Instructions (from Dally paper) =

Examples from Dally paper: (transcribed not quite exactly)

;Automatic Block Exponent
:Identifying the largest exponent is cascaded additions, aligning all mantissae and adding without normalization.
* Saves: power of intermediate normalizations and roundings, unnecesary shifts
* Cost: Precision, unless extra mantissae width.See also [[block floating-point]], [[superaccumulator]].

;Shift Combining
:An alternative to automatic block exponent used in casxes where order of operations must be preserved: avoids repeatedly shifting left for normalization, and then shifting right for alignment.
;I.e. combines normalization and alignment shifts.
* Saves: power in redundant shifts.
* Cost: precision, sometimes

;Post Multiply Normalization
:Maintain extra guad bits to the right of mantissa, to avoid normalizzation shift, since multiplication of normalized numbers can denormalize by at most one bit position. (Denorm*norm can produce greater shifts.)

;Conventional Optimizations
(A + B) * (A - B)
can [[CSE (common sub-expression)]] the alignment of A and B.
* Saves: power
* Cost: the exposed micro-optimization, possibly requiring a few extra bits of mantissa and exponent.

:Exposing primitive operations permits more effective optimization by compiler (or, for that matter, by a dynamic or [[OOO]] scheduler).

Dally suggests a fairly horizontal instruction format for these microoptimizations, consisting of
* Exponent Ops
** E+, E1 - exponent add/subtract (EC <- EA op EB) ** FF1 - returns shift required to normalize mantisssa MA (EC <- FF1 MA) * LDE, STE - load or store exponent as integer * Mantissa Ops ** M+, M-, M* - mantisssa add, subtract, multiply (MC <- MA op MB) ** SHR, SHL - mantissa shift left or right; supports [[shifting by a negative count]] in either dirrection ** ABS, NEG - zeroes and complements the mantissa sign bit ** LDM, STM - loads and stores mantissa as integer ** LDF, STF - load and store mantissa and exponent together as standard floating point number * Branch Ops ** BNEG - [[branch on exponent negative]] ** Bcond - [[branch on exponent and mantissa compare]]] (EA,MA) relop (EB,MB) = Other Possible Micro-optimizations = == Booth Encoding == Many processors use Booth encoding and multiplierarrays, for both integer and floating point. Booth encoding involves (a) preparing multiples of one operand. Unfortunatyely, the actual multiples depeend on how advanced the multiplier is - {-1,0,+1}, {-2,-1,0,+1,+2}, 3X, 4X, etc. (b) using the second operand selecting which multiples of the first operand are to be fed into the array for each digit group. We can certainly imagine [[CSE]]ing the work in preparing such multiples, for eother operand. Unfortunately, the multiples needed change with the exact multiplier used - one microarchitecture might require produced -1X and 3X multiples (1X and 2X are "free", shifting - while another might require 5X, etc. Therefore, the number of bits needed might change from chip to chip. We can imagine storing these large intermediate results in a SIMD packed vector, via an instruction that looks like: vreg256b <- booth_encode(64b) Heck - that might be a useful operation to provide even if not doing microoptimization. It has uses. The details of the Booth encoding used might be hidden. However, perhaps an easier approach might be to do this microoptimization in microarchitecture: e.g. have a [[dynamic scheduler]] recognize several multiplications that share a common first operand, and then decide to skip the [[Booth encoding]] pipestage. The skip might shorten the pipeline - or it might just be used to save power. This amounts to [[memoizing]] the Booth encoding. == Virtual Memory Translations == There are many, many, repeated translations of nearby virtual addresses that result in the same physical page address. ; CacheTranslation with Base Registers E.g. Austin and Sohi suggested caching the translations next to the base register, and thereby avoiding TLB lookup, and supporting more translatuons on a low ported TLB.

    Todd M. Austin and Gurindar S. Sohi. 1996. High-bandwidth address translation for multiple-issue processors. In Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96). ACM, New York, NY, USA, 158-167. DOI=10.1145/232973.232990 http://doi.acm.org/10.1145/232973.232990
Equivalently, power can be saved.

; Instructions to Save Translations
We can imagine that this might be exposed to software as an instruction that looks like
treg := [[translate_virtual_address]]( virtual address )
dreg := load_using_physical_adddress( treg + offset )
store_using_physical_address( treg + offset)

The first instruction saves the translation, and any permissions bits required, in treg.
treg might be an ordinary address sized integer register, or a special register to hold such a translation.

Note: this instruction, [[translate_virtual_address]], is often requested even outside the context of micro-optimization.
See [[uses of an instruction to save virtual to physical address translation]].

The load and store operations use a saved translation. I would imagine that they would trap if the memory location specified by offset and sizze fell outside the saved translation.

NOTE: in a [[virtual machine]] environment, a Guest OS executing this instruction would get at most a partial benefit.

;User Mode Instructions to Save Translations
Instructions such as [[load_using_physical_address]] can only be used by privileged code - if software can construct arbitrary physical registers.

But if the saved translation lives in a special treg, translation register, that can only be written by certain instructions,
then unprivileged code could employ this instruction.
Equivalently, this could be a [[privilege tagged capability register]] - possibly an ordinary register, with an extra bit that
gets set by the translate instruction.
The physical memory accesses would require that bit to be set.
Other operations on such a register might clear the tag bit.

This is how [[capability registers]] work for other capabilities;
physical addresses are just an extra capability.

Except... physical addresses can change, e.g. during [[virtual memory]] operations such as [[page fault]]s and [[COW]].

(1) we could expose the [[translation registers]] to the OS, which can treat them as an extended TLB, changing them as necessary.
: This is not unlike what old OSes needed to do when [[multitasking using base registers]].

(2) However, we might then need to re-translate. This could be accomplished by storing the original virtual address inside the treg [[[translation register]].

I.e. we have circled back to Austin and Sohi - except that instead of the physical address being a cache, it is more exposed.

Not completely exposed - it is a [[covert channel security hole]] to expose physical addresses.
But it is an explicitly acknowledged, albeit opaque, data quantity.

== Division Step, etc. ==

Any "step" operation, such as divide-step, multiply-step, sqrt-step,
can be considered as a micro-optimization primitive.

With the concomitant problem, that the algorithm, may vary between versions of a processor.

== TBD: other micro-optimizations ==


= References =
W. J. Dally. 1989. [[Micro-optimization of floating-point operations]]. SIGARCH Comput. Archit. News 17, 2 (April 1989), 283-289. DOI=10.1145/68182.68208 http://doi.acm.org/10.1145/68182.68208

W. J. Dally. 1989. Micro-optimization of floating-point operations. In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS-III). ACM, New York, NY, USA, 283-289. DOI=10.1145/70082.68208 http://doi.acm.org/10.1145/70082.68208

Saturday, February 12, 2011

RISC versus CISC


* [[RISC (Reduced Instruction Set Computer)]]
* [[CISC (Complicated Instruction Set Computer)]]

[[RISC]] and [[CISC]] have been discussed elsewhere, outside this wiki, in great detail.
I will not try to add much to what has already been written.

However, although I am a RISC sympathizer, I may take a contrarian stance, talking about some of the issues and problems with RISC.
The RISC philosophy was certainly influential. However, in many ways it failed.
I may try to talk about those failures.
And how the so-called [[RISC Revolution]] warped a generation of computer architects.

= Conclusion =

I did not really want to write this article. There is not much to add to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised, although they did bring some improvements.

However, I could not really write a wiki on computer architecture without mentioning [[RISC versus CISC]].'

I would prefer, however, to discuss interesting aspects of RISC computer architecture, that may not be discussed in many other places,
than to rehash this old debate:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is considered obligatory to explain
how a feature supports or opposes the [[[RISC philosophy]].

= Background =

The late 1970s and early 1980s were an era of increasing complexity of increasing capability and complexity in computers.
The mainframe IBM 360 family was reaching the peak of its influence.
The relatively simple PDP-11 minicomputer
was supplanted by the more complex DEC VAX.
Microprocessors were marching forward: the Intel 8086 started along the road to world domination in 1978,
the Motorola 68000 started along the road to elegant failure in 1979.
Intel had been talking about the
[http://en.wikipedia.org/wiki/Intel_432 Intel iAPX 432]
microprocessor for a while, with features such as bit granular instructions,
garbage collection in hardware and microcode,
a software object model supported by hardware, etc.
People were trying to solve the so-called software gap with hardware, by building computers that more closely mapped
whatever language implementation they were targeting.

Complexity seemed increasing.
In actuality, much of the software complexity was moved into microcode.
Woe betide languages and operating systems that did not match the language and operating system use cases the computer was designed to support.

= The RISC Revolution =

Into this came a bolt of sanity:
    David A. Patterson and David R. Ditzel. 1980. The case for the reduced instruction set computer. SIGARCH Comput. Archit. News 8, 6 (October 1980), 25-33. DOI=10.1145/641914.641917 http://doi.acm.org/10.1145/641914.641917
Heck - it was not even published in a refereed journal!

One must also mention the [[IBM 801 minicomputer]], arguably the first RISC.

The late 1980s and early 1990s were full of RISC computers, especially microprocessors: [[Power]], [[PowerPC]], [[MIPS]], [[SPARC]], [[Motorola 88K]].
Not to mention many more failed companies.
All seemed destined to replace the IBM mainframe, the VAX minicomputer and,
then, as the importance of the PC market grew, the PC microprocessor.
But only the 68000 Apple PC fell to the PowerPC RISC onslaught.
The VAX and other minicomputers died out.
But the IBM mainframe, and the Intyel x86, continued, the latter spectacularly.

Now, by 2010,
* IBM is slowly transitioning to [[Power]], but the IBM Z-series mainframe stays strong
* the PC world is still ruled by Intel [[x86]]
** Intel's RISC and VLIW projects fell by the wayside
* [[PowerPC]] and [[MIPS]] have been relegated to [[embedded computing]]
* [[ARM]] is the biggest RISC success story, in embedded and, particularly, in low power, cell phones, etc.
** [[ARM]] is likely to be Intel's big competition
* Sun [[SPARC]] survives, barely - but Sun itself got swallowed by Oracle
** Sun/Oracle x86 boxes predominate even here
* DEC is long dead, as is [[Alpha]]

= What Happened? =

Many RISC companies went after the high priced, high profit margin, workstation or server markets.
But those markets got killed by [[The March of the Killer Micros]], specifically, the x86 PC microprocessor.
It is telling that ARM is the most successful "RISC", and that ARM targetted embedded and low power, the low end,
lower than the PC market, rather than the high end.

[[PowerPC]] and [[MIPS]] made a concerted attack on the Intel x86 PC franchise.
Microsoft even provided Windows NT support.
But when Intel proved that they could keep the x86 architecture competitive,
and stay the best semiconductor manufacturer, the "RISC PC" withered.

Intel proved they could keep the x86 PC microprocessor competitive in several stages:
* the i486 proved that the x86 could be pipelined.
** Up until then one of the pro-RISC arguments was that CISCs were too complicated to pipeline. But, see the next section
** I was thinking about Motorola 88Ks about this time, when the i486 started being talked about, and I realized - RISC had no fundamental advantage
* the Intel P5/Pentium did in-order superscalar
* the Intel P6, first released as the Pentium Pro, then Pentium II, did out-of-order
** briefly, the fastest microprocessor in the world, beating even DEC Alpha

Some say that the P6 killed RISC.

A more nuanced view is that RISC was a response to a short term issue:
the transition from board level to chip level integration.
When only a small fraction of a board level computer could fit on a chip,
RISC made more sense. When more could fit on a chip, RISC made less sense.

Not no sense at all. The RISC principles always have value.
But less sense, less of a competitive advantage.
Unnecessary complexity is always wasteful.

Moving on...

= The CISCs that survived were not that CISCy =

The CISCs that failed - DC VAX and the Motorola 68000 - were the most CISCy.
Most instructions were variable length.
Some frequently used instructions could be very long.
Many instructions had microcode.
Many operations had side effects.
They had complicated addressing modes - elegant in their generality, but coompliocated, sometimes neceessitating microcode just to calculate an address.

The CISCs that survived - the IBM 360 mainframe family, and the Intel x86 - were not that CISCy.

Sure, they had some complicated, almost impossible to implement without microcode, instructions.
Just consider x86 FAR CAL through CALL GATE.

However, most of their instructions were simple:
ADD src_reg += dest_reg_or_mem.
True, Note [[load-store]].
But [[load-op]] pipelines were not that hard for the IBM mainframes, and the Intel i486 and P5, to implement.
And the Intel P6 showed that the microarchitecture could be [[load-stiore]], even though the ISA was not.

The IBM 360 family has relatively simple instruction decoding.

The Intel x86 has painful instruction encodings, ranging from a single byte up to 15 bytes.
But the most frequently used instructions are small.
The really complicated instruction encodings, with prefixes, proved possible to (a) implement in hardware, but (b) at a significant performance cost, on the order of 1 cycle per prefix originally.

Most important, these instruction sets had few side effects.
Yes, the x86 has condition codes.
But most x86 instructions overwrite all of the arithmetic condition codes (INC and DEC not affecting the carry flag being the notable exception).
This avoided the sort of RAW hazard on the condition codes that would have been required for an OOO implementation of, say, the Motorola 68000.

= Didn't RISC win anyway? =

Did not RISC win anyway? After all, doesn't a modern CISC processor translate to RISC uops internally?

Well, yes and no. Let's look at some of the RISC principles proposed in early papers

* fixed length instructions - 32 bit
** at the [[macroinstruction set level]], not so much
** at the [[microinstruction set]] or [[UISA]] level, maybe
*** maybe- but definitely not "small". Is it a RISC if the (micro)instructions are 160 bits wide
*** even here, the recent trend is to compressing microcode. Some are small, some take more bits.
** the most popular surviving RISC instruction sets have 16 bit subsets to increase code density

* simple instruction decoding
** ISA level, no
** UISA level - undocumented. Probably. But, again, very wide!!

* software floating point
** nope!!

* large uniform register set
** in the early days, not so much
** over time, the register set has grown. As has the complexity of the ISA encoding, [[REX bytes]] etc.

* small number of instructions
** definitely not!!!
** microcode instruction sets have long been full of many instructions, particularly widget instructions
** even macroinstruction sets have increased dramatically in size since 1990. More than quadrupled.

Some have said that the point of RISC was not reduced instruction count,
but reduced instruction complexity.
This may be true - certainly, this was always the argument that I used *against* rabid RISC enthusiasts who were trying to reduce the number of instructions in the instruction set.
But nevertheless, there were many, many, RISC zealots and managers who evaluated proposals by counting instructions.

= What's left? =

The most important intellectual survivor of the RISC philosophy has been to aversion to complicated microcode sequences.
Expanding instructions to 1 to 4 [[uop]]s may be okay,
but taking a cycle or more to branch into a [[microcode sequencer]], perform the operation, and branch back is a performance penalty nobody wants to take.

The number of instructions in instruction sets has increased dramatically over the past decade.
But the vast majority of these are instructions that can be implemented directly by hardware,
by combinatoric logic,
or by simple state machines in the case of operations such as divide.

One may conjecture that the original RISC aversion to floating point
was actually to microcode floating point:
when the most important floating point operations like [[FADD]] and [[FMUL]] became pipelined,
capable of being started one or more per clock
even though the operation took several cycles of latency,
the objections to FP on a RISC died away.

== Bad Effects ==

We are still paying the cost of certain RISC-era design decisions.

For example, Intel MMX is irregular.
It does not have all the reasonable combinations of
unsaturated, signed and unsigned saturation.
This irregularity was NOT because of hardware complexity,
but because management was trying to follow RISC principles by counting instructions.
Even when providing all regular combinations would have made the hardware simpler rather than harder.
(Aside: validation complexity is often used as an argument here, against regularity. The complexity of validating all regular combinations grows combinatorically.)

AMD x86-64 is not a bad instruction set extension.
But life might be easier in the future, if x86 does not die away, if it had been more regular.
More RISC-like.
But in this way RISC was its own enemy:
RISC did not achieve the often hoped for great performance improvements over CISC.
RISC reduces complexity, which does not directly improve performance.
So people went chasing after a Holy Grail of VLIW performance, which also did not pan out.

= Conclusion =

I did not really want to write this article. There is not much to add to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised, although they did bring some improvements.

However, I could not really write a wiki on computer architecture without mentioning [[RISC versus CISC]].'

I would prefer, however, to discuss interesting aspects of RISC computer architecture, that may not be discussed in many other places,
than to rehash this old debate:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is considered obligatory to explain
how a feature supports or opposes the [[[RISC philosophy]].

Wednesday, February 09, 2011

Non-speculative or less speculative versus more speculative


When discussing [[speculative execution]] techniques such as [[speculative multithreading]] (or [[skipahead]], or [[slipahead]], or ... )
we often need to talk about less speculative and more speculative threads or instructions.

E.g. a later, younger, more speculative load instruction may be blocked by an earlier, older, less speculative store instruction to the same address.

E.g. in [[SpMT]] or [[SkMT]] a less speculative thread may fork a more speculative thread.

Often in early work we only talk about non-speculative threads forking,
and do not discuss the possibility of a speculative thread forking.
Late, it is realized that it is entiirely reasonable for speculative threads to fork.
Thus, often we need to read "non-speculative" in early work,
and implicitly substitute "less speculative".

E.g. a common [[speculative thread management policy]] is to always keep the least speculativethreads,
and cancel any more speculativethreads,
when there is an opportunity to spawn a less speculative thread.


Note that "less speculative" is not necessarily the same as "older".
E.g. in a [[fork-on-call]] thread, certain instructions may beindependent of the skipped code,
and be guaranteed to be executed;
whereas other instructions, inside a skipped function,
or older in the [[Von Neumann instruction sequence]],
but are actually more speculative, as in less likely to be executed.
However, it is often too hard to track this, so often discussions
will implicitly assume "less speculative" is "older".


Note that it is not always possible to determine the order of speculative threads or instructions.
E.g. one may fork loop bodies out of order, with no known order,
and later string them together as memory items iterated on by the loop are encountered.
However, arbitrarily imposing an order simplifies many speculative algorithms,
even at the cost of potential performance.


* See[[spawning versus forking a thread]].

Skipahead Multithreading


[[Skipahead multithreading (SkMT)]] is a form of [[speculative multithreading (SpMT)]]
characterized by "skipping ahead" at certain points in
[[non-speculative or less speculative versus more speculative|non-speculative or less speculative execution to more speculative execution]].

Typically these "certain points" in thecode are places
where there is a well characterized [[control independence or convergence]] point:
* the instruction after a CALL instruction
* end of loop
* later iterations of loop
* IF convergence

I, Andy Glew, coined the term [[SkMT]]
when it became evident that the term [[SpMT]],
which was itself coined by Antonio Gonzales and promoted by me,
was more generic.
I.e. you can imagine creating speculative threads
that do not really skip that far ahead,
but which, e.g. execute past a place where execution would bee blocked,
either an in-order blockage, or where an OOO window would be full.
See [[non-skipahead speculative multithreading]].

    Oooo.... I just created a new term: [[slip-ahead multithreading]]. It rather nicely encapsulates what I just described, and is consistent with published project names such as [[slipstream]].

In much the same way, I had earlier used the term [[implicit multithreading (IMT)]],
and replaced it by [[speculative multithreading (SpMT)]],
which I am now (after, 10 years ago, 2000) specializing to [[skipahead multithreading (SkMT)]].

The term [[skipahead]] is intended to be contrasted with [[lookahead]],
a term which was once used to characterize all [[out-of-order (OOO)]] execution.

    Look-ahead processors. Robert M. Keller, Princeton. ACM Computing Surveys, Vol 7, Issue 4, Dec 1975. http://citeseerx.ist.psu.edu/viewdoc/download?doi=

Sunday, February 06, 2011

Managing branch prediction history: copying short history versus pointing to large history


When [[branch prediction history]] was short, 8-12 bits, and global,
then it was not unreasonable to manage the history by copying it.

E.g. in one simulator I actually arranged so that a branch [[uop]]
wrote  back, as its result
* an indication of whether it was mispredicted or not
* the taken [[target IP]]
* the branch predictor history to be restored on a branch misprediction.

I.e. the branch prediction history was propagated, in this simulator, from the [[instruction fetch front end]]
across the scheduler to execution, and back again.

While simple, this involves a lot of unnecessary data movement - both for the history, but also for the [[target IP]].
Most machines of my acquaintance create a [[branch information table (BIT)]], holding information for branches in flight.
This avoids copying the history from the front end to execution and back again,
but nevertheless may involve making copies of  the history.

    I confess an ulterior motive for copying the history and other branch information around: this naturally leads to the number of branches in flight scaling with the window size. Many early OOO designs were crippled by supported too few branches in flight.

Making copies of the history seems silly if, as in a TNT history, they differ only by 2 bits:
 new_history := (old_history << 1) | new_branch_taken_or_not

    (Note: when shifting a branch history, we will often say h<

But making copies seems to be required if we want to be able to restore to any mispredicted branch point, i.e. if we want to do [[instantaneous versus incremental branch misprediction repair]].

Making copies works well enough if the branch prediction history is small enough - 8 bits, etc.
But by the time we are talking about 16 bit histories and 32 or 64 bit IPs, we are talking about quite a few bits.

Furthermore, in the  late 1990s and early 2000s branch predictors arose with  hitherto unseen long histories,
such as Seznec's OGEHL predictor (with multiple history lengths 9, 2, 4, 8, 16, 32, 64, 128, ...).
Copying around a 128 bit history, even to a [[BIT]], is wasteful;
copying around the even larger 1000+ bit histories that have been proposed is even worse.

Hence the interest in pointing to a position in a branch prediction history,
rather than copying the entire history.
On a branch misprediction one would restore the pointer, rather than overwriting the  history with a savedcopy.

This would  be straightforward for a long [[TNT]] history, since only 1 bit depends on any branch.
You would simply keep a history of total length TL=PHL+BIF, the sum of predictor history length plus branches in flight.
On a misprediction you would restore the pointer in this circular buffer.
(Or equivalently shift the buffer- I suspect that shifting is too power hungry.)

This could scale to almost any length of history, potentially thousands of bits.

Unfortunately, modern [[stew]] histories are more complicated than [[TNT]] histories.
The [[branch IP]], or even both [[from IP]] and [[to IP]], may be [[hashed, e.g. XORed]] into the stew.
This means that several of the youngest bits in the history may change on every branch.
Simply restoring a pointer will not suffice.

TBD: explain stew management in more detail.

Simple strategy:  constrain the stew to have only the N youngest bits affected by the most recent branch.  Bits TL..N are unaffected  by the most recent branch, except for shifting.
One can then keep a copy of the parts of the history affected by recent branches, the N youngest bits, and a pointer that locates the older bits.

= See Also =

* [[How to use a really long predictor history]]

Branch Prediction Stew

{{Terminology Term}}

The [[stew]] is a form of history used by certain branch predictors.

See, for example, US patent 7143273,
Method and apparatus for dynamic branch prediction utilizing multiple stew algorithms for indexing a global history,
Mile, Slade, and Jourdan,
filed March 31, 2003,
assignee Intel.

A simple [[branch predictor history]] might be  a simple [[TNT]] history,
with 0s corresponding to non-taken and 1s corresponding to taken.
Such a simple TNT history cannot distinguish some convergeing paths,
and indirect branches.

describes one embodiment of a stew as
 stew = ((stew << 1)|new_bit ^  ip)

A stew formed  in this way can distinguish converging paths such as  ("if true" means a condition that evaluates as true, not an unconditional branch):
   L1: if true got L2
   L2: if true goto L99
   L10: if true goto L11
   L11: if true  goto L
   L:  if ?? goto L99

However, it does not distinguish multiple  branch targets and paths out of an indirect branch, such as
   L1: Reg:= IL1; if true goto L2
   L2: if true goto L99
   L10: Reg:=IL2; if true goto L11
   L11: if true  goto L
   L:  if ?? goto [Reg]
   IL1: ...
   IL2: ...
It can be seen that mixing in arc information as well as node information remedies this situation,
and distinguishes different paths so long as the hashes do not collide:

 stew <<= number_of_bits_to_discard
 stew = hash( stew, from_IP, to_IP, taken/not_taken, ...)

Issue: how many bits to use?  Which may vary as a function of the type of branch: e.g. a direct conditional branch
may not need as many to_ip bits to be hashed in
as a completely random indirect branch.
Similarly, indirect calls and returns may be handled separately.