The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 27, 2011



There used to be a big debate in computer architecture about whether
one should have [[Unified versus Split Integera and Floating Point]].
Both ISA or macroarchitecurally, in the programmer exposed register
file. Or microarchitecturally, in the implementation.

E.g. the [[Intel P6 microarchitecture]] implemented the [[Intel
x86/x87 ISA]], with separate integer and floating point. But the
original P6 microarchitecture implemented this is a unified datapath,
in which both integer and floating point use the same scheduler, and
occupy the same [[PRF]] in the form of the unified [[ROB]], each entry
of which can hold a full 80+ bit x87 FP value. But they are split at
the [[RRF]].

Over time, Intel added [[SIMD packed vector extensions]]. [[MMX]]
shared the x86 registers, both architecturally and
microarchitecturally (in most implementations). But wider SIMD, such
as SSE with 128 bit [XMM]] regisyyers, and AVX with 256 bit [[YMM]]
registers, introduced yet another split architectureal register file.
Microarchitecturally, they may have kept the P6 unified structure, but
the wastage of storing an 8, 16, 32, or even 64 bit scalar in a 128 or
256 bit wide PRF entry becomes more and more of a concern.

AMD, by contrast, has typically had separate integer and FP/SIMD
clusters, with separate schedulers and PRFs and ...

Intel's 2011 processor, Sandybridge, has split integer and SIMD PRFs,
a unified scheduler, but split scalar and SIMD clusters.

It is becoming increasingly obvious that any desire to split datapaths
should be based not on type, not on integer versis floating point, but
should be based on width, intrger or scalar versus SIMD packed
vectors. Narrow data types, 8, 16, 32, 64 bits, versus wide packed
data types, 128, 256, possibly even 512 bits wide. Latency versus

Currently it is pretty standard to separate

* integer, scalar, 8, 16, 32, and 64 bit data. Addresses and branches. Latency sensitive

* SIMD packed vector, both integer and FP. 128, 256, and possibly 512 bit data. Bandwidth.

The only real question is whether one should support scalar FP on a
datapath optimized for low latency, and interaction with branches and
addresses. Or whether scalar FP should only be handled as a single
lane in a vector register, as in SSE, in a throughput manner.

Now, there will always be workloads where FP scalar latency will
matter. For that matter, there will always be workloads where vector
latency matters. Nevertheless, this division into narrow latency and
wide throughput datapaths seems to be a trend.

It will always be desirable to be able to access the bits of an FP
number, whether scalar or vector, in an ad-hoc manner. It will always
be desirable to branch and form addresses based on such accesses. But
these desires will not always outweigh the separation into narrow
latency and wide throughput clusters.

Power concerns now are paramount. Split datapaths naturally save
power; but unified datapaths may be able to save power in an even
finer grained manner. Much depends on the granularity possible with
[[clock gating]] (which can be quite fine grained, e.g. to bytes of a
datapath) and [[power gating]] (which in 2011 is quite coarse grained,
unlikely to be done within the width of a unified datapath). For that
matter, separate narrow/latency and wide/throughput clusters would
probably be easier to clock at different frequencies, with different
coltages - although as of 2011 this does not seem yet to be done.

I can make no hard and fast conclusion: although there is a trend
towards split narrow/latency and wide/throughput datapaths, there will
still be occasional winning attempts to unifiy. Much depends on the
evolution of power, frequency, and voltage management, in terms of

Why non-inclusive/non-exclusive caches?

Sheesh! I can't usually take all the time this article needed to write in one sitting.



[[Non-inclusive/non-exclusive (NI/NE)]] caches are used in
[[multilevel cache hierarchies]] that are, duh, neither [[inclusive]]
nor [[exclusive]].

I.e. there is no enforcement of either the [[cache inclusion
property]] nor the [[cache exclusion property]]. A cache line in an
inner cache may or may not be in an outer cache. You can't be sure.
If necessary, you must check.

Because there is no [[cache inclusion]], both the inner and outer
caches must be checked, i.e. snooped or probed, when responding to
requests from other processors.

= Historical Background - Accidental Inclusion in the Original [[P6]] =

Probably the most widely known processor that has an [[NI/NE]] cache
is the [[Intel P6 family]].

The original P6 had an external L2 cache, on a [[back-side bus]]. In
an ascii diagram (TBD, I need drawing in this wiki):

[[L2$]] <---> P([[I$,L1$]]) <---> [[FSB]]

In more detail

[[L2$]] <---------> [[SQ]] <---> [[FSB]]
| | |
v v v
[[I$]] [[P]] [[L1$]]

Most of the symbols used in the diagrams above are standard; as for
those that are not:

* [[SQ (super queue)]] - on the [[P6 Pentium Pro]] die, coordinated interaction with the [[Front side bus (FSB)]], the back side buss and the off-chip [[L2]], and [[DCU]], mainly interacting with the [[fill buffers (FB)]].

* [[FB (fill buffers)]] - a staging area under the control of the [[data cache unit (DCU)]] of the P6 cpu core, talking to the [[bus cluster]] and hence the [[SQ]], [[L2$]], and the outisde world via the [[FSB]], and the caches of the [[processor core]], the [[IFU]] and [[DCU]] arrays. Also talked directly to the processor core, to minimize latency.

The original P6 had a completely external [[L2$]]. Not just the
[[cache data array]]s, but also the [[cache tag array]]s.

Now, there were several reasons why the original P6 had [[NI/NE]] or
[[accidental inclusion]]:

== Backside L2$ ==

Because the L2$ was on a backside bus, it was not necessarily faster
to access than the L1 caches, the [[IFU]] and [[DCU]]. In fact, you
can easily imagine circumstances where it might have been slower to
[[snoop]] or [[probe]].

    Note; having the L2$ tags on-chip might have made it possible to probe the L2$ faster than the L1$.

== [[Deterministic Snoop Latency]] ==

One of the design goals of the original P6 was to have [[deterministic
snoop latency]] - to be able to respond within a relatively small
fixed number of cycles to an snoop from remote processors on the

[[Cache inclusion]]] would get in the way of this. You would first
have to prove the L2$, and then, if a hit, probe the L1$. This
naturally leans towards a [[variable snoop latency]], and slows the
snoop down.

In the long run I think that [[deterministic snoop latency]] must be
considered a worthy goal, but a dead-end. I think even the original
P6 gave up on it, providing it in most cases, but also provided a
snoop stall as a way of handling cases where it could not be provided.
Nevertheless, the goal of [[dererministic snoop latency]] influenced
the design.

== L2$ relatively large compared to the L1$ ==

The external, off-chip but inside package, L2$ for the original P6
Pentium Pro was 512Kib, 1Mib, with some versions as small as 256KiB.

The DCU or L1 data cache was die dieted over the life of the project
from 64KiB in our dreams to 16 or 8KiB.

Thus, the L2 was typically 32x bigger than the L1 data cache,
and similarly wrt the instruction cache.

Although Intel has experience with [[exclusive cache]]s from earlier
processors, with a 32:1 or 16:1 size ratio, there is little point in
using an exclusive cache to try to increase cache hit rate. The
increase in the hit rate from having circa 3-66% more data is small.
Most of the time the data in the L1$ is included in the L2$ buyy

This property Wen-Hann Wang called [[accidental inclusion]]. Wen-Hann
Wang was the leader of the P6 cache protocol design, and the inventor
(in his PhD) of [[cache inclusion]].

== Simplifying Design ==

In truth I must admit that we started P6 expecting to build an
[[inclusive cache]] hierarchy. However, Wen-Hann's performance data
showed that [[accidental inclusion]] because of the relatively large
L2$ removed most of the performance incentive for inclusion. And
moving to [[NI/NE]] considerably simplified the cache protocol,
removing many boundary conditions.

For example, how do you handle cases where a cache line is being made
dirty in the L1 exactly as it is being evicted from the L2? ... It is
not impossible to handle these cases. But [[accidental inclusion]]
made it just plain easier to leave them decoupled.

    By the way, at the time on P6 we did not really have a term for the cache inclusion properties of the P5 L2, or lack thereof. Wen-Hann called it [[accidental inclusion]], but that term has stuck only in Glew's memory. The term [[non-inclusive/non-exclusive (NI/NE)]] was created years later.

== [[Silent eviction of clean data]] ==

It is, of course, always necessary to perform a [[writeback]] when
[[evicting dirty data]].

When [[evicting clean data]], however, one can silently drop the clean
line. It is not always necessary to inform the outer cache.

Now, this isn't 100% linked to [[cache inclusion]] or [[NI/NE]]. One
can perform [[silent eviction of clean data]] in an [[inclusive
cache]]. However, doing so means that any [[inclusion tracking]] that
the outer cacheis performing, so that it knows which cachelines in the
outer cache are inside which inner cache(s), may be inaccurate.
I.e. [[possibly inaccurate inclusion tracking ]].

[[Accurate inclusion tracking]] is a [[slippery slope]]. It is not
necessary, but it has several nice properties. Worse, some people
implicitly assume [[accurate inclusion tracking]] in any [[inclsuive
cache]] design.

[[Accidental inclusion]] or [[NI/NE]] avoids this slippery slope
altogether. Which not only simplifies design, bit also avioids the
sort of incorrect assumptions that often arise on such a [[slippery slope]].

== Allocation in L1$ without L2$ interaction ==

[[Accidental inclusion]] enabled or made easier certain cache protocol
optimizations, in particular [[eliminating unnecesary RFOs]] for
full-line cache writes, without requiring L2 involvement.
Especially as used in [[P6 fast strings]].

This probably was not a design goal for most of the people involved,
but it was something I (Andy Glew) was happy to take advantage of.

The basic observation is that a sizeable fraction of all cache lines
are completely overwritten, without any of the old data being read.
Therefore, reading the old data via an [[RFO (read for ownership)]]
bus request is wasted bandwidth.

In operations such as [[block fill]], e.g. zeroing a page of memory,
50% of the data bus bandwidth is wasted by such unnecessary RFOs. In
[[block copy]], 33%, (Assuming all miss the cache - not that bad an
assumption for an 8 or 16Hob L1$, but also not unreasonable even on
many larger caches, for many [[block memory operations]].)

Even in non-block memory code, unnecessary RFOs occupy a significant
fraction of data bus bandwidth. I have measured speedups of 10-20% on
many simulated workloads, and even on a few [[FIB-edited chips]].

    The biggest challenge with [[eliminating unnecessary RFOs]] is maintaining [[memory ordering]], in particular [[store ordering]]. It is easy to do on a [[weakly ordered]] system. I have invented several techniques for maintaining memory ordering, some of which are [[NDA]], but some of which can be discussed here. When I get around to writing them up. In my [[copious free time]].

The original P6 eliminated unnecessary RFOs using the [[deferred
invalidation cache protocol]] for [[fast strings]] [[block memory
operations]]. The bandwidth benefit was clear; unfortunately, the
overhead of microcode, in the absence of microcode branch prediction,
made fast strings not always the best approach. Since the original P6
[[fast strings]] have been hit and miss: when somebody in microcode
gets around to tuning them they can be a win, but if untuned they may
be a loss compared to the best hand rolled assembly code.

P6's [[WC memory type]] also eliminated unnecessary RFOs, but was not
allocated in the cache. Similarly the [[SSE streaming stores]] added
in later processors.

The [[deferred invalidate cache protocol]] continued in use all the
way to Intel's Nehalem processor, the last of the P6 family. Nehalem,
however, added an inclusive [[LLC]] with snoop filtering, killing the
deferred invalidate cache protocol.

So much for history. The point remains: not having an inclusive cache
enables certain cache protocol optimizations, such as [[eliminating
unnecessary RFOs]], that are much harderto buildin an inclusive cache

== Optionally Removing the L2$ ==

The [[NI/NE]] cache structure makes it easier to optionally disable or
remove the [[L2$]]. This option has been explored several times.

Now, it is possible to do this with an inclusive cache structure by
"faking out" the absent L2$, always returning "Yes, the cache line is
included in all possible internal caches", such as the instruction and
data and other caches.

But it is very easy, in an inclusive cache hierarchy, to slip dfown
the slippery slope and assume that the L2$ is present. And then to
take advantage of the L2$ being present in a way that breaks when the
L2$ is absent. [[NI/NE]] seeems to avoid this.

== Enabling [Funky Caches]] and Decoupling Associativity ==

In an [[inclusive cache]] system, one tends to want the associativity
of the outer, inclusive, cache to be greater than or equal to the sum
of the associativities of the inner caches it "covers".

E.g. if you have a 16 way associative outer cache that is inclusive of
8 inner caches, all of 4 way associativity, and if the usual indexing
of the cache using bitfields is performed (which usually implies a
1:many relationship between inner sets and outer sets), then it is
possible that for some set So of the outer cache mapping to sets Si of
each of the inner caches, each inner cache may want to have a distimct
line in each of the 4 ways. Implying that the inner caches may "want"
to have 8*4 distinct lines. Which exceeds the 16 way associativity of
the outer cache.

[[NI/NE]] caches do not have this constraint on associativity.

Furthermore, when playing the [[Funky Cache Rag]], exploring
structures such as [[victim caches]] or [[prefetch caches]] or
specialized caches for different data types, [[cache inclusion]] once
again gets in the way. Often these [[Funky Caches]] are fully
assiociative. They do a great job of reducing glass jaws related to
cache associativity - e.g. a 16 or 32 entry victim cache makes a 4 way
cache look like it has a much higher effective associativity. But a
32 entry fully associative victim or prefetch cache may only be able
to use 16 entries, if all entries map to the same way in the outer
cache. [[Cache inclusion]] prevents some of the worst [[glass jaws]]
that these caches can fix, from being fixed.

Now, these interactions are not necessarily large, at least in average
effect, although they can be a fertiole source of [[glass jaws]].

However, they do couple the design of the outer cache and the inner
cache. They mean, for example, that the outer cache design team
cannot simply change associativity without checking with the inner
cache design team, which may be making assumptions based on the
earlier outer cache design team [[POR]] for associativity. It means
that the outer and inner caches are more tightly coupled. Which can
lead to more efficient designs. But which can also make it slower and
harder to innovate.

== Backwards Invalidation Traffic ==

There are two main methods to maintain [[cache inclusion]]:
[[backwards invalidation]], and the considerablly less practical
[[outer cache victim identification by inner cache]]. I will ignore
the latter.

[[Backwards invalidation]]s are simple and straightforward. But they
also consume power, affect performance on an outer to inner cache
interconnect that may be saturated. And, in any case, are an
incremental source of complexity.

[[NI/NE]] cache hierarchies eliminate [[backwards invalidation]]s.

Now, in the 2011 Intel processors that motivated this article,
Sandybridge, with an inclusive snoop filtering [[LLC]] but a [[NI/NE]]
[[MLC]], backwwards invalidations are still required for LLC
evictions. But not for MLC evictions.

Let's talk some scenarios:

In workloads that use mainly read-only data, [[NI/NE]] allows [[silent eviction of dirty data]], but [[cacheinclusion]] requires BIs.

In workloads that use a lot of shared data, there may be multiple BIs for every inclusive cache eviction.

Etc. etc.

It is by no means certain that power saved by not [[probe filtering]] is not outweighed by the power saved by eliminating BIs.
It all depends on the type of multiproceessing in the system.
If the coherency traffic from beyond the outer cache is small relative to the cache thrashing inside the system,
e.g. if you have 16 cores crunching on large data serts, but no other multicore CPU chips and relatively small I/O,
BI traffic may outweigh snoop filttering.
But if you are building a very large shared memory system with many other CPUs and much cache coherent I/O traffic,
then inclusion may win.

= Are [[NI/NE]] caches still important? =

Obviously, since [[NI/NE]] caches are still in use in the Intel
Sandybridge processors being released at the time of writing in 2011,
they are still important in a practical sense? But is this just
[[legacy computer architecture]] dating back to the original P6? If
they were desiged from scratch, would [[NI/NE]] still be built

Maybe... lacking access to full simulation results, it is hard to say.

However, many of the advantages of [[NI/NE]] relate to making design
easier, decoupling inner and outer cache design. This is probably one
source of the longevity of the [[P6 microarchitecture]], which is now
coming to an end with Sandybridge.

Certainly, the old advantage of [[eliminating unnecessary RFOs]] no
longer stands. But that was always one of the most minor advantagesof

Much depends on the system. [[Probe filtering]] [[coherency traffic]]
makes a big difference in large multiprocessor systems with many outer
caches. It also helps with large amounts of high bandwidth cache
coherent I/O, as sometimes arises with ill-tuned [[discrete GPU]].
(Most discrete GPUs are tuned to use [[non-cache coherent I/O]], if
accessing [[UMA]] or processor memory at all.)

But as multicore systems get more cores,
and GPUs are largely integrated,
there may be less bandwidth on the other side of the [[outermost cache]].

== Where does cache inclusion win? ==

=== Snoop or probe filtering ===

One big win for cache inclusion is [[snoop filtering]].

However, such [[probe filtering]] does not necessarily require that
the cache data be inclusive. A data-less [[probe filter]] or
[[ownership cache]], a structure that allows ownership to be tracked,
and which permits lines to be exclsuively owned even though in main
memory may provide the benefits of [[probe filtering]] without the
costs of cache inclusion on a relatively small data cache. At the
cost of a new [[hardware data structure]], the [[ownership cache]].

Basically, the ownership cache maintains the cache inclusion property,
not necessarily the data cache. Since the ownership cache maintains
only a few state bits per cache line, the ownership cache can covedr
much more memory than an ordinary data cache. It is therefore less
liable to the issues of cache associativity.

=== Power management ===

Probably the most important win for [[cache inclusion]] is wrt power
management - really just an elaboration of [[snoop filtering]].

If you have [[cache inclusion]] you can power down all of the inner
caches, putting them into a low power mode that maintains state, but
which does not snoop.

Whereas if you have a [[NI/NE]] cache, you either must keep snooping
the inner caches, or you must flush the inner caches when going into
such a [[no-snoop cache mode]].

Now, you always needto be able to do such cache flushes. But they
take time, and if they are not required you can enter tye low power
mode more quickly, and more often, and thereby save more power.

Note that an inclusive LLC in combination with a NI/NE [[MLC]] and [[FLCs]]
gives the best of both worlds in this regard.
Note also that a datalerss [[snoop filter]] also works.

= Conclusion =

Many people, taught in school about [[inclusive caches]], are
surprised to learn how common and important [[exclusive caches]] and
[[non-inclusive/non-exclusive caches]] are, or at least have been, in

It is by no means clear that any approiach is always best. All 3
approaches have pros and cons, and win in some situations.

Cache replacement - tweaks and variations

{{Terminology Term}}

[[Cache replacement]] or [[cache victim selection]] is the method one uses to choose which cache line will be overwritten or replaced, i.e. victimized, when a new cache line is pulled in.

Common algorithms include
* [[Random replacement]]
* [[LRU (least recently used)]]
** approximations to [[LRU]:
*** [[pseudo-LRU]] or [[tree LRU]]
*** [[the clock algorithm]]

= Minor Tweaks =

== When is the cache [[LRU information]] or [[cache usage information]] updated? ==

E.g. do you update it speculatively, for instructions that may be on a [[branch misprediction wrong path]], thereby corrupting the LRU?

Should you update at retirement? (Probably not, but just in case.)
:By the way, here is a reason to update at retirement: so that you get more [[deterministic behavior]], which eases validation. However, this only works if you do not do [[speculative cache misses]].

Perhaps you should only update it for slightly speculative instructions:
* e.g. do not update for highly speculative [[SpMT]] threads?
* e.g. update only for isntructions that have not retired, but for which all earlier branches have been resolved. (Such instructions are stil speculative - there may be a page fault or exception or interrupt - but they7 are only [[slightly speculative]]).

== What requests update the LRU information? ==

Should all requests, reads and writes, update the LRU information equally?

Many have proposed [[non-temporal hint bits]] in instructions, to say "ignore this access".

It is an open issue whether prefetches, whether initiated by [[prefetch instruction]]s or by [[hardware prefetchers]],
should update the LRU.

Multilevel caches:
* update the LRU on all accesses
* on misses
* on [[LRU leak-through]] from the inner cache
* only on capacity evictions from the inner cache (suggested in comp.arch by EricP on 2/25/2011)

By the way, the idea of updating the cache usage information only on capacity evictions from the inner cache
exposes several issues:

    Updating the LRU information and advancing the LRU pointer are two different issues. Invalidation traffic may result in several coherency cache misses between capacity misses. It would be bad to keep choosing the same victim. Oftentimes coherency misses do not need a victim chosen: they fill into the empty or stale or non-present line left behind by the invalidation. (May not happen in all systems.) Dirty writebacks naturally notify the outer cache of a capacity replacement. However, replacing a clean line may not naturally require such notiification: i.e. we may have [[silent replacement of clean cache lines]]. This may require [[LRU leakthrough]] so that the outrr cache can track.

== When is the victim chosen? ==

Should you choose the victim at the time the cache is missed,
or at the time the data to be placed in the cache line has returned?

It may not matter on an in-order machine. However, on an out-of-order machine, chooising the victim early may raise isssues, such as what should happen if too many cache misses to the same set occur
- i.e. if the victim itself needs to be victimized before the first victim's replacement has arrived.

One advantage of choosing the victim early is that you may be able to send the data for the [[cache line fill]]
returning directly to where it belongs in the cache - you may not have to stage it through a [[fill buffer]],
and you may be able to avoid fairly expensive [[fill buffer forwarding]] logic.
I.e. you may be able to have [[data-less fill buffers]].

    Half baked Idea: choose the victim early. But also allocate a fill buffer. At [[fill time]], determine if the early victim choice is still accurate. If not, write to the fill buffer. (I call this a half baked idea because it doesn't really solve the problem of wanting to avoid data fill buffers. Elaboration: choose the victim early. Allocate a [[address and control fill buffer]], but do not allocate a [[data fill buffer]]. If the victim is thrashed, allocate a [[data fill; buffer]] (which may simply be a [[spill buffer]] allocated circularly. At fill time, choose.

== Biasing the victim choice ==

* Prefer invalid lines, rather than replacing valid lines.
* Prefer clean lines to dirty lines
:: Thereby avoiding [[dirty writeback]] traffic.

* in multilevel caches, prefer lines that are in none of, or the fewest, inner caches

* prefer to replace lines containing data to lines containing instructions
** or vice versa - although oftentimes [[I$]] misses are more expensive than [[D$]] misses

* extend the above to more data types - integer versus FP (FP is often accessed in a cache thrashing manner)

= Better cache replacement algorithms =

There has been much work on trying to adjust these algorithms.

Ideally, we know that [[Belady's algorithm]] is optimal for many assumptions.
It amounts to "replacing the cache line that will be used furthest in the future".
It must be adjusted when there are non-uniform costs.

    There have been attempts to [[approximate Belady cache replacement using speculation and lookahead]]. E.g. do not replace a line that is used in an instruction window. This works better when you have a really large instruction window, such as might be provided by [[SpMT]]. It also may not be necessary if the [[LRU bits]] or [[cache usage information]] is updated speculatively. Others have attempted to unroll memory access pattern predictors, to get a list of predicted accesses, against which a Belady query can be made

See [[Victim Choice for Multilevel and Shared Caches]] for a discussion of issues in multilevel cache victim selection.
One of the main issues is that in a [[multilevel cache hierarchy]] the LRU bits of the outer caches are not adjusted
by accesses to the inner caches, so choosing a victim based purely on LRU bits updated only by accesses sent to the outer cache
is often not good.
Many proposals [[leak-through LRU]] information, to allow the outer cache to track the inner LRU.

It is well known that certain access patterns are not well suited to LRU cache replacement.
For example, circularly accessing N lines, in a cache of M lines, M < N, is better suited by MRU cache replacement than LRU.
Many have proposed to exploit this, e.g. by [[non-temporal]] hint bits attached to instructions,
or by predictors that attempt to identify such non-temporal cache access patterns.

Reading List for Computer Architecture (seed)



I am often asked for a reading list for computer architecture.

Unfortunately, there aren't many good books in computer architecture.

= [[Hennessy and Patterson]] =

[[Hennessy and Patterson]] is the current must-read, since its first publication
but that is not really a comprehensive book or survey of the field,
as a text book suitable for classes, espousing a quantitative design philisophy.

In fact, one of the reasons I am trying to create the comp.arch wiki is as an antidote to Hennessy and Patterson.
I want to collect a survey or grab bag of many techniques not mentioned in [[H&P]],
or potentially deprecated by them.
Many students raised on H&P, particularly the early versions, developed a common attituide
- they were [[RISC bigot]], and deprecated any advanced microarchitecture such as [[OOO]],
since they felt sure that the [[simple 5-stage pipeline]] was the be-all and end-all.
It is not fair to ascribe these viewpoints to Professors Hennessy and Patterson
- their views are much more nuanced
- but nevertheless this viewpoint was common.

= Random Grab-bag =

This list was assembled from a recent email to an undergraduate (2011).
TBD: collect past recommended reading lists.

Hennessy and Patterson is required reading. However, I don't much like any of the textbooks - which is why I am working on my wiki, sort of as an antidote.

You pretty much have to read the conferences and journals:
* [[ISCA]]
* [[ASPLOS]]
* [[Micro]]
* [[HPCA]]
* [[Hotchips]]
* [[ISSCC]] - usually lower level than computer architecture, but oftentimes the only information you can get about industry chips

I suggest reading every issue of [[Microprocessor Report]] you can, although I warn you that there was a period when MPR was owned by EE Times when it was not very good. Back in the 80s and 90s, MPR was the bible, by installments; and since The Linley Group bought it back it has become much better than it was when EE Times owned it.

* the old [[Microprocessor Forum]] conference

As for reading the conferenves and journals, a good place to start would be the "Best of" issues:

* 25 Years of ISCA, Selected Papers, ed Guri Sohi

* 20 years of ACM SIGPLAN conferencvve on PLDA, 1979 - 1999, ed McKinley.

Other books on my shelf:

* Mead Conway VLSI - old, but classic. Read it.

* Laura Prince, Semiconductor Memories -old; I wish there was a better book on DRAM. Probably better to read papers.

* Bell and Newell,
* Bell Sieworik and Newell,
:: Computer Structures
::- this was the classsic when I was in school. Very old now.

* Mike Flynn, Computer Architecture
* Blauuw and Brooks, Computer Architecture

* Pfister - In search of clusters.

Many many instruction set manuals

* the IBM 360 Principles of Operation

* Hacker's Delight

* Inside the AS/400

Organick's books on computer hardware
* TBD: list

* The Soul of a New Machine

Use a library and used book store. I bought, and still buy, a lot of 2nd hand books.

* read manuals on http://bitsavers.org

= Best of Lists and Collections =

* papers that won the ACM SIGARCH and IEEE-CS TCCA ISCA Influential Paper Award, http://www.sigarch.org/influential_paper.html

= Bibliography Surfing =

Many students, especially graduate students assemble personal bibliographies and reference lists of papers they encountered in their research.
The bravest students may annotate the references, with discussions of value.
Many such lists can be found on the web.
Use them shamelessly for inspiration - but beware that your opinions may not agree with the bibliographers./

Unfortunately, more experienced professionals, professors, etc.
have usually learned to only provide positive recommendations,
and to never make public anti-recommendations
about authors who may eventually be recommending funding for the reviewer.

Still more experienced professionals may recommend papers that they dislike,
because not mentioning them would be an obvious dis.