The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 27, 2011

Why non-inclusive/non-exclusive caches?

Sheesh! I can't usually take all the time this article needed to write in one sitting.



[[Non-inclusive/non-exclusive (NI/NE)]] caches are used in
[[multilevel cache hierarchies]] that are, duh, neither [[inclusive]]
nor [[exclusive]].

I.e. there is no enforcement of either the [[cache inclusion
property]] nor the [[cache exclusion property]]. A cache line in an
inner cache may or may not be in an outer cache. You can't be sure.
If necessary, you must check.

Because there is no [[cache inclusion]], both the inner and outer
caches must be checked, i.e. snooped or probed, when responding to
requests from other processors.

= Historical Background - Accidental Inclusion in the Original [[P6]] =

Probably the most widely known processor that has an [[NI/NE]] cache
is the [[Intel P6 family]].

The original P6 had an external L2 cache, on a [[back-side bus]]. In
an ascii diagram (TBD, I need drawing in this wiki):

[[L2$]] <---> P([[I$,L1$]]) <---> [[FSB]]

In more detail

[[L2$]] <---------> [[SQ]] <---> [[FSB]]
| | |
v v v
[[I$]] [[P]] [[L1$]]

Most of the symbols used in the diagrams above are standard; as for
those that are not:

* [[SQ (super queue)]] - on the [[P6 Pentium Pro]] die, coordinated interaction with the [[Front side bus (FSB)]], the back side buss and the off-chip [[L2]], and [[DCU]], mainly interacting with the [[fill buffers (FB)]].

* [[FB (fill buffers)]] - a staging area under the control of the [[data cache unit (DCU)]] of the P6 cpu core, talking to the [[bus cluster]] and hence the [[SQ]], [[L2$]], and the outisde world via the [[FSB]], and the caches of the [[processor core]], the [[IFU]] and [[DCU]] arrays. Also talked directly to the processor core, to minimize latency.

The original P6 had a completely external [[L2$]]. Not just the
[[cache data array]]s, but also the [[cache tag array]]s.

Now, there were several reasons why the original P6 had [[NI/NE]] or
[[accidental inclusion]]:

== Backside L2$ ==

Because the L2$ was on a backside bus, it was not necessarily faster
to access than the L1 caches, the [[IFU]] and [[DCU]]. In fact, you
can easily imagine circumstances where it might have been slower to
[[snoop]] or [[probe]].

    Note; having the L2$ tags on-chip might have made it possible to probe the L2$ faster than the L1$.

== [[Deterministic Snoop Latency]] ==

One of the design goals of the original P6 was to have [[deterministic
snoop latency]] - to be able to respond within a relatively small
fixed number of cycles to an snoop from remote processors on the

[[Cache inclusion]]] would get in the way of this. You would first
have to prove the L2$, and then, if a hit, probe the L1$. This
naturally leans towards a [[variable snoop latency]], and slows the
snoop down.

In the long run I think that [[deterministic snoop latency]] must be
considered a worthy goal, but a dead-end. I think even the original
P6 gave up on it, providing it in most cases, but also provided a
snoop stall as a way of handling cases where it could not be provided.
Nevertheless, the goal of [[dererministic snoop latency]] influenced
the design.

== L2$ relatively large compared to the L1$ ==

The external, off-chip but inside package, L2$ for the original P6
Pentium Pro was 512Kib, 1Mib, with some versions as small as 256KiB.

The DCU or L1 data cache was die dieted over the life of the project
from 64KiB in our dreams to 16 or 8KiB.

Thus, the L2 was typically 32x bigger than the L1 data cache,
and similarly wrt the instruction cache.

Although Intel has experience with [[exclusive cache]]s from earlier
processors, with a 32:1 or 16:1 size ratio, there is little point in
using an exclusive cache to try to increase cache hit rate. The
increase in the hit rate from having circa 3-66% more data is small.
Most of the time the data in the L1$ is included in the L2$ buyy

This property Wen-Hann Wang called [[accidental inclusion]]. Wen-Hann
Wang was the leader of the P6 cache protocol design, and the inventor
(in his PhD) of [[cache inclusion]].

== Simplifying Design ==

In truth I must admit that we started P6 expecting to build an
[[inclusive cache]] hierarchy. However, Wen-Hann's performance data
showed that [[accidental inclusion]] because of the relatively large
L2$ removed most of the performance incentive for inclusion. And
moving to [[NI/NE]] considerably simplified the cache protocol,
removing many boundary conditions.

For example, how do you handle cases where a cache line is being made
dirty in the L1 exactly as it is being evicted from the L2? ... It is
not impossible to handle these cases. But [[accidental inclusion]]
made it just plain easier to leave them decoupled.

    By the way, at the time on P6 we did not really have a term for the cache inclusion properties of the P5 L2, or lack thereof. Wen-Hann called it [[accidental inclusion]], but that term has stuck only in Glew's memory. The term [[non-inclusive/non-exclusive (NI/NE)]] was created years later.

== [[Silent eviction of clean data]] ==

It is, of course, always necessary to perform a [[writeback]] when
[[evicting dirty data]].

When [[evicting clean data]], however, one can silently drop the clean
line. It is not always necessary to inform the outer cache.

Now, this isn't 100% linked to [[cache inclusion]] or [[NI/NE]]. One
can perform [[silent eviction of clean data]] in an [[inclusive
cache]]. However, doing so means that any [[inclusion tracking]] that
the outer cacheis performing, so that it knows which cachelines in the
outer cache are inside which inner cache(s), may be inaccurate.
I.e. [[possibly inaccurate inclusion tracking ]].

[[Accurate inclusion tracking]] is a [[slippery slope]]. It is not
necessary, but it has several nice properties. Worse, some people
implicitly assume [[accurate inclusion tracking]] in any [[inclsuive
cache]] design.

[[Accidental inclusion]] or [[NI/NE]] avoids this slippery slope
altogether. Which not only simplifies design, bit also avioids the
sort of incorrect assumptions that often arise on such a [[slippery slope]].

== Allocation in L1$ without L2$ interaction ==

[[Accidental inclusion]] enabled or made easier certain cache protocol
optimizations, in particular [[eliminating unnecesary RFOs]] for
full-line cache writes, without requiring L2 involvement.
Especially as used in [[P6 fast strings]].

This probably was not a design goal for most of the people involved,
but it was something I (Andy Glew) was happy to take advantage of.

The basic observation is that a sizeable fraction of all cache lines
are completely overwritten, without any of the old data being read.
Therefore, reading the old data via an [[RFO (read for ownership)]]
bus request is wasted bandwidth.

In operations such as [[block fill]], e.g. zeroing a page of memory,
50% of the data bus bandwidth is wasted by such unnecessary RFOs. In
[[block copy]], 33%, (Assuming all miss the cache - not that bad an
assumption for an 8 or 16Hob L1$, but also not unreasonable even on
many larger caches, for many [[block memory operations]].)

Even in non-block memory code, unnecessary RFOs occupy a significant
fraction of data bus bandwidth. I have measured speedups of 10-20% on
many simulated workloads, and even on a few [[FIB-edited chips]].

    The biggest challenge with [[eliminating unnecessary RFOs]] is maintaining [[memory ordering]], in particular [[store ordering]]. It is easy to do on a [[weakly ordered]] system. I have invented several techniques for maintaining memory ordering, some of which are [[NDA]], but some of which can be discussed here. When I get around to writing them up. In my [[copious free time]].

The original P6 eliminated unnecessary RFOs using the [[deferred
invalidation cache protocol]] for [[fast strings]] [[block memory
operations]]. The bandwidth benefit was clear; unfortunately, the
overhead of microcode, in the absence of microcode branch prediction,
made fast strings not always the best approach. Since the original P6
[[fast strings]] have been hit and miss: when somebody in microcode
gets around to tuning them they can be a win, but if untuned they may
be a loss compared to the best hand rolled assembly code.

P6's [[WC memory type]] also eliminated unnecessary RFOs, but was not
allocated in the cache. Similarly the [[SSE streaming stores]] added
in later processors.

The [[deferred invalidate cache protocol]] continued in use all the
way to Intel's Nehalem processor, the last of the P6 family. Nehalem,
however, added an inclusive [[LLC]] with snoop filtering, killing the
deferred invalidate cache protocol.

So much for history. The point remains: not having an inclusive cache
enables certain cache protocol optimizations, such as [[eliminating
unnecessary RFOs]], that are much harderto buildin an inclusive cache

== Optionally Removing the L2$ ==

The [[NI/NE]] cache structure makes it easier to optionally disable or
remove the [[L2$]]. This option has been explored several times.

Now, it is possible to do this with an inclusive cache structure by
"faking out" the absent L2$, always returning "Yes, the cache line is
included in all possible internal caches", such as the instruction and
data and other caches.

But it is very easy, in an inclusive cache hierarchy, to slip dfown
the slippery slope and assume that the L2$ is present. And then to
take advantage of the L2$ being present in a way that breaks when the
L2$ is absent. [[NI/NE]] seeems to avoid this.

== Enabling [Funky Caches]] and Decoupling Associativity ==

In an [[inclusive cache]] system, one tends to want the associativity
of the outer, inclusive, cache to be greater than or equal to the sum
of the associativities of the inner caches it "covers".

E.g. if you have a 16 way associative outer cache that is inclusive of
8 inner caches, all of 4 way associativity, and if the usual indexing
of the cache using bitfields is performed (which usually implies a
1:many relationship between inner sets and outer sets), then it is
possible that for some set So of the outer cache mapping to sets Si of
each of the inner caches, each inner cache may want to have a distimct
line in each of the 4 ways. Implying that the inner caches may "want"
to have 8*4 distinct lines. Which exceeds the 16 way associativity of
the outer cache.

[[NI/NE]] caches do not have this constraint on associativity.

Furthermore, when playing the [[Funky Cache Rag]], exploring
structures such as [[victim caches]] or [[prefetch caches]] or
specialized caches for different data types, [[cache inclusion]] once
again gets in the way. Often these [[Funky Caches]] are fully
assiociative. They do a great job of reducing glass jaws related to
cache associativity - e.g. a 16 or 32 entry victim cache makes a 4 way
cache look like it has a much higher effective associativity. But a
32 entry fully associative victim or prefetch cache may only be able
to use 16 entries, if all entries map to the same way in the outer
cache. [[Cache inclusion]] prevents some of the worst [[glass jaws]]
that these caches can fix, from being fixed.

Now, these interactions are not necessarily large, at least in average
effect, although they can be a fertiole source of [[glass jaws]].

However, they do couple the design of the outer cache and the inner
cache. They mean, for example, that the outer cache design team
cannot simply change associativity without checking with the inner
cache design team, which may be making assumptions based on the
earlier outer cache design team [[POR]] for associativity. It means
that the outer and inner caches are more tightly coupled. Which can
lead to more efficient designs. But which can also make it slower and
harder to innovate.

== Backwards Invalidation Traffic ==

There are two main methods to maintain [[cache inclusion]]:
[[backwards invalidation]], and the considerablly less practical
[[outer cache victim identification by inner cache]]. I will ignore
the latter.

[[Backwards invalidation]]s are simple and straightforward. But they
also consume power, affect performance on an outer to inner cache
interconnect that may be saturated. And, in any case, are an
incremental source of complexity.

[[NI/NE]] cache hierarchies eliminate [[backwards invalidation]]s.

Now, in the 2011 Intel processors that motivated this article,
Sandybridge, with an inclusive snoop filtering [[LLC]] but a [[NI/NE]]
[[MLC]], backwwards invalidations are still required for LLC
evictions. But not for MLC evictions.

Let's talk some scenarios:

In workloads that use mainly read-only data, [[NI/NE]] allows [[silent eviction of dirty data]], but [[cacheinclusion]] requires BIs.

In workloads that use a lot of shared data, there may be multiple BIs for every inclusive cache eviction.

Etc. etc.

It is by no means certain that power saved by not [[probe filtering]] is not outweighed by the power saved by eliminating BIs.
It all depends on the type of multiproceessing in the system.
If the coherency traffic from beyond the outer cache is small relative to the cache thrashing inside the system,
e.g. if you have 16 cores crunching on large data serts, but no other multicore CPU chips and relatively small I/O,
BI traffic may outweigh snoop filttering.
But if you are building a very large shared memory system with many other CPUs and much cache coherent I/O traffic,
then inclusion may win.

= Are [[NI/NE]] caches still important? =

Obviously, since [[NI/NE]] caches are still in use in the Intel
Sandybridge processors being released at the time of writing in 2011,
they are still important in a practical sense? But is this just
[[legacy computer architecture]] dating back to the original P6? If
they were desiged from scratch, would [[NI/NE]] still be built

Maybe... lacking access to full simulation results, it is hard to say.

However, many of the advantages of [[NI/NE]] relate to making design
easier, decoupling inner and outer cache design. This is probably one
source of the longevity of the [[P6 microarchitecture]], which is now
coming to an end with Sandybridge.

Certainly, the old advantage of [[eliminating unnecessary RFOs]] no
longer stands. But that was always one of the most minor advantagesof

Much depends on the system. [[Probe filtering]] [[coherency traffic]]
makes a big difference in large multiprocessor systems with many outer
caches. It also helps with large amounts of high bandwidth cache
coherent I/O, as sometimes arises with ill-tuned [[discrete GPU]].
(Most discrete GPUs are tuned to use [[non-cache coherent I/O]], if
accessing [[UMA]] or processor memory at all.)

But as multicore systems get more cores,
and GPUs are largely integrated,
there may be less bandwidth on the other side of the [[outermost cache]].

== Where does cache inclusion win? ==

=== Snoop or probe filtering ===

One big win for cache inclusion is [[snoop filtering]].

However, such [[probe filtering]] does not necessarily require that
the cache data be inclusive. A data-less [[probe filter]] or
[[ownership cache]], a structure that allows ownership to be tracked,
and which permits lines to be exclsuively owned even though in main
memory may provide the benefits of [[probe filtering]] without the
costs of cache inclusion on a relatively small data cache. At the
cost of a new [[hardware data structure]], the [[ownership cache]].

Basically, the ownership cache maintains the cache inclusion property,
not necessarily the data cache. Since the ownership cache maintains
only a few state bits per cache line, the ownership cache can covedr
much more memory than an ordinary data cache. It is therefore less
liable to the issues of cache associativity.

=== Power management ===

Probably the most important win for [[cache inclusion]] is wrt power
management - really just an elaboration of [[snoop filtering]].

If you have [[cache inclusion]] you can power down all of the inner
caches, putting them into a low power mode that maintains state, but
which does not snoop.

Whereas if you have a [[NI/NE]] cache, you either must keep snooping
the inner caches, or you must flush the inner caches when going into
such a [[no-snoop cache mode]].

Now, you always needto be able to do such cache flushes. But they
take time, and if they are not required you can enter tye low power
mode more quickly, and more often, and thereby save more power.

Note that an inclusive LLC in combination with a NI/NE [[MLC]] and [[FLCs]]
gives the best of both worlds in this regard.
Note also that a datalerss [[snoop filter]] also works.

= Conclusion =

Many people, taught in school about [[inclusive caches]], are
surprised to learn how common and important [[exclusive caches]] and
[[non-inclusive/non-exclusive caches]] are, or at least have been, in

It is by no means clear that any approiach is always best. All 3
approaches have pros and cons, and win in some situations.

No comments: