Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, April 01, 2011

Reset: Hard, Soft, Cold, Warm

http://semipublic.comp-arch.net/wiki/Reset:_Hard,_Soft,_Cold,_Warm
= The need for RESET after REST at power-on =

In the beginning there was [[RESET]]: a signal asserted while powering on, deasserted when power and logic levels were stable, so that the circuit could be initialized.

Soon thereafter, or even before, there was POWEROK:
* !POWEROK and RESET => do not attempt operation
* POWEROK and RESET => power good, now do initialization
* POWEROK and !RESET => initialization done, running.
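As a minimal sketch of that decode - the pin names and the enum below are illustrative only, not taken from any datasheet:

    /* Hypothetical decode of the POWEROK/RESET combination described above. */
    enum power_state { POWER_NOT_READY, INITIALIZING, RUNNING };

    enum power_state decode_reset_pins(int powerok, int reset)
    {
        if (!powerok)
            return POWER_NOT_READY;   /* !POWEROK: do not attempt operation         */
        if (reset)
            return INITIALIZING;      /* POWEROK and RESET: power good, initialize  */
        return RUNNING;               /* POWEROK and !RESET: initialization done    */
    }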

But let us not get obsessed by the details of signalling: I'm okay, you're okay, I've initialized and you are ready.
Moving forward: ...

Eventually people realized that they wanted to be able to restore the state of the system to that right after RESET, without going through the full power on sequence.
Hence the concept of [[soft reset]] was born.
On Intel x86 systems the INIT pin or message can be considered to be approximately a [[soft reset]],
with the old power-on reset constituting a [[hard reset]].

But [[soft reset]]s like INIT cannot recover from all errors. Sometimes a computer is truly hung, and a [[power cycle]] is necessary.
Or, if not a power cycle, then an assertion of the RESET signal,
so that all of the rest of the system is reinitialized.

Did I mention that, perhaps before [[soft reset]], people would build circuits that could interrupt power with a relay, and thus provoke a power on [[hard reset]] under software control?

Trouble is, after a [[hard reset]] the system state might be unreliable, or might be initialized to a true reset state, such as all zeros.
How can you then distinguish a true [[power on reset]] from a [[hard reset]] invoked under software control?
Perhaps used to recover from a system hang?

What you need is a softer hard reset - a [[warm reset]] that acts like a [[hard reset]], asserting the RESET signal to the rest of the system, I/O devices et al,
so that the state of the machine is as close as possible to that after a true power-on [[cold reset]].
But where power persists across the [[warm reset]],
so that at least certain status registers can be reliably read.

At first the state that persisted across such a [[warm reset]] might reside only in a battery-backed unit near the reset state machine.
But when the amount of state that you want to persist grows large,
e.g. the [[MCA (Machine Check Architecture)]] error status log registers,
state inside the [[CPU]] may be allowed, indeed, required, to persist across such a [[warm reset]].
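As a hedged sketch of how boot firmware might exploit such persistent state - the register names, offsets, and encodings below are invented for illustration, not taken from any real chipset or MCA specification:

    /* Hypothetical boot-time check: did we arrive here via a true power-on
       (cold) reset, or via a software/watchdog-invoked warm reset?
       RESET_REASON_REG, MCA_STATUS_BASE and read_platform_reg() are stand-ins. */
    #include <stdint.h>
    #include <stdio.h>

    #define RESET_REASON_REG   0x100u
    #define MCA_STATUS_BASE    0x200u
    #define RESET_REASON_WARM  0x1u

    /* Stub standing in for a real platform register read. */
    static uint32_t read_platform_reg(uint32_t reg)
    {
        return (reg == RESET_REASON_REG) ? RESET_REASON_WARM : 0xDEADBEEFu;
    }

    void early_boot_reset_handling(void)
    {
        if (read_platform_reg(RESET_REASON_REG) == RESET_REASON_WARM) {
            /* Warm reset: power persisted, so sticky MCA error logs survived;
               harvest them before anything clears them. */
            printf("warm reset: MCA status = 0x%08x\n",
                   (unsigned)read_platform_reg(MCA_STATUS_BASE));
        } else {
            /* Cold (power-on) reset: sticky registers are not trustworthy;
               initialize them rather than reading them. */
            printf("cold reset: initializing status registers\n");
        }
    }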

= A possible sequence of progressively harder RESETs for error recovery =

I am sure that you can see that [[cold reset]] versus [[warm reset]] and [[hard reset]] versus [[soft reset]] are not necessarily discrete points but,
as usual, points on a spectrum -
a list of possible reset mechanisms that is not necessarily even one-dimensional or totally ordered.

If an error is detected,
e.g. if a processor stops responding:
* first you might try sending it a normal interrupt, of progressively higher priority
* then you might try sending it an [[NMI (Non-Maskable Interrupt)]]
** then any of the flavors of [[even less maskable non-maskable interrupt]]s, such as Intel's [[SMI (System Management Interrupt)]], or some sort of [[VMM]] or [[hypervisor]] interrupt
* then you might try sending a [[soft reset]] message like INIT to the apparently hung processor
* this failing, you may try to do a [[warm reset]] of the hung processor
** although by this time you probably want to reset all of the processors and I/O that are tightly bound to the hung processor
** exactly how you define such a "reset domain" is system specific, although it is often the same as a [[shared memory cache coherency domain]].
* failing this, you might try to use a [[hard reset]] under the control of an external circuit, e.g. at the power supply, that can trip a relay and then untrip it after a time has elapsed
** heck, this can be done at power supply points increasingly distant: inside the PC or blade, at the rack, in the datacenter...
* all of these failing, you can try to notify the user, although by this point that probably is impossible; or you may rely on some external mechanism, such as a user or a watchdog, to try to use ever more extreme forms of resetting the system.

The above tends to imply that there is a linear order of reset mechanisms. This is not necessarily true. You may reset subsystems in heuristic order.
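A minimal sketch of such an escalation driver, in C - the linear ordering here and the try_*() hooks are assumptions for illustration, not a description of any real platform's recovery firmware:

    /* Hypothetical watchdog-driven reset escalation, roughly following the
       list above.  The try_*() functions are stubs standing in for real
       platform mechanisms; each returns nonzero if the hung CPU recovered. */
    typedef int (*recovery_step)(int cpu);

    static int try_interrupt(int cpu)  { (void)cpu; return 0; } /* ordinary IRQ  */
    static int try_nmi(int cpu)        { (void)cpu; return 0; } /* NMI / SMI     */
    static int try_soft_reset(int cpu) { (void)cpu; return 0; } /* INIT-like     */
    static int try_warm_reset(int cpu) { (void)cpu; return 0; } /* reset domain  */
    static int try_hard_reset(int cpu) { (void)cpu; return 0; } /* power relay   */

    int recover_hung_cpu(int cpu)
    {
        recovery_step steps[] = {
            try_interrupt, try_nmi, try_soft_reset, try_warm_reset, try_hard_reset
        };
        for (unsigned i = 0; i < sizeof steps / sizeof steps[0]; i++)
            if (steps[i](cpu))
                return 1;   /* recovered at this level of escalation */
        return 0;           /* everything failed: fall back to the user, or to an
                               external watchdog applying an even harder reset */
    }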

= RESET is a splitting concept =

I.e. overall [[RESET]] is one of those concepts that inevitably split,
whenever you look too closely at it.
Of course, create no more flavors of reset than you need, because each brings complexity.
But inevitably, if your system is successful and lasts for several years,
you will need one or two more flavors of RESET than were originally anticipated.

= Similar =

[[Watchdog timer]]s are a concept that splits similarly to RESET.
Indeed, each level of progressive reset for error recovery of a hung system is often driven
by a new level of [[watchdog timer]].
Or [[sanity timer]].

(Or [[neurosis timer]], or [[psychosis timer]] ... no, I am writing this on April Fools' Day.)

Thursday, March 31, 2011

Hardware data structures

Computer software education (at least, if not the whole field) was greatly influenced by Niklaus Wirth's 1976 book
"Algorithms + Data Structures = Programs".
This popularized the importance of a programmer or software practitioner having a toolkit of data structures,
and of algorithms that employ or operate on such data structures.

The situation in computer hardware is somewhat inchoate, less well defined.
Nevertheless the same [[hardware data structures]]
crop up over and over again in different applications.

I am not really sure what "algorithms" would be in the hardware version of "Algorithms + Data Structures = Programs".
Nor, for that matter, what "programs" would be.
Modules? Devices?
Sure, hardware implements algorithms, but it is hard to tease apart the concept of algorithms and data structures.
Moderately hard for software; especially so for hardware.

However, we should always keep in mind that hardware "algorithms" are just like software algorithms
and data structures, except that implementation in hardware has different constraints:
* size is limited - software often treats program size and data memory size as infinite
* communication is restricted: whereas in software any module can talk to any other module or object, in hardware only a limited number of nearby communications are cheap.

== List of [[hardware data structures]] ==
* [[Arrays]]
** [[RAMs]] versus [[CAMs]]
** [[Encoded CAMs versus Decoded CAMs]]
-
* [[CAM]]s
** [[Equality match CAM]]s
** [[Priority CAM]]s
** [[Greater than CAM]]s
** [[Range CAM]]s
** [[Bitmask CAM]]s, a more general case of [[Ternary CAM]]s
-
* [[Pipelines]] - I'm not sure that these should be considered a [[hardware data structure]]
** [[Stalling pipelines]]
** [[Stall-free pipelines]]
*** [[Draining pipelines]]
*** [[Replay pipelines]]
** [[Bubble collapsing pipelines]]
-
* [[Queues and Buffers]]
** [[Collapsing queues]]
-
* [[Circular arrays]]
-
* [[FIFOs]]
** [[True data movement FIFOs versus array-based circular buffers]]
** [[Synchronizers (asynchronous logic)]]
*** [[Synchronization and clock domain crossing FIFOs]]
**** [[AMD synchronization FIFO]]
**** [[Intel Bubble Generating FIFO (BGF)]]
-
** [[Alignment issues for queues, buffers, and pipelines]]
** [[Holes in queues, buffers, and pipelines]]
-
* [[Caches]]
* [[Fully associative in-array CAM tag matching]]
* [[Separate tag and data arrays]]
* [[N-way associative tag matching outside array]]
** [[index sequential]] - tag match, and then index data array
** [[index parallel]] - read out tags and data in parallel, then do a late mux
** [[in-array tag matching for N-way associative arrays]]
-
* [[Bloom Filters]]
** [[M of N Bloom filter]]
** [[N x 1 of M Bloom filter]]
** [[Pipelined Bloom filter]]
** [[Counted Bloom filter]]
-
* [[Arrays of 2-bit saturating counters]] crop up again and again, in predictors and elsewhere - see the sketch after this list
-
* [[Permutation bit matrix]]es
-
* Schedulers
** [[Pickers versus packers]]
** [[Prioritizers]]
*** [[Carry chain prioritizer]]
**** [[Pick most extreme on one side]] - [[pick oldest (or youngest)]]
**** [[Pick most extreme on both sides]] - [[pick oldest and youngest]]
*** [[CAM matching prioritizer]]
**** [[Pick N-th]] - [[pick N-th oldest]]
*** [[Prioritizers after permutation bit matrix]]
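As a concrete illustration of one entry in the list above, here is a minimal software model of an array of 2-bit saturating counters, of the kind used in branch predictors. It is a sketch only, not any particular predictor's design:

    /* Minimal software model of an array of 2-bit saturating counters. */
    #include <stdint.h>

    #define NCOUNTERS 4096
    static uint8_t counters[NCOUNTERS];   /* each holds a value 0..3 */

    /* Increment on taken, decrement on not-taken, saturating at 0 and 3. */
    void update_counter(unsigned index, int taken)
    {
        uint8_t *c = &counters[index % NCOUNTERS];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    /* Predict taken when the counter is in one of its two upper states. */
    int predict_taken(unsigned index)
    {
        return counters[index % NCOUNTERS] >= 2;
    }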




In the aftermath of [[Mead and Conway]]'s "VLSI revolution"
there was a flurry of work on arbitrary hardware data structures,
often corresponding 1:1 to software data structures.
These include
* [[hardware sorting networks]]
* [[hardware heap sorting structures]]
* [[true hardware stack]]s
* [[true hardware FIFO]]s
Unfortunately, many of these have not proved practical in most computer applications.

== Duality between Hardware and Software Data Structures ==

We have long observed a regular pattern of [[duality between hardware and software data structures]].

This can be observed
* When implementing a hardware microarchitecture inside a software simulator.
* When converting an algorithm implemented in software into hardware.
Oftentimes the most direct implementation of a hardware data structure in software is stupidly slow, but there are regular patterns of good equivalences.

Such duals include:
* Equality match CAM <-> hash table
* Range or greater-than CAM <-> no good SW equivalent
* [[PLA]]s <-> [[Lookup table (LUT)]]s. [[HW]] can handle irregular lookup tables - different granularities of LUT for different ranges - whereas [[SW]] must generally fall back on uniformly indexed tables.

The duality is useful,
not just for implementing HW in SW or vice versa,
but also because trends in VLSI scaling have meant, more and more, that the "traditional" hardware data structures
must be foregone.
E.g. limiting the number of bits on a bit-line in a [[RAM array]] or a [[CAM match bitline]]
often means that hash-table-like approaches should be used.
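As a hedged illustration of that duality: in a software performance model, a wide equality-match CAM is usually modeled not by comparing every entry - the literal translation of the hardware - but with a hash table keyed on the tag. The structure names below are invented, and insertion/deletion are omitted for brevity:

    /* Two software models of the same equality-match CAM lookup.
       cam_lookup_literal() mirrors the hardware: compare every entry "in
       parallel" (in software, a linear scan).  cam_lookup_hashed() is the
       usual software dual: a hash table keyed on the tag. */
    #include <stdint.h>

    #define CAM_ENTRIES 64
    struct cam_entry { uint64_t tag; int valid; int payload; };
    static struct cam_entry cam[CAM_ENTRIES];

    int cam_lookup_literal(uint64_t tag)          /* O(entries) per lookup */
    {
        for (int i = 0; i < CAM_ENTRIES; i++)
            if (cam[i].valid && cam[i].tag == tag)
                return cam[i].payload;
        return -1;
    }

    /* The dual: an open-addressed hash table, O(1) expected per lookup. */
    #define HASH_BUCKETS 256                      /* power of two */
    static struct cam_entry table[HASH_BUCKETS];

    int cam_lookup_hashed(uint64_t tag)
    {
        unsigned h = (unsigned)(tag * 0x9E3779B97F4A7C15ull) % HASH_BUCKETS;
        for (int probe = 0; probe < HASH_BUCKETS; probe++) {
            unsigned i = (h + probe) % HASH_BUCKETS;
            if (!table[i].valid) return -1;       /* empty slot: not present */
            if (table[i].tag == tag) return table[i].payload;
        }
        return -1;
    }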

== Dual or equivalent implementations for [[hardware data structures]] ==

* [[Arrays are equivalent to multiple flopped registers outputting to a MUX]]

== Other [[hardware data structures]] ==

The ShuffleBox: a rectangular array of bits that can store, XOR and rotate all bits in all directions.
* Hala A. Farouk
* An efficient hardware data structure for cryptographic applications
* http://www.universitypress.org.uk/journals/cc/cc-10.pdf

=== Shifters ===

Various shift implementations - [[barrel shifter]]s,
[[cascaded multistage shifter]]s
- "feel" to me like the hardware instantiation of an algorithm more than a data structure,
since they are stateless.
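For example, a [[cascaded multistage shifter]] shifts by each power of two in successive stages, each stage enabled by one bit of the shift amount. A software rendering of that structure, purely for illustration:

    /* Cascaded multistage (logarithmic) shifter: shift a 32-bit value left
       by 'amount' using five stages of fixed shifts (1, 2, 4, 8, 16), each
       selected by one bit of the shift amount - the way the hardware is
       typically structured. */
    #include <stdint.h>

    uint32_t log_shift_left(uint32_t value, unsigned amount)
    {
        for (unsigned stage = 0; stage < 5; stage++)    /* 2^5 = 32 positions */
            if (amount & (1u << stage))
                value <<= (1u << stage);                /* mux in this stage  */
        return value;
    }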

Perhaps I should limit the concept of a [[hardware data structure]] to something that can store state for retrieval later.

= References =
* http://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_%3D_Programs
* Wirth, Niklaus (1976). Algorithms + Data Structures = Programs. Prentice Hall. ISBN 978-0-13-022418-7.

Intel BGF mentioned in:
* ftp://download.intel.com/design/processor/datashts/320835.pdf

* http://en.wikipedia.org/wiki/Mead_%26_Conway_revolution

Wednesday, March 02, 2011

Dynamic dead code elimination and hardware futures

[[Category:BS]]
http://semipublic.comp-arch.net/wiki/Dynamic_dead_code_elimination_and_hardware_futures
= [[Dynamic dead code elimination]] =

It has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution.

There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of [[hardware futures]].

; [[Semi-static optimization]]

Dynamic dead code elimination may be done as a [[semi-static optimization]] in an [[optimized trace cache]].
Basically, one applies classic compiler algorithms to the code to be cached,
augmented by reachability predicates.

Or, another classic way of doing hardware optimization is to examine the post-retirement instruction stream, eliminate dead code, and install that in the [[optimized trace cache]].

; [[Rename time optimization]]

Dynamic dead code elimination may also be done in the [[renamer]] pipestage. The basic idea is, instead of emitting uops as they are decoded, one assembles blocks of uops in the renamer.
Essentially, large expressions, potentially with multiple inputs and outputs.
One does not emit such a block of uops from the renamer to execution until a usage of the block is detected.

Typically, a store that may be visible to another processor constitutes a usage that will cause uops to be emitted.
(Although certain stores can be elided, e.g. push/pop/push [[stack optimizations]] and [[temporary variable optimizations]].)

Similarly, the evaluation of a branch condition may cause uops to be emitted. But one only need emit the subset of the buffered uops necessary to evaluate the branch.

Exceptions, and other situations where the entire program state must be materialized, may cause uops to be emitted.

As may overflow of the renamer buffer.

If the output of an expression buffered in the renamer is overwritten, that part of the expression can be eliminated as dead code.
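A toy software model of that overwrite rule - my own illustrative sketch of the idea, not a description of any real renamer:

    /* Toy model of rename-time dead code elimination: buffer uops in the
       renamer and, when a new uop overwrites an architectural destination
       whose previous producer was never read, mark that producer dead.
       Stores, flags, exceptions and buffer overflow - which force real
       emission - are ignored in this sketch. */
    #define NREGS 16
    #define BUFSZ 64

    struct uop { int dest, src1, src2; int live; };

    static struct uop buffered[BUFSZ];
    static int nbuffered;
    static int last_writer[NREGS];        /* index into buffered[], or -1      */
    static int read_since_write[NREGS];   /* has that value been consumed yet? */

    void rename_init(void)
    {
        nbuffered = 0;
        for (int r = 0; r < NREGS; r++) { last_writer[r] = -1; read_since_write[r] = 1; }
    }

    void rename_uop(int dest, int src1, int src2)
    {
        read_since_write[src1] = read_since_write[src2] = 1;

        /* Overwriting a value nobody has read: its producer is dynamically dead. */
        if (last_writer[dest] >= 0 && !read_since_write[dest])
            buffered[last_writer[dest]].live = 0;

        struct uop u = { dest, src1, src2, 1 };
        buffered[nbuffered] = u;
        last_writer[dest] = nbuffered++;
        read_since_write[dest] = 0;
    }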

; [[Dataflow optimization]]

[[Dynamic dead code elimination]] can also be done in the OOO dataflow scheduling part of the machine.
Essentially, one assembles the dataflow graph, but one does not necessarily start execution as soon as inputs are ready.

The renamer adds an additional [[WAW]] arc to the dataflow graph, for a new uop that overwrites an old uop's output.
If that old uop has no [[RAW]] dependents, then it can "fire", cancelling itself,
and signalling that cancellation to its own input dependencies in turn.

This is essentially [[reverse dataflow execution]].
It is straightforward, although it does require hardware that is not normally present.
It is easiest to accomplish with a [[bit matrix]] scheduler, where the [[input bitvector]] becomes the [[output bitvector]] of a cancelled uop.
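A sketch of that reverse propagation, using per-uop consumer counts in place of a literal bit matrix - again my own illustration of the idea, not a known implementation:

    /* Reverse dataflow "anti-execution": when a uop's result has been
       overwritten by the renamer and it has no remaining consumers, it
       cancels itself and propagates that cancellation to its producers,
       which may in turn become dead.  Uops with side effects (stores,
       faults) would simply never be marked overwritten. */
    #define MAXUOPS 256
    #define MAXSRC  2

    struct dfg_uop {
        int producers[MAXSRC];   /* indices of producer uops, or -1          */
        int nconsumers;          /* uops still waiting to read this result   */
        int overwritten;         /* a younger uop has renamed over the dest  */
        int cancelled;
    };

    static struct dfg_uop dfg[MAXUOPS];

    static void cancel(int u)
    {
        if (u < 0 || dfg[u].cancelled)
            return;
        dfg[u].cancelled = 1;                 /* "fires" backwards: never executes */
        for (int s = 0; s < MAXSRC; s++) {
            int p = dfg[u].producers[s];
            if (p >= 0 && --dfg[p].nconsumers == 0 && dfg[p].overwritten)
                cancel(p);                    /* dead in turn: keep propagating */
        }
    }

    /* Called when the renamer overwrites uop u's architectural destination. */
    void mark_overwritten(int u)
    {
        dfg[u].overwritten = 1;
        if (dfg[u].nconsumers == 0)
            cancel(u);
    }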

It is not at all clear how useful this form of dataflow dead code elimination is:
it is buildable, but how much does it help?
It is probably most useful if it can be remembered in an [[optimized trace cache]].



= [[Hardware futures]] =

[[Hardware futures]] are the dual of [[dynamic dead code elimination]].
Instead of eliminating code that does not need to be executed,
they delay execution and evaluation of code until it is known to be needed.
This delay naturally accomplishes [[dynamic dead code elimination]].
It also provides increased scope for [[hardware CSE]],
and other optimizations.
But it potentially increases latency.

As for [[dynamic dead code elimination]],
it is fairly easy to see how to do this at the renamer:
have the renamer accumulate expressions.
Only release an expression to be evaluated when a usage that demands evaluation is detected.

We can go further, and avoid the buffering at the renamer becoming an issue by emitting such buffered instructions for a hardware future
into a sort of side car storage, rather like that of Akkary's [[CFP]].
When evaluation is demanded, we issue a token to cause insertion of those instructions.

Issue: are the buffered instructions pre or post rename? It is easiest to buffer them post rename,
using dataflow single assignment registers and memory locations.

Also as above, this can be deferred into the OOO dataflow scheduler.
But, again, that is using a fairly critical resource.

Making hardware futures less dynamic - more [[semi-static]], at the [[optimized trace cache]] -
involves dividing the future into two parts:
the first part captures any live inputs,
that must be captured lest they change;
the second part does the actual evaluation on demand.
Typically memory load values must be live inputs, as well as registers,
and the future is constrained from doing stores that may be visible to other threads.

I.e. the hardware future is constrained to be a chunk of code that has no internal memory references.
Or at least only sharply limited internal memory references.
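A software analogy of that two-part split - essentially the familiar future/thunk pattern, used here only to illustrate what the hardware would have to capture; nothing below reflects any actual hardware proposal's encoding:

    /* Software analogy of the two-part hardware future described above:
       part 1 captures the live inputs at creation time (registers and any
       loaded memory values, which might otherwise change); part 2 is the
       deferred evaluation, run only when some use demands the result. */
    struct hw_future {
        /* part 1: captured live inputs */
        double  captured_regs[4];
        int     evaluated;
        double  result;
        /* part 2: evaluation on demand (no stores visible to other threads) */
        double (*evaluate)(const double inputs[4]);
    };

    double future_force(struct hw_future *f)
    {
        if (!f->evaluated) {                   /* demand: a branch, a store,
                                                  an exception, ...          */
            f->result = f->evaluate(f->captured_regs);
            f->evaluated = 1;
        }
        return f->result;
    }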

It is unclear how useful such a future would be. It would have to save considerable register-based computation
- complicated instructions such as transcendentals - or be a complicated calculation on a small amount of memory.

Such hardware futures might be much more useful if we were released from the constraints of the [[Von Neumann memory model]],
e.g. if we had classic dataflow style [[single assignment memory]].
But in 2011 that does not seem likely any time soon.

Using Redundancies to Find Errors

Cleaning out old journals from my bookshelf.

Came across

Yichen Xie, Dawson Engler, "Using Redundancies to Find Errors," IEEE Transactions on Software Engineering, pp. 915-928, October, 2003

(Actually, a slightly different version appeared in Software Engineering Notes / SIGSOFT 2002 / FSE 10.)

Very interesting paper: it used analysis tools that flag dead code, redundant conditionals, and idempotent operations such as x&x to find bugs.

Causes me to wonder about applying similar techniques to look for hardware/RTL/HDL bugs, e.g. in Verilog or VHDL.

For that matter, it caused me to think about similar hardware techniques. For example, it has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution. There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of hardware futures.

Now, such dynamic dead code detection *might* help locate bugs - but it suffers because it is dynamic, bound to the current path of execution. Whereas Xie's work looks at reachable code, including paths that may not be currently taken but which might use the values that are dead on the current path.

--

Done in xgcc, so code should be publicly available.

I wish I had had such tools in my last project.

Tuesday, March 01, 2011

Stream Processors, specifically Stream Descriptor Processors

http://semipublic.comp-arch.net/wiki/Stream_processor
{{Terminology Term}}
[[Category:ISA]]
[[Category:Memory]]


= [[Stream processor]] is an overloaded term =

The term [[stream processor]] is fashionable and overloaded.
It often means just a processor that is "optimized" for a [[stream workload]].
Where "optimized" may mean just "has little cache".
(E.g. the use of the term "stream processor" in [[GPU]]s such as those by AMD/ATI and Nvidia.)

= What is a [[stream]]? =

A [[stream workload]] operates on sets of data,
operating on one or a few data elements
before moving on,
not to return to those data elements for a long time.
Because of this access pattern,
[[LRU replacement]] is often ineffective, and may even be pessimal,
for [[stream workload]]s.
Keeping cache entries around that are not going to be reused is wasteful.
Sometimes a stream is restarted, in which case an MRU policy that keeps the head of the stream in cache is better than LRU.
Sometimes there is no reuse at all.
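As a worked illustration of why [[LRU replacement]] can be pessimal on a cyclic stream: cycle 5 distinct blocks through a 4-entry fully associative set, and LRU misses on every single access, while a policy that instead evicts the most recently used line keeps most of the stream resident. A toy simulation, illustrative only and not modeling any real cache:

    /* Toy demonstration: 5 blocks cycled through a 4-entry fully
       associative cache, under LRU and under an "evict MRU" policy. */
    #include <stdio.h>

    #define WAYS 4

    static int run(int evict_mru)
    {
        int tag[WAYS], age[WAYS], hits = 0;      /* age: higher = more recent */
        for (int w = 0; w < WAYS; w++) { tag[w] = -1; age[w] = 0; }

        for (int t = 1; t <= 1000; t++) {
            int block = t % 5;                   /* cyclic stream of 5 blocks */
            int victim = 0, found = -1;
            for (int w = 0; w < WAYS; w++) {
                if (tag[w] == block) found = w;
                if (tag[victim] != -1 &&
                    (tag[w] == -1 ||
                     (evict_mru ? age[w] > age[victim] : age[w] < age[victim])))
                    victim = w;                  /* prefer invalid, else by policy */
            }
            if (found >= 0) { hits++; age[found] = t; }
            else            { tag[victim] = block; age[victim] = t; }
        }
        return hits;
    }

    int main(void)
    {
        printf("LRU hits: %d/1000, evict-MRU hits: %d/1000\n", run(0), run(1));
        return 0;
    }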

= [[Stream descriptor processor]] =

By [[stream processor]], I mean processors with [[stream descriptor]]s
- processors that have specific features, whether ISA or microarchitecture,
that support streaming.
(I would prefer to reserve this term for ISA features, but [[many good ISA features become microarchitecture]].)
Possibly should call them [[stream descriptor processor]]s,
to distinguish them from the overloaded term [[stream processor]].


= [[Stream descriptor]] =

A [[stream descriptor]] is a description, in a register or memory data structure, that describes a memory access pattern.

For example, a [[1D stream descriptor]] might include
* data item description - size, type
* base
* stride
* length
* current position

A [[2D stream descriptor]] might include
* data item description - size, type
* base
* horizontal stride, length
* vertical stride, length
* current position, row and column

Obviously extendable to N dimensions.
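As a hedged sketch of what such descriptors might look like as data structures - the field names and layout are mine, for illustration; no particular ISA encoding is implied:

    /* Illustrative 1D and 2D stream descriptors, and the address generation
       the 2D case implies. */
    #include <stdint.h>
    #include <stddef.h>

    struct stream_1d {                 /* the 1D case, shown for completeness */
        uintptr_t base;
        size_t    elem_size;           /* data item description: size (type omitted) */
        ptrdiff_t stride;              /* in bytes                                    */
        size_t    length;              /* number of elements                          */
        size_t    pos;                 /* current position                            */
    };

    struct stream_2d {
        uintptr_t base;
        size_t    elem_size;
        ptrdiff_t h_stride;  size_t h_length;    /* horizontal stride, length */
        ptrdiff_t v_stride;  size_t v_length;    /* vertical stride, length   */
        size_t    row, col;                      /* current position          */
    };

    /* Address of the next element of a 2D stream, advancing the current
       position - roughly what a stream address-generation unit computes. */
    uintptr_t stream_2d_next(struct stream_2d *s)
    {
        uintptr_t addr = s->base + s->row * s->v_stride + s->col * s->h_stride;
        if (++s->col == s->h_length) { s->col = 0; ++s->row; }
        return addr;
    }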

In addition to arrays, there may be other types of stream descriptor:

A scatter/gather descriptor
* a 1D descriptor of a vector of addresses
* with an indication that each address should be dereferenced

A linked list descriptor
* Base
* a skeleton [[structure descriptor]]
** payload to be accessed while traversing the list - offset, type
** position within a list element of pointer to next element

A tree or B-tree descriptor
* Base
* a skeleton [[structure descriptor]]
** payload
** key
** pointers

For the case of a tree or B-tree descriptor,
the access pattern may
(a) traverse all of a tree or subtree
or (b) traverse only a path seeking an individual element




= Why [[stream processor]]s? =

There are two main motivations for stream processors:
* memory access patterns
* data manipulation overhead
I think there is an additional property, probably not considered important by most, but [[IMHO]] quite important:
* decoupling and ease of programming

; Memory Access Patterns

Support for streaming memory access patterns is probably the most important.
As mentioned above, many [[stream workload]]s do not fit in the cache.
But that could be handled by a simple [[cache bypass hint]] instruction.

Of more long term interest is the possibility that [[stream descriptor]]s, etc., may allow
streams that fit in the cache to be managed.
As cache sizes grow, [[memory data structures]] that did not fit in an old, smaller cache may fit in a new, larger cache.
E.g. a 1000x1000x32b array does not fit in a 32K [[L1$]], but may fit in an 8M [[L3$]].
Of course, this cannot be managed in isolation:
a workload that accessed only one 2MB 2D array might fit in an 8MB cache, but one that accessed 4 such would not.
Nevertheless, explicitly indicating the data structures provides more information to whoever is trying to manage the cache,
whether hardware or software.
    The [[stream descriptor]] may have a bit associated with it, either inside or [[non-adjacent]], that indicates which cache levels to use.
Furthermore, even in the absence of cache control in the stream descriptor, the stream descriptor provides guidance for [[cache prefetch]].
Guidance that may be more accurate than [[SW prefetch using cache control instructions]].
A stream descriptor may be associated with a prefetch control data structure, that seeks always to avoid delays accessing a stream,
without excessive prefetch.

; Data Manipulation Overhead

Users of [[SIMD packed vector]] instruction sets quickly come to realize
that the actual memory access and computation is only a small part of the task of programming code to use such instructions.
Often more than half of the instructions in such kernels are
data manipulation overhead
* unpacking 8 bit numbers into 16 bit numbers, or, worse, even fancier unpackings such as 5:6:5
* transposing rows and columns
* [[chunky versus planar]] data, i.e. [[SOA versus AOS]]

Explicit support for streams via [[stream descriptor]]s can eliminate much of this overhead, avoiding explicit instructions.
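To make that overhead concrete, here is the kind of scalar unpacking that such data-manipulation instructions (or stream descriptor hardware) would absorb: an RGB 5:6:5 pixel expanded to 8:8:8, written plainly rather than in any particular SIMD ISA:

    /* Unpack one RGB 5:6:5 pixel into 8-bit-per-channel components.  In a
       SIMD packed vector kernel, shuffles, shifts and masks like these
       often outnumber the "real" arithmetic; stream/descriptor hardware
       aims to hide this packing and unpacking. */
    #include <stdint.h>

    void unpack_565(uint16_t pixel, uint8_t *r, uint8_t *g, uint8_t *b)
    {
        uint8_t r5 = (pixel >> 11) & 0x1F;
        uint8_t g6 = (pixel >> 5)  & 0x3F;
        uint8_t b5 =  pixel        & 0x1F;
        /* Scale up to 8 bits, replicating the high bits into the low bits. */
        *r = (uint8_t)((r5 << 3) | (r5 >> 2));
        *g = (uint8_t)((g6 << 2) | (g6 >> 4));
        *b = (uint8_t)((b5 << 3) | (b5 >> 2));
    }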

[[The CSI Multimedia Architecture]] goes further: not only does it hide the data packing and unpacking,
but it also seeks to hide the [[SIMD packed vector]] nature of the hardware datapath.
Instead of the programmer loading from a stream into a register of N bits, operating, and then storing,
this paper proposes complex streaming instructions that are implemented N bits at a time
in such a way that the code does not need to be changed when the microarchitecture changes the width of N.

Others have attempted to go even further, providing ways of expressing streaming kernels using simple instructions
that automatically get mapped onto a fixed length datapath.

; Decoupling and ease of programming

Streams are the equivalent of UNIX pipes.

If you had an architecture that allowed [[lightweight thread forking]],
some code could be written by connecting two library function kernels together by a stream
- instead of
* writing two separate loops that store the intermediate value to memory
* or merging the two operations into a single loop for a single pass

PROBLEM: allocating buffer space for such inter-thread streams. It may be necessary to spill to memory in some situations.


= If streams are such a good idea, why aren't they popular yet? =

TBD.

Proper [[stream descriptor processor]]s had not taken off as of 2010.

Instead, [[SIMD packed vector]] has dominated.


= See Also =

;[[The CSI Multimedia Architecture]], IEEE T. VLSI, Jan 2005.
: [[Stream descriptor]]s, plus optimizations to allow implicit SIMDification.
* [[iWarp]] - stream descriptors bound to registers
* [[Wm]] - William Wulf's processor; the first instance I know of with stream descriptors bound to registers

Sunday, February 27, 2011

Unified_versus_Split_or_Typed_Register_Files

http://semipublic.comp-arch.net/wiki/Unified_versus_Split_or_Typed_Register_Files

There used to be a big debate in computer architecture about whether
one should have [[Unified versus Split Integer and Floating Point]].
Both ISA or macroarchitecturally, in the programmer-exposed register
file, and microarchitecturally, in the implementation.

E.g. the [[Intel P6 microarchitecture]] implemented the [[Intel
x86/x87 ISA]], with separate integer and floating point. But the
original P6 microarchitecture implemented this in a unified datapath,
in which both integer and floating point use the same scheduler, and
occupy the same [[PRF]] in the form of the unified [[ROB]], each entry
of which can hold a full 80+ bit x87 FP value. But they are split at
the [[RRF]].

Over time, Intel added [[SIMD packed vector extensions]]. [[MMX]]
shared the x87 registers, both architecturally and
microarchitecturally (in most implementations). But wider SIMD, such
as SSE with 128 bit [[XMM]] registers, and AVX with 256 bit [[YMM]]
registers, introduced yet another split architectural register file.
Microarchitecturally, they may have kept the P6 unified structure, but
the wastage of storing an 8, 16, 32, or even 64 bit scalar in a 128 or
256 bit wide PRF entry becomes more and more of a concern.

AMD, by contrast, has typically had separate integer and FP/SIMD
clusters, with separate schedulers and PRFs and ...

Intel's 2011 processor, Sandybridge, has split integer and SIMD PRFs,
a unified scheduler, but split scalar and SIMD clusters.

It is becoming increasingly obvious that any desire to split datapaths
should be based not on type, not on integer versus floating point, but
on width: integer or scalar versus SIMD packed
vectors. Narrow data types, 8, 16, 32, 64 bits, versus wide packed
data types, 128, 256, possibly even 512 bits wide. Latency versus
throughput.

Currently it is pretty standard to separate

* integer, scalar, 8, 16, 32, and 64 bit data. Addresses and branches. Latency sensitive

* SIMD packed vector, both integer and FP. 128, 256, and possibly 512 bit data. Bandwidth.

The only real question is whether one should support scalar FP on a
datapath optimized for low latency, and interaction with branches and
addresses. Or whether scalar FP should only be handled as a single
lane in a vector register, as in SSE, in a throughput manner.

Now, there will always be workloads where FP scalar latency will
matter. For that matter, there will always be workloads where vector
latency matters. Nevertheless, this division into narrow latency and
wide throughput datapaths seems to be a trend.

It will always be desirable to be able to access the bits of an FP
number, whether scalar or vector, in an ad-hoc manner. It will always
be desirable to branch and form addresses based on such accesses. But
these desires will not always outweigh the separation into narrow
latency and wide throughput clusters.

Power concerns now are paramount. Split datapaths naturally save
power; but unified datapaths may be able to save power in an even
finer grained manner. Much depends on the granularity possible with
[[clock gating]] (which can be quite fine grained, e.g. to bytes of a
datapath) and [[power gating]] (which in 2011 is quite coarse grained,
unlikely to be done within the width of a unified datapath). For that
matter, separate narrow/latency and wide/throughput clusters would
probably be easier to clock at different frequencies, with different
voltages - although as of 2011 this does not seem yet to be done.

I can make no hard and fast conclusion: although there is a trend
towards split narrow/latency and wide/throughput datapaths, there will
still be occasional winning attempts to unify. Much depends on the
evolution of power, frequency, and voltage management, in terms of
granularity.

Why non-inclusive/non-exclusive caches?

Sheesh! I can't usually take all the time this article needed to write in one sitting.


http://semipublic.comp-arch.net/wiki/Why_non-inclusive/non-exclusive_caches

[[Category:Cache]]


[[Non-inclusive/non-exclusive (NI/NE)]] caches are used in
[[multilevel cache hierarchies]] that are, duh, neither [[inclusive]]
nor [[exclusive]].

I.e. there is no enforcement of either the [[cache inclusion
property]] nor the [[cache exclusion property]]. A cache line in an
inner cache may or may not be in an outer cache. You can't be sure.
If necessary, you must check.

Because there is no [[cache inclusion]], both the inner and outer
caches must be checked, i.e. snooped or probed, when responding to
requests from other processors.

= Historical Background - Accidental Inclusion in the Original [[P6]] =

Probably the most widely known processor that has an [[NI/NE]] cache
is the [[Intel P6 family]].

The original P6 had an external L2 cache, on a [[back-side bus]]. In
an ascii diagram (TBD, I need drawing in this wiki):

[[L2$]] <---> P([[I$,L1$]]) <---> [[FSB]]

In more detail

    [[L2$]] <---------> [[SQ]] <---> [[FSB]]
                          ^
                          |
                        [[FB]]
                          |
                 +--------+--------+
                 |        |        |
                 v        v        v
              [[I$]]    [[P]]    [[L1$]]

Most of the symbols used in the diagrams above are standard; as for
those that are not:

* [[SQ (super queue)]] - on the [[P6 Pentium Pro]] die, coordinated interaction between the [[front side bus (FSB)]], the back side bus and the off-chip [[L2]], and the [[DCU]], mainly interacting with the [[fill buffers (FB)]].

* [[FB (fill buffers)]] - a staging area under the control of the [[data cache unit (DCU)]] of the P6 cpu core, talking to the [[bus cluster]] and hence the [[SQ]], [[L2$]], and the outside world via the [[FSB]], and to the caches of the [[processor core]], the [[IFU]] and [[DCU]] arrays. Also talked directly to the processor core, to minimize latency.

The original P6 had a completely external [[L2$]]. Not just the
[[cache data array]]s, but also the [[cache tag array]]s.

Now, there were several reasons why the original P6 had [[NI/NE]] or
[[accidental inclusion]]:

== Backside L2$ ==

Because the L2$ was on a backside bus, it was not necessarily faster
to access than the L1 caches, the [[IFU]] and [[DCU]]. In fact, you
can easily imagine circumstances where it might have been slower to
[[snoop]] or [[probe]].

    Note: having the L2$ tags on-chip might have made it possible to probe the L2$ faster than the L1$.

== [[Deterministic Snoop Latency]] ==

One of the design goals of the original P6 was to have [[deterministic
snoop latency]] - to be able to respond within a relatively small
fixed number of cycles to an snoop from remote processors on the
[[FSB]].

[[Cache inclusion]] would get in the way of this. You would first
have to probe the L2$, and then, if a hit, probe the L1$. This
naturally leans towards a [[variable snoop latency]], and slows the
snoop down.

In the long run I think that [[deterministic snoop latency]] must be
considered a worthy goal, but a dead-end. I think even the original
P6 gave up on it, providing it in most cases, but also provided a
snoop stall as a way of handling cases where it could not be provided.
Nevertheless, the goal of [[deterministic snoop latency]] influenced
the design.

== L2$ relatively large compared to the L1$ ==

The external, off-chip but inside-package, L2$ for the original P6
Pentium Pro was 512KiB or 1MiB, with some versions as small as 256KiB.

The DCU or L1 data cache was die dieted over the life of the project
from 64KiB in our dreams to 16 or 8KiB.

Thus, the L2 was typically 32x bigger than the L1 data cache,
and similarly wrt the instruction cache.

Although Intel had experience with [[exclusive cache]]s from earlier
processors, with a 32:1 or 16:1 size ratio there is little point in
using an exclusive cache to try to increase cache hit rate. The
increase in the hit rate from having circa 3-6% more data is small.
Most of the time the data in the L1$ is included in the L2$ by
accident.

This property Wen-Hann Wang called [[accidental inclusion]]. Wen-Hann
Wang was the leader of the P6 cache protocol design, and the inventor
(in his PhD) of [[cache inclusion]].

== Simplifying Design ==

In truth I must admit that we started P6 expecting to build an
[[inclusive cache]] hierarchy. However, Wen-Hann's performance data
showed that [[accidental inclusion]], because of the relatively large
L2$, removed most of the performance incentive for inclusion. And
moving to [[NI/NE]] considerably simplified the cache protocol,
removing many boundary conditions.

For example, how do you handle cases where a cache line is being made
dirty in the L1 exactly as it is being evicted from the L2? ... It is
not impossible to handle these cases. But [[accidental inclusion]]
made it just plain easier to leave them decoupled.

    By the way, at the time on P6 we did not really have a term for the cache inclusion properties of the P6 L2, or lack thereof. Wen-Hann called it [[accidental inclusion]], but that term has stuck only in Glew's memory. The term [[non-inclusive/non-exclusive (NI/NE)]] was created years later.

== [[Silent eviction of clean data]] ==

It is, of course, always necessary to perform a [[writeback]] when
[[evicting dirty data]].

When [[evicting clean data]], however, one can silently drop the clean
line. It is not always necessary to inform the outer cache.

Now, this isn't 100% linked to [[cache inclusion]] or [[NI/NE]]. One
can perform [[silent eviction of clean data]] in an [[inclusive
cache]]. However, doing so means that any [[inclusion tracking]] that
the outer cache is performing, so that it knows which cache lines in the
outer cache are inside which inner cache(s), may be inaccurate.
I.e. [[possibly inaccurate inclusion tracking]].

[[Accurate inclusion tracking]] is a [[slippery slope]]. It is not
necessary, but it has several nice properties. Worse, some people
implicitly assume [[accurate inclusion tracking]] in any [[inclusive
cache]] design.

[[Accidental inclusion]] or [[NI/NE]] avoids this slippery slope
altogether. Which not only simplifies design, but also avoids the
sort of incorrect assumptions that often arise on such a [[slippery slope]].

== Allocation in L1$ without L2$ interaction ==

[[Accidental inclusion]] enabled or made easier certain cache protocol
optimizations, in particular [[eliminating unnecessary RFOs]] for
full-line cache writes, without requiring L2 involvement.
Especially as used in [[P6 fast strings]].

This probably was not a design goal for most of the people involved,
but it was something I (Andy Glew) was happy to take advantage of.

The basic observation is that a sizeable fraction of all cache lines
are completely overwritten, without any of the old data being read.
Therefore, reading the old data via an [[RFO (read for ownership)]]
bus request is wasted bandwidth.

In operations such as [[block fill]], e.g. zeroing a page of memory,
50% of the data bus bandwidth is wasted by such unnecessary RFOs: each
line written costs an RFO read plus an eventual writeback, and the read
returns data that is never used. In [[block copy]] the waste is 33%:
source read, destination RFO, destination writeback, of which the RFO
is unnecessary. (Assuming all miss the cache - not that bad an
assumption for an 8 or 16KiB L1$, but also not unreasonable even on
many larger caches, for many [[block memory operations]].)

Even in non-block memory code, unnecessary RFOs occupy a significant
fraction of data bus bandwidth. I have measured speedups of 10-20% on
many simulated workloads, and even on a few [[FIB-edited chips]].

    The biggest challenge with [[eliminating unnecessary RFOs]] is maintaining [[memory ordering]], in particular [[store ordering]]. It is easy to do on a [[weakly ordered]] system. I have invented several techniques for maintaining memory ordering, some of which are [[NDA]], but some of which can be discussed here. When I get around to writing them up. In my [[copious free time]].

The original P6 eliminated unnecessary RFOs using the [[deferred
invalidation cache protocol]] for [[fast strings]] [[block memory
operations]]. The bandwidth benefit was clear; unfortunately, the
overhead of microcode, in the absence of microcode branch prediction,
made fast strings not always the best approach. Since the original P6,
[[fast strings]] have been hit and miss: when somebody in microcode
gets around to tuning them they can be a win, but if untuned they may
be a loss compared to the best hand rolled assembly code.

P6's [[WC memory type]] also eliminated unnecessary RFOs, but was not
allocated in the cache. Similarly the [[SSE streaming stores]] added
in later processors.

The [[deferred invalidate cache protocol]] continued in use all the
way to Intel's Nehalem processor, the last of the P6 family. Nehalem,
however, added an inclusive [[LLC]] with snoop filtering, killing the
deferred invalidate cache protocol.

So much for history. The point remains: not having an inclusive cache
enables certain cache protocol optimizations, such as [[eliminating
unnecessary RFOs]], that are much harder to build in an inclusive cache
system.

== Optionally Removing the L2$ ==

The [[NI/NE]] cache structure makes it easier to optionally disable or
remove the [[L2$]]. This option has been explored several times.

Now, it is possible to do this with an inclusive cache structure by
"faking out" the absent L2$, always returning "Yes, the cache line is
included in all possible internal caches", such as the instruction and
data and other caches.

But it is very easy, in an inclusive cache hierarchy, to slip down
the slippery slope and assume that the L2$ is present. And then to
take advantage of the L2$ being present in a way that breaks when the
L2$ is absent. [[NI/NE]] seems to avoid this.


== Enabling [[Funky Caches]] and Decoupling Associativity ==

In an [[inclusive cache]] system, one tends to want the associativity
of the outer, inclusive, cache to be greater than or equal to the sum
of the associativities of the inner caches it "covers".

E.g. if you have a 16 way associative outer cache that is inclusive of
8 inner caches, all of 4 way associativity, and if the usual indexing
of the cache using bitfields is performed (which usually implies a
1:many relationship between inner sets and outer sets), then it is
possible that for some set So of the outer cache mapping to sets Si of
each of the inner caches, each inner cache may want to have a distinct
line in each of the 4 ways. Implying that the inner caches may "want"
to have 8*4 distinct lines. Which exceeds the 16 way associativity of
the outer cache.

[[NI/NE]] caches do not have this constraint on associativity.

Furthermore, when playing the [[Funky Cache Rag]], exploring
structures such as [[victim caches]] or [[prefetch caches]] or
specialized caches for different data types, [[cache inclusion]] once
again gets in the way. Often these [[Funky Caches]] are fully
assiociative. They do a great job of reducing glass jaws related to
cache associativity - e.g. a 16 or 32 entry victim cache makes a 4 way
cache look like it has a much higher effective associativity. But a
32 entry fully associative victim or prefetch cache may only be able
to use 16 entries, if all entries map to the same way in the outer
cache. [[Cache inclusion]] thus prevents some of the worst [[glass jaws]],
exactly the ones these caches exist to fix, from being fixed.

Now, these interactions are not necessarily large, at least in average
effect, although they can be a fertile source of [[glass jaws]].

However, they do couple the design of the outer cache and the inner
cache. They mean, for example, that the outer cache design team
cannot simply change associativity without checking with the inner
cache design team, which may be making assumptions based on the
earlier outer cache design team [[POR]] for associativity. It means
that the outer and inner caches are more tightly coupled. Which can
lead to more efficient designs. But which can also make it slower and
harder to innovate.

== Backwards Invalidation Traffic ==

There are two main methods to maintain [[cache inclusion]]:
[[backwards invalidation]], and the considerably less practical
[[outer cache victim identification by inner cache]]. I will ignore
the latter.

[[Backwards invalidation]]s are simple and straightforward. But they
also consume power, and affect performance on an outer-to-inner cache
interconnect that may already be saturated. And, in any case, they are an
incremental source of complexity.

[[NI/NE]] cache hierarchies eliminate [[backwards invalidation]]s.

Now, in the 2011 Intel processors that motivated this article,
Sandybridge, with an inclusive snoop filtering [[LLC]] but a [[NI/NE]]
[[MLC]], backwards invalidations are still required for LLC
evictions. But not for MLC evictions.

Let's talk some scenarios:

In workloads that use mainly read-only data, [[NI/NE]] allows [[silent eviction of clean data]], but [[cache inclusion]] requires BIs.

In workloads that use a lot of shared data, there may be multiple BIs for every inclusive cache eviction.

Etc. etc.

It is by no means certain that the power saved by [[probe filtering]] outweighs the power saved by eliminating BIs.
It all depends on the type of multiprocessing in the system.
If the coherency traffic from beyond the outer cache is small relative to the cache thrashing inside the system,
e.g. if you have 16 cores crunching on large data sets, but no other multicore CPU chips and relatively small I/O,
BI traffic may outweigh snoop filtering.
But if you are building a very large shared memory system with many other CPUs and much cache coherent I/O traffic,
then inclusion may win.


= Are [[NI/NE]] caches still important? =

Obviously, since [[NI/NE]] caches are still in use in the Intel
Sandybridge processors being released at the time of writing in 2011,
they are still important in a practical sense. But is this just
[[legacy computer architecture]] dating back to the original P6? If
they were designed from scratch, would [[NI/NE]] still be built?

Maybe... lacking access to full simulation results, it is hard to say.

However, many of the advantages of [[NI/NE]] relate to making design
easier, decoupling inner and outer cache design. This is probably one
source of the longevity of the [[P6 microarchitecture]], which is now
coming to an end with Sandybridge.

Certainly, the old advantage of [[eliminating unnecessary RFOs]] no
longer stands. But that was always one of the most minor advantages of
[[NI/NE]].

Much depends on the system. [[Probe filtering]] [[coherency traffic]]
makes a big difference in large multiprocessor systems with many outer
caches. It also helps with large amounts of high bandwidth cache
coherent I/O, as sometimes arises with an ill-tuned [[discrete GPU]].
(Most discrete GPUs are tuned to use [[non-cache coherent I/O]], if
accessing [[UMA]] or processor memory at all.)

But as multicore systems get more cores,
and GPUs are largely integrated,
there may be less bandwidth on the other side of the [[outermost cache]].

== Where does cache inclusion win? ==

=== Snoop or probe filtering ===

One big win for cache inclusion is [[snoop filtering]].

However, such [[probe filtering]] does not necessarily require that
the cache data be inclusive. A data-less [[probe filter]] or
[[ownership cache]], a structure that allows ownership to be tracked,
and which permits lines to be exclusively owned even though they reside only in main
memory, may provide the benefits of [[probe filtering]] without the
costs of cache inclusion on a relatively small data cache. At the
cost of a new [[hardware data structure]], the [[ownership cache]].

Basically, the ownership cache maintains the cache inclusion property,
not necessarily the data cache. Since the ownership cache maintains
only a few state bits per cache line, the ownership cache can cover
much more memory than an ordinary data cache. It is therefore less
liable to the issues of cache associativity.

=== Power management ===

Probably the most important win for [[cache inclusion]] is wrt power
management - really just an elaboration of [[snoop filtering]].

If you have [[cache inclusion]] you can power down all of the inner
caches, putting them into a low power mode that maintains state, but
which does not snoop.

Whereas if you have a [[NI/NE]] cache, you either must keep snooping
the inner caches, or you must flush the inner caches when going into
such a [[no-snoop cache mode]].

Now, you always need to be able to do such cache flushes. But they
take time, and if they are not required you can enter the low power
mode more quickly, and more often, and thereby save more power.

Note that an inclusive LLC in combination with a NI/NE [[MLC]] and [[FLCs]]
gives the best of both worlds in this regard.
Note also that a dataless [[snoop filter]] also works.

= Conclusion =

Many people, taught in school about [[inclusive caches]], are
surprised to learn how common and important [[exclusive caches]] and
[[non-inclusive/non-exclusive caches]] are, or at least have been, in
history.

It is by no means clear that any approach is always best. All 3
approaches have pros and cons, and win in some situations.