The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, September 07, 2010

Varieties of Cache - CompArch

Varieties of Cache - CompArch: "Varieties of Cache"

The basic principle of caching is to use a small, fast structure (typically an [[Array]])
improve the performance of a large slow, structure,
hopefully to approximate a large, fast, structure, without the expense.

However, the definition of "fast" may vary. And there may be other reasons to use a cache.

The most familiar caches are [[latency caches]].
E.g. an L1 data cache that may take 4 cycles to access, an L2 20 cycles, and main memory 200 cyckes.

Another useful cache is a [[bandwidth cache]], where the cache provides more bandwidth than the structure is is a cache for.
This is quite common - often latency and bandwidth caches are combined.
However, the use of caches to improve bandwidth sometimes leaves customers dispirited,
such as those on the [[High bandwidth Computing]] mailing list.

Another useful cache has many array read and write ports on the small cache, and fewer on the large structure. E.g. [[array port reduction]].
This is really a particular form of [[bandwidth cache]].

A cache is a fairly good [[read combining]] or [[write combining]] mechanism.

A [[power cache]] attempts to reduce power by keeping frequently accessed data items in the small structure.
this may be done even without an improvement in latency or bandwidth.

Finally(?), a reliability cache may keep certain items expected to have relaibility issues in the cache.
Particularly if reliability is related to frequency of access.

TBD: are there any other varieties of cache?

Array Port Reduction - CompArch

Array Port Reduction - CompArch: "Array Port Reduction"

AFAIK "RF Port Reduction" was invented by Dave Papworth on the Intel P6 circa 1991. At that time called "ROB Read Port Reduction". Based on observation that most uops' inputs were captured by the RS, while the ROB read ports were actually

* [[Register File Port Reduction]]
* [[Register Renamer Port Reduction]]

After the above, I note some generic trends:

Say that you have N operations, each with L inputs and K outputs. How do you avoid having to access
the array containing such inputs and outputs with N*L read ports and N*K write ports?

I observe that these techniques apply, not just to the RF and RR arrays mentioned above, bit also to other arrays:
* Generic Cache Arrays
** e.g. banking as a form of [[pseudo dual porting]]
* RF arrays
* [[Store Buffers]]
* [[Branch Predictor Arrays]], in particular [[BTB Arrays]]
* Nearly any large [[Queue or Buffer]], such as those in store and forward networks such as the memory hierarchy.
** See [[The Memory Hierarchy is a Store and Forward Network]]

Similar principles of array port reduction apply to all of these arrays.
Simplest is when the arrays are indexed solely and uniquely, e.g. by a memory address or register number.
One might say that these are the primary key.
More challenging is when the array has such a *spatial* index key,
but also has a *temporal* aspect.
E.g. "find the youngest store to the same address older than the present store".

Spatial/temporal may well be a useful distinction, although two dimensional X/Y may also apply.

= Bypassing =

The first principle is as mentioned in the examples of RF port reduction and renamer port reduction:
that group of N operations may not really need to do N*L reads and N*K writes.
Since we are often talking about instructions, some instructions may produce values that later instructions consume. I call this generically "bypassing": it may take the form of conventional execution unit bypassing, or dependency checks inside a block of instructions at the renamer. The P6 [[Capture RS]] that led to Papworth's original ROB Read Port Reduction is a form of [[bypass cache]] in some sense.

= Banking =

Classic spatial banking: banks in a pseudo-multiported cache. Banks in a large register file, such as a GPU.

Banking often seeks to do the opposite of [[spatial locality]] - you try to arrange the banking function, usually a bitfield extraction, occasionally a hash - so that locations likely to occur together are likely to occur in different banks, not the same.

Temporal banking - or otherwise exploiting temporal proximity. E.g. as described in [[Register Renamer Port Reduction]], where uops' inputs are compared to neighbouring older uops first, and then to more distant uops only if the recent predecessor comparison fails.

If these comparisons are done in batches with hard boundaries, it is a form of temporal banking.
If these comparisons are done smoothly - all neighbous compared, etc. - then it is a different way of exploiting temporal proximity, not banking.
Intermediate between these is to divide the temporally sorted array into banks,
and to always compare an element to two or three banks:
* 3 banks: the bank the element is in, the preceding and succeeding bank.
* 2 banks:
** the bank the unit is in
** the preceding bank if the element is close to the beginning of the bank
** the succeeding bank if the element is close to the end of the bank
** possibly no other bank, if the element is in the middle of its own bank.

    E.g. based on the two most significant bits of the sequence number
    00bb...b => compare to this bank plus preceding
    11bb...b => compare to this bank plus succeeding
    01bb...b OR 10bb...b => don't compare
    or 3 bits ...

    [[AG Novelty?]] - I've never seen this proposed before. It seems so obvious...

= Read and Write Combining =

TBD: combining circuits:

;(1) CAMs: compare all operations to all operations

;(2) Choose N oldest (or highest priority) unique. Then CAM for combining.

;(3) Spatial hash to queues.
Within the queues
* or just compare nearest neighbours

;(4) Sorting networking

;Sort by index.
Choose N lowest.
ISSUE: loss of age or priority info

;Sort by index, spatially
;Also sort by priority, e.g. time.
(Maybe different sorts, time and priority)

Spatial sort is of tuples (index,seqnum).
When matching indexes (index,seqnum1), (index,seqnum2) link seqnum2 to seqnum1,
and only keep (index,seqnum1) in the sorting datastructure.

Equivalent, keep redundant values in the temporal sort.

Compute dominators in the temporal sort. Extract the N highest priority dominatos in the temporal sort.

Sounds expensive. Less expensive time-wise if thinking about special sorting hardware.


= Caches =

Small highly ported array,
combined with a larger, less ported array.

May exploit temporal and/or spatial locality.

E.g. may have 1r/1w port on an array of large cache line of 64 bytes,
but have more, narrower, ports within a smaller array.

May use such a cache not to improve latency, but solely to manage ports.
Which are a special form of bandwidth cache. (See [[Varieties of Cache]].)

= TBD =

Q: what other forms of array port reduction have I encountered?

Register Renamer Port Reduction - CompArch

Register Renamer Port Reduction - CompArch: "Register Renamer Port Reduction"

On comp.arch circa August 2010, Mitch Alsup a little while back said something like "It isn't the ports on the register file that are a problem, it's the ports on the renamer." With, IIRC, a comment that you had to squeeze the renamer into a single pipestage.

Not true. This reminds me of a comment by Tim Olson, then of AMD (and active on comp.arch), when I presented HaRRM, my Hardware Register Renaming Mechanism (which is basically the modern form of renaming) to him. He said that it would lose because it required an extra pipestage.

First, renamer ports: the renamer is a much smaller structure than the PRF, in terms of number of entries times number of bits per entry. This means that the wires involved are shorter, even though the logic depth, Mitch's preferred metric in most of his posts, is similar. Wires are a major contributor to delay.

Second, "port reduction" techniques can be applied to the renamer just as they can be applied to the register file. True, out-of-order execution may allow the execution units to pick up a higher fraction of their inputs on bypasses, whereas the renamer is essentially an in-order pipestage, and in-order means less port reduction. Nevertheless, there's a lot to latch on to.

= Components of Register Renaming =

Register renaming has three main parts:

a) bypassing: in a block of instructions, comparing the output registers of older instructions to the input registers of younger instructions

b) lookup: if a register is live-in to a block, i.e. if it is not bypassed, then looking it up in the renamer/map table.

c) allocation: if a register is written within the block, giving it a new physical register (and updating the array); also, arranging for this newly allocated register to be used as inputs by younger instructions in the block

= Lookup =

Lookup is the part that most people think of as accessing a highly ported array. E.g. if you are renaming 4 instructions per cycle, with 2 inputs, you might be looking up 8 inputs per cycle, and writing 4 new outputs.

But observe that whenever there is bypassing you don't need lookups.

For example, consider the code sequence

ADD eax += M[ecx+edx]
tmp := load(ecx+edx)
eax := eax + tmp
CMP eax,ebx
JNE target

The above group of 4 uops only has 4 live-ins, not 8. Only 4 values need to be looked up in the RAT array. The remaining 3 input values (tmp, eax in the CMP, JNE) are produced by earlier uops in the same block.

True, P6 brute forced this: we renamed three uops per cycle, we always looked up 6, one for each register operand - and we threw away any lookup that was bypassed within the block. This avoided the need to first do the bypassing, and then

However, even before P6, when I was working on this form of register renaming at UIUC during my MSEE, I anticipated ways to avoid this. Of course, that was back when I was trying to create 16-wide superscalar machines, so reducing complexity was highly required.

== Non-Blocking Limited Port Register Renaming ==

IMHO the most elegant way is similar to what is also IMHO the most elegant way to do RF read port reduction:

1) do the bypass comparisons of uop output to uop input earlier, possibly as a separate pipestage.

2) out of this previous work, generate a list of live-ins that need to be looked up. Have a limited number of lookup ports. Of course use port combining, so multiple reads of the same register live-in only get looked up once - although possibly only after some time.

The question is, what happens when you run out of lookup ports? You could do an in-order stall. But, being me, I like to let later instructions that are not dependent on one of the blocked lookups get renamed and proceed to the scheduler. I.e. I want to do out of order renaming. Actually, I think the easiest way to do this is to let the guys blocked on a renamer lookup port proceed to the scheduler, but to have a separate write port that fills in the physical register input later. This is not actually out-of-order: all uops are renamed in-order, but some are immediately renamed to physical registers, while others are renamed to placeholders that later get filled in with physical register numbers. I.e. the renaming is in-order, but the renaming all the way to physical registers may be out of order. (In my most BS-full moods I have imagined that the renaming is actually done out of an OOO scheduler, since even the bypass comparisons might be done with a non-full network, as will be discussed next.)

= Caching Renamed Instructions =

And, oh yes: it is trivial to imagine caching the renames, the results of the bypass comparisons, in a decoded instruction cache / trace cache. That is one of the reasons I worked on trace cache, way back then. You get blocks of instructions - possibly fixed size blocks, possibly variable length traces - with N live-ins that need to be looked up, and M internally bypassed values that are given relative numbers, i.e. "use the K-th register that is allocated to this block."

= Bypassing Renames Inside Instruction Blocks =

Now, how about the bypassing? Apart from the
caching thereof, mentioned above.
Naively, that's O(N^2) comparators:
1st uop
- lookup all of its inputs (some combining)
- compare its output to all N-1 younger uop's inputs
2nd up
- if its inputs are not bypassed from uop #1
lookup all of its inputs (some combining)
- compare its output to all N-2 younger uop's inputs

Dana Henry and Brad Kuzsmaul's Ultrascalar showed that you can do this in O(lg N) work. But that's asymptotic, for really big machines.

== Per Logical Register Renaming Circuit ==

Here's another way to do bypassing, suitable for machines with small register sets:
* create a "carry chain" for each logical register
* 2 bits, states "not used", "written", "needs to be looked up"
* insert not used at the oldest instruction
* pull down whenever written by a uop, in the sense
older next
nu nu => nu
lu * => lu
w * => w

This is O(N) delay, O(N*L) hardware, where N= number of uops, L=number of logical registers. It suffices to tell you what registers are live-in, and need to be looked up.

If you want, you can make the values inserted be "written by uop #k",
and thereby establish the dataflow. Increases the hardware cost by a factor, i.e. multiplying, by *log2(N) bits. You take those sequence numbers, and lookup a queue of free registers, and/or add to a base (the latter if doing a P6-style circular allocation in a ROB).

(I've heard that some companies did freelists as CAMs. Enough so that they veered away from freelists in subsequent processors. Silly people.)

Of course, O(N*L) is not a win over O(N^2) unless L is small. Which it might have been, back when I imagined doing this for a 16-wide x86. But which is no longer, now that N seems to be stuck around 4, and L is 32 or more.

== Incomplete Comparisons ==

Note that you can also do an incomplete comparison: e.g. compare most uops to their immediate predecessor, which catches many dependencies. Queue up requests to do comparisons that are further apart on a smaller set of comparators.