The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, September 07, 2010

Array Port Reduction - CompArch

Array Port Reduction - CompArch: "Array Port Reduction"

AFAIK "RF Port Reduction" was invented by Dave Papworth on the Intel P6 circa 1991. At that time called "ROB Read Port Reduction". Based on observation that most uops' inputs were captured by the RS, while the ROB read ports were actually

* [[Register File Port Reduction]]
* [[Register Renamer Port Reduction]]

After the above, I note some generic trends:

Say that you have N operations, each with L inputs and K outputs. How do you avoid having to access
the array containing such inputs and outputs with N*L read ports and N*K write ports?

I observe that these techniques apply, not just to the RF and RR arrays mentioned above, bit also to other arrays:
* Generic Cache Arrays
** e.g. banking as a form of [[pseudo dual porting]]
* RF arrays
* [[Store Buffers]]
* [[Branch Predictor Arrays]], in particular [[BTB Arrays]]
* Nearly any large [[Queue or Buffer]], such as those in store and forward networks such as the memory hierarchy.
** See [[The Memory Hierarchy is a Store and Forward Network]]

Similar principles of array port reduction apply to all of these arrays.
Simplest is when the arrays are indexed solely and uniquely, e.g. by a memory address or register number.
One might say that these are the primary key.
More challenging is when the array has such a *spatial* index key,
but also has a *temporal* aspect.
E.g. "find the youngest store to the same address older than the present store".

Spatial/temporal may well be a useful distinction, although two dimensional X/Y may also apply.

= Bypassing =

The first principle is as mentioned in the examples of RF port reduction and renamer port reduction:
that group of N operations may not really need to do N*L reads and N*K writes.
Since we are often talking about instructions, some instructions may produce values that later instructions consume. I call this generically "bypassing": it may take the form of conventional execution unit bypassing, or dependency checks inside a block of instructions at the renamer. The P6 [[Capture RS]] that led to Papworth's original ROB Read Port Reduction is a form of [[bypass cache]] in some sense.

= Banking =

Classic spatial banking: banks in a pseudo-multiported cache. Banks in a large register file, such as a GPU.

Banking often seeks to do the opposite of [[spatial locality]] - you try to arrange the banking function, usually a bitfield extraction, occasionally a hash - so that locations likely to occur together are likely to occur in different banks, not the same.

Temporal banking - or otherwise exploiting temporal proximity. E.g. as described in [[Register Renamer Port Reduction]], where uops' inputs are compared to neighbouring older uops first, and then to more distant uops only if the recent predecessor comparison fails.

If these comparisons are done in batches with hard boundaries, it is a form of temporal banking.
If these comparisons are done smoothly - all neighbous compared, etc. - then it is a different way of exploiting temporal proximity, not banking.
Intermediate between these is to divide the temporally sorted array into banks,
and to always compare an element to two or three banks:
* 3 banks: the bank the element is in, the preceding and succeeding bank.
* 2 banks:
** the bank the unit is in
** the preceding bank if the element is close to the beginning of the bank
** the succeeding bank if the element is close to the end of the bank
** possibly no other bank, if the element is in the middle of its own bank.

    E.g. based on the two most significant bits of the sequence number
    00bb...b => compare to this bank plus preceding
    11bb...b => compare to this bank plus succeeding
    01bb...b OR 10bb...b => don't compare
    or 3 bits ...

    [[AG Novelty?]] - I've never seen this proposed before. It seems so obvious...

= Read and Write Combining =

TBD: combining circuits:

;(1) CAMs: compare all operations to all operations

;(2) Choose N oldest (or highest priority) unique. Then CAM for combining.

;(3) Spatial hash to queues.
Within the queues
* or just compare nearest neighbours

;(4) Sorting networking

;Sort by index.
Choose N lowest.
ISSUE: loss of age or priority info

;Sort by index, spatially
;Also sort by priority, e.g. time.
(Maybe different sorts, time and priority)

Spatial sort is of tuples (index,seqnum).
When matching indexes (index,seqnum1), (index,seqnum2) link seqnum2 to seqnum1,
and only keep (index,seqnum1) in the sorting datastructure.

Equivalent, keep redundant values in the temporal sort.

Compute dominators in the temporal sort. Extract the N highest priority dominatos in the temporal sort.

Sounds expensive. Less expensive time-wise if thinking about special sorting hardware.


= Caches =

Small highly ported array,
combined with a larger, less ported array.

May exploit temporal and/or spatial locality.

E.g. may have 1r/1w port on an array of large cache line of 64 bytes,
but have more, narrower, ports within a smaller array.

May use such a cache not to improve latency, but solely to manage ports.
Which are a special form of bandwidth cache. (See [[Varieties of Cache]].)

= TBD =

Q: what other forms of array port reduction have I encountered?

No comments: