The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, November 16, 2010


I keep plodding along:


wikied, USEnet comp.arch posted, andblogged

= Array Basics =

The classic array is a 2D crosspoint array of memory cells.

In an SRAM based array, the memory cells are cross-linked inverters (or a degenerate version thereof).
In a DRAM based array, the memory cells are capacitors.
In other memory technologies, the memory cells may be a floating gate (which is really a capacitor),
a bit of glass that can be melted, magnetic material, etc., etc.

>2D arrays are possible - but 2D is the basic.
(Actually, 1D memories have been created - e.g. delay lines, magnetic bubble racetrack memory, etc. - but that is not in the scope of this topic.)

Conventionally, one of the two dimensions, let's say X, horizontal, carries word lines, or decoded address lines.
The other, Y, vertical, carries bit lines or data lines.

    (I don't want to say just "bit lines", since multivalued logic is a possibility.)

One of the word lines is activated, and it selects all of the memory cells along that word, and causes them to connect to the bit lines. Either the value is read out, or a new value is written in.

    Usually only one word line at a time is activated. Although you could activate more than one, e.g. and OR or AND the several bits selected for read. Conversely, you could write a value to multiple words at once. However, this is uncommon; it often pushes the edges of circuitry; and some systems will, in fact, have logic to prevent this from ever happening.

= A Single Port =

Sometimes there is a single bitline; sometimes there is a pair of bitlines, typically Q and Q-bar.

Sometimes there is a single wordline, and another wire, often parallel to the wordline, that indicates read or write.

Sometimes there are separate wordlines for read and write.

== Minor Tweaks ==

Read/write is a "diffuse" signal: while there must be a word line per, well, word, for a single ported array we might be able to have a global signal that said "write". Such a global signal could be, well, diffuse - signaled in the ground plane, etc. Possibly by a light shining from the 3rd dimension. Or you could route the write enables vertically, or diagonally. Usually there are not good ideas, costong more than they save in area or power.

However, you can share read/write lines between pairs of wordlines. Or even larger sets of wordlines, at the cost of routing to get the signal to where it would need to go.

See [[byte write enables]] or [[subword write enables]]

= True Multiport Arrays =

The simplest possible array might have a single bitline, a single wordline, and a read/write signal.
Typically two horizontal wires, and 1 vertical.
This would implement a single port that can be used for both read and write.

The next step in complexity would be to have
* a read port, with a wordline and a bitline
* a write port, with a dedicated wordline and a bitline

I call these true ports because almost arbitrary words can be simultaneously accessed in the same array.

    Almost arbitrary... often there are constraints, such as * never writing to the same bitcells (words) simultaneously * never reading and writing the same bitcells simultaneously * sometimes never reading from more than M ports at a time. sometimes, the read limit M=1

In general, each true port costs 1 X and 1 Y wire.
This implies that true ports have an area cost of O(N2).

    At least. Many technologies require dual bitlines, or, two bitlines to write, 1 to read. Note the possibility of a hybrid port with three wordlines and two bitlines, using bit/bit-bar to write, and reading on each bitline separately.

Simple arrays with small numbers of ports may not be wire limited: they may be limited by the bitcells.
But, as the number of ports grows, eventually the O(N2) effect kicks in.

    == Enumerate Ports when designing Arrays == Because of the possibilities of odd tradeoffs, such as combined read;/write ports, or 1 write 2 read ports, it is desirable to enumerate all of the ports you need in your [[abstract array design]], and also to think about which ports may not need to be simultaneously active.

= Array Replication for Read Multiporting =

One way to mitigate the O(N2) effect of ports is to replicate the array.
E.g. a 2-read port, single write port array
may be implemented as two 1R 1W arrays,
with separate read ports,
but the write ports tied together.

Similarly, 2R-1W has area 9c, whereas 2x1R-1R has area 2(4c) = 8c.
The effect gets bigger as the number of read ports increases.

However, there is often a cost to distributing the wires to the arrays.
Simplistically, if all flows along a single datapath, there is O(n2) cost to "turning" the wires to go to the separate array.
However2, circumstances may vary: it may not be necessary to turn all the wires back into the same datapath, etc.

    ;Bluesky (at the time of writing, 2010): If you can work with the third dimension, the costs of such array replication to get extra read ports is lowered: whereas in 2D routing to a separate array costs O(N62), in 3D routing to a separate array placed on a layer above the first is O(n).

= Time Multiplexing =

Another way to get multiple ports is by time multiplexing.

You might, for example, "double pump" an array to get effectively twice the ports.
This can even be done with wave pipelining.

Similarly, you might know by construction that reads can only occur, in 2 out of every 3 clock cycles, while writes may occur only every third clock cycle.

If you cannot make such guarantees, it is necessary to have a mechanism for handling conflicts.
If you can make average rate guarantees but not instantaneous rate guarantees,
a few buffers may suffice.

= Banks =

Banks are a form of spatial multiplexing.
Similar to array replication, but without tying together the write ports.

Address bits are used to assign requests to particular banks.
This weighs heavily in favor of power of two sized banks,
although [[non-power-of-two-sized indexing]] is possible.
especially for arrays relatively far out in the memory subsystem, such as DRAM (e.g. Matrox had 17 way interleaved memory)
or outer level caches.

Bank selection bits may be chosen so as to interleave - successive words in the logical arrray may be assigned to different banks. Or they may be coarse grained - e.g. the lowest half of the address space is assigned to thefirst bank, tyhe upper half to a second bank.

The two may be intermixed - interleaving, and coarse grain banking.
Indeed, there may be multiple levels of interleaving and coarse grain banking.

Before we leave this topic, I want to mention that cache lines may possibly be subject to both
intra-cacheline-banking, and inter-cacheline-banking.
E.g. a 64B (512b) cache line may be divided into 8B (64b) banks within the cacheline,
while the cachelines are interleaved between coarser scalee banks.

= Terminology: Banks, Ranks, etc. =

Terminology may vary.

I (Andy Glew) tend to follow my standard practice of using [[standard terms, with distinguishing adjectives]].
E.g. I say any division of an array is a bank, and the process of distributing addresses across the banks is an interleaving pattern.

Other people invent different terms:
e.g. banks within a DDR DRAM;
DRAM chips mounted on DIMMS, operated as separate ranks;
memory may be interleaved across different DRAM channels.

= Bank Conflicts and Other Port Conflicts =

Whenever there are not enough ports to meet all needs the possibility of conflicting accesses arises.

More terminology: I sometimes call banking pseudo-multiporting.

With power-of-2 bitfield extraction used to create the bank selection and indexing functions, there often arise "resonances" on common power of two boundaries. Such as 64KiB, 1MiB, etc.

Simple hashing, XORing a few upper bits into the indexing and selection functions, sometimes helps.
Non-power-of-two indexing can also help.

Mechanisms to handle bank conflicts can lead to great complexity.

The simplest case would be to stall one of the requests.
Simple - unless you are working on a statically scheduled machine without pipeline interlocks.

Simple - but of course immediately stalling is impossible in many high speed designs.

Replay or recycling the request is another way of handling bank conflicts:
a fixed pipeline's worth of requests drains out,
and is routed back to the pipeline or arrays inputs to be retried.
This is a good workable idea - but it can cause positive feedback leading to unstable behavior.
See [[Replay Pipelines]].

I don't know of any other alternatives to stall or replay - although there are many flavors of each.
However, the microarchitecture can arrange things so as to mitigate the cost of bank conflicts,
(or other port conflicts):

E.g. P6 had a pseudo-dual-ported, banked, L1 cache array, with 1 read and 1 write port. The read port always took priority, since handling such conflicts on a read is more complex because dependent operations may be scheduled.
Whereas write port conflicts, while affecting bandwidth, do not have dependejnt operations.

Similarly, what I call a [[2 and a half ported array]] is another nice design point:
true dual ports used for two read ports,
with a write port using pseudo-multiport banking.
Bank conflicts on the dual ported reads are avoided, while on the writes bank conflicts can sometimes be tolerated,
or scheduled around.

= Related =
* [[Bank scheduling]]
* [[Read and Write Combining]]