Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, August 30, 2010

Memory Transaction Topology

* wiki: http://semipublic.comp-arch.net/wiki/Memory_Transaction_Topology
* blogged, comp.arch posted, hbc mailing list.

Let us discuss the simple shape or topology of memory transactions,
from the point of view of a simple interface, with a processor on one side,
and memory on the other.

When I have drawn such diagrams in the past, I have often used the vertical axis to indicate the passing of time.
Here, drawing in ascii, I will not bother to do so.

= Simple Memory Accesses =

;Simple read
* processor sends an address to memory
P ---A---> M
* sooner or later, memory responds with the read data
P <---D--- M ;Simple Write * processor sends address and data to memory P ---AD--> M
Acknowledgement may or may not be necessary; it may be [[store and forget]].

The store address and data may be multiplexed oin the same wires, asis roughly indicated above.
Or they may be in parallel.

* processor sends address and data to memory
P -+-A-+-> M
+-D-+ M

Or they might even be two separate transactions, joined at the memory

P ---A---> M
P ---D---> M

= Interleaved Memory Accesses =

;Interleaved read

Interleaved reads may be satisfied in the original order
P ---A1---> M
P ---A2---> M
P <---D1--- M P <---D2--- M Or out of order. If ooo, then some sort of transaction ID is needed: P ---A1---> M
P ---A2---> M
P <---D2--- M P <---D1--- M Simple writes with address and data available simultaneous do not provide opportunity for interleaving. P ---A1D1--> M
P ---A2D2--> M

However, separate address and data transfiers do:
P ---A1---> M
P ---A2---> M
P ---D1---> M
P ---D2---> M

or out of order:
P ---A1---> M
P ---A2---> M
P ---D2---> M
P ---D1---> M

We usually assume that the address is sent first, but this need not be the case.
P ---A1---> M
P ---D2---> M
P ---D1---> M
P ---A2---> M

Usually, however, interleaving or reordering of stores refers to with respect to the order of stores between different proocessors. TBD.

= Burst Accesses =

Memory transfers often occur in cache line sized bursts, frequently 4 or 8 clock cycles, sometimes more.
Interleaved reads may be satisfied in the original order
P ---A------> M
P <---DDDD--- M Stores may have address and data near to each other P ---ADDDD--> M

Decoupled burst store address and store data is also a possibility, but is not common.

= Multiple Transaction Sizes =
Thus we have severakl different transaction sizzes

;Address Only
P ---A------> M

; Burst Read Data Reply
P <---DDDD--- M ; Burst Store P ---ADDDD--> M

;Various non-burst or partial
in 8b, 16b, 32b, 64b, 128b, ... sizes
as well as possibly misaligned transfers
P <---D--- M = Read-Modify-Write = The following read/write transaction is usually associated with atomic read modify write operations. (e.g. the GUPS benchmark) P ---AopD---> M
op
store new data value
return old
P <---D------ M Note that this can also be done in combination with a cached value P $ ------------AopD---> M
op op
store store new data value
ignore (or check) return old
P $ <-----------D------- M == Modify Write == It is possible to have a modify write transaction, where an operation is performed at memory, but the old value is not returned. P ---AopD---> M
op
store new data value
X <-- M Aread modify without the write is even less useful; it really amounts to having an implicit operand that does not need to be transferred. = Masked Write, including Burst Masked Write == It is common to have a bitmask specifying which bytes should be written: P ---AmD---> M
The mask is so small that it is often part of the command.

Often a read-modify-write is necessary at the receiving end, e.g. for ECC.
But nevertheless, not across the interconnect.

Less commonly there is a masked read, e.g. for side-effectful memory.

There can also be burst masked writes, where the mask may approach and surpass a data chunk width.
P ---AmDDDD---> M

With [[Bitmask consistency]], burst masked writes become the norm.
Reads that indicate which bytes are an absolute requirement,
whether burst masked reads or otherwise, may also be useful.

= Scatter/Gather =

Rather than separate non burst reads
P ---A1---> M
P ---A2---> M
P <---D2--- M P <---D1--- M ... or writes, or RMWs We might take advantage of burst transfers, even though not necessarily burst memory (DRAM) accesses P ---A1.A2.A3.A4---> M
P <---D1.D2.D3.D4--- M P ---A1D1.A2D2.A3D3.A4D4---> M

P ---A1opD1.A2opD2.A3opD3.A4opD4---> M
P <----------D1.D2.D3.D4------------ M For the RMW casse we might constrain the same op to be used everywhere: P --op-A1D1.A2D2.A3D3.A4D4---> M
P <----------D1.D2.D3.D4------------ M = BLUESKY: Reducing the Transaction Types in a Scatter/Gather Memory Subsystem = ; read P ---A---> M
;gather read
P ---A1.A2.A3.A4.A5.A6---> M

; write
P ---op.A.mask.DDDD---> M
; scatter
P ---op.A1D1.A2D2.A3D3---> M

; reply
P <---id.DDDD--- M If A and D chunks are the same size, * the "normal" burst is 6 chunks long (A.mask.DDDD) * with 6 gather read addresses * or 3 scatter write or RMW address/data pairs We might even do away with partial reads (---A--->).
Possibly make all replies look like writes.
; reply
P <---op.A.mask.DDDD--- M

Further, we might compress the values transferred.
Compression does not help fixed size transactions with fixed size memory cache lines,
except for power.
But compression does help fixed size cache lines with variable length transfers.
And compression will allow more scatter/gather requests to fit in a single packet.

In general, we would assume that in a multistage interconnect s/g packeets will be broken up and routed s/g element by element.

Circuit switched, e.g. optical, works against this.