The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, June 19, 2011

A vocabulary of terms for memory operation combining

There are many different terms for the common optimization of memory operation combining or coalescing.
See the discussions of the [[difference between write combining and write coalescing]],
as well as [[write combining]] and [[write coalescing]],
and [[NYU Ultracoomputer]] [[fetch-and-op]] [[combining networks]].

Here is a by no means complete list of such terms:

* [[combining]]
** [[write combining]]
*** [[write combining buffer]]s
*** [[write combining cache]]
** [[read combining]]
** [[fetch-and-op combining]]
*** [[NYU Ultracomputer]] [[fetch-and-op]] [[combining networks]]

* [[coalescing]]
:: on GPUs...
** [[write coalescing]]
** [[read coalescing]]

I have not heard anyone talk about fetch-and-op or atomic RMW coalescing
as distinct from the historic
[[NYU Ultracomputer]] [[fetch-and-op combining]].
But I suppose that inevitably this will arise.

* [[squashing]]
:: mainly refers to finding that an operation is unnecessary, and cancelling it - or at least the unnecessary part
** [[load squashing]]
::: On P6, referred to a cache miss finding that there was already a cache miss in flight for the same cache line. No need to issue a new bus request, but the squashed request was not completely cancelled - arrangement was made so that data return from the squashing cache miss would
** [[store squashing]]
::: I am not aware of this being done, but it could refer to comparing store addresses in a [[store buffer]], and determining that an older store is unnecessary since a later store completely overwrites it (and there would be no intervening operations that make it visible according to the [[memory ordering model]]). (Actually, I am not sure that there could be such, but I am leaving the possibility as a placeholder.)
::: [[Write combining]] accomplishes much the same thing, although [[store squashing]] in the store buffer gets it done earlier.
::: Note that this is similar to [[store buffer combining]] - combining entries in the store buffer, separate from a [[write combining buffer]].

Again, I have not heard of fetch-and-op squashing, although I am sure that it could be done, e.g. for lossy operations such as AND and OR (a later fetch-and-OR with a bitmask that completely includes an earlier...).

* [[snarfing]]
:: I have usually seen this term used in combination with a [[snoopy bus]], although it can also be used with other interconnects. [[Read snarfing]] means that a pending request from P1 that has not yet received the bus in arbitration observes the data for the same cachelines from a different processor P0, and "snarfs" the data as it goes by.
Depending on the cache protocol, it may be necessary to assert a signal to put data into [[shared (S) state]] rather than [[exclusive (E) state]].

I am not sure what [[write snarfing]] would look like.
An [[update cache]] is somewhat like, but it updates an existing cache line, not an existing request.
I.e. an update cache protocol snarfs write data from the bus or interconnect to update a cache line.
Whereas a pending load can snarf data either from read replies or write data transactions on the bus or interconnect.
I.e. a read can snarf from a read or a write.
But does a write "snarf"?
More like a write may combine with another write -
[[snoopy bus based write combining]],
as distinct from [[buffer based write combining]].

Difference between write combining and write coalescing

[[Write coalescing]] is the term some GPUs, notably AMD/ATI and Nvidia, use to describe how they, umm, combine or coalesce writes from different N different SIMD threads into a single, or at least fewer than N, accesses. There is also [[read coalescing]], and one can imagine other forms of coalescing, such as atomic fetch-and-op coalescing.

At AFDS11 I (Glew) asked an AMD/ATI GPU architect
"What is the difference between [[write coalescing]] and [[write combining]]?"

He replied that [[write combining]] was an x86 CPU feature that used a [[write combining buffer]],
whereas [[write coalescing]] was a GPU feature that performed the optimization between multiple writes that were occurring simultaneously, not in a buffer.


Since I (Glew) had a lot to do with x86 write combining
- arguably I invented it on P6, although I was inspired by a long line of work in this area,
most notably the [[NYU Ultracomputer]] [[fetch-and-op]] [[combining network]]
- I am not sure that this distinction is fundamental.

Or, rather, it _is_ useful to distinguish between buffer based implementations and implementations that look at simultaneous accesses.

However, in the original NYU terminology, [[combining]] referred to both:
operations received at the same time by a switch in the [[combining network]],
and operations received at a later time that match an operation buffered in the switch,
awaiting either to be forwarded on,
or a reply.
(I'm not sure which was in the Ultracomputer.)

A single P6 processor only did one store per cycle, so a buffer based implementation that performed [[write combining]] between stores
at different times was the only possibility. Or at least the most useful.
Combining stores from different processors was not done (at least, not inside the processor, and could not legally be done to all UC stores).

The NYU Ultracomputer performed this optimization in a switch for multiple processors,
so combining both simultaneous operations and operations performed at different times
was a possibility.

GPUs do many, many, stores at the same time, in a [[data memory coherent]] manner.
This creates a great opportunity for optimizing simultaneous stores.
Although I would be surprised and disappointed to learn that
GPUs did not combine or coalesce
(a) stores from different cycles in the typically 4 cycle wavefront or warp,
(b) stores from different SIMD engines, if they encounter each other on the way to memory.

I conclude therefore that the difference between [[write combining]] and [[write coalescing]] is really one of emphasis.
Indeed, this may be yet another example where my
(Glew's) predilection is to [[create new terms by using adjectives]],
e.g. [[write combining buffer]] or [[buffer-based write combining]]
versus [[simultaneous write combining]] (or the [[AFAIK]] hypiothetical special case [[snoop based write combining]]),
rather than creating gratuitous new terminology,
such as [[write combining]] (implicitly restricted to buffer based)
versus [[write coalescing]] (simultaneous, + ...).

= See Also =

This discussion prompts me to create

* [[a vocabulary of terms for memory operation combining]]