Saturday, January 28, 2012

[[Write coalescing]] is the term some GPU vendors, notably AMD/ATI and Nvidia, use to describe how they, umm, combine or coalesce writes from N different SIMD threads into a single access, or at least into fewer than N accesses. There is also [[read coalescing]], and one can imagine other forms of coalescing, such as atomic fetch-and-op coalescing.

At AFDS11 I (Glew) asked an AMD/ATI GPU architect
"What is the difference between [[write coalescing]] and [[write combining]]?"

He replied that [[write combining]] was an x86 CPU feature that used a [[write combining buffer]],
whereas [[write coalescing]] was a GPU feature that performed the optimization between multiple writes that were occurring simultaneously, not in a buffer.
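
The "simultaneous" flavor described in that reply can be sketched in a few lines. This is only an illustrative model, not any vendor's actual hardware: the function name `coalesce_writes` and the 64-byte line size are assumptions made for the example. It takes the addresses that all lanes write in the same cycle and groups them by memory line, so N lane writes become at most N, and often far fewer, accesses.

```python
def coalesce_writes(lane_addresses, line_bytes=64):
    """Group simultaneous per-lane byte addresses by memory line.

    Hypothetical model: returns {line_number: [lanes writing to that line]}.
    Each key stands for one memory access after coalescing.
    line_bytes=64 is an assumed line size, not a documented hardware value.
    """
    accesses = {}
    for lane, addr in enumerate(lane_addresses):
        line = addr // line_bytes
        accesses.setdefault(line, []).append(lane)
    return accesses


# 8 lanes writing consecutive 4-byte words: all fall in one line.
unit_stride = [0x1000 + 4 * lane for lane in range(8)]

# 8 lanes writing with a 64-byte stride: every lane hits a different line.
scattered = [0x1000 + 64 * lane for lane in range(8)]

print(len(coalesce_writes(unit_stride)))  # number of accesses after coalescing
print(len(coalesce_writes(scattered)))
```

Note that nothing here is buffered across time: the grouping only sees the writes presented together in one step.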


Since I (Glew) had a lot to do with x86 write combining
- arguably I invented it on P6, although I was inspired by a long line of work in this area,
most notably the [[NYU Ultracomputer]] [[fetch-and-op]] [[combining network]]
- I am not sure that this distinction is fundamental.

Or, rather, it _is_ useful to distinguish between buffer based implementations and implementations that look at simultaneous accesses.

However, in the original NYU terminology, [[combining]] referred to both:
operations received at the same time by a switch in the [[combining network]],
and operations received at a later time that match an operation buffered in the switch,
where the buffered operation is waiting either to be forwarded on
or for a reply.
(I'm not sure which was in the Ultracomputer.)
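
For readers unfamiliar with fetch-and-op combining, the usual textbook description for fetch-and-add can be sketched as follows. This is a simplified model under stated assumptions, not the Ultracomputer's actual switch logic: the function name `combine_fetch_and_add` is hypothetical, memory is a plain dictionary, and the sketch does not care whether the requests arrived simultaneously or matched a buffered request. Several requests to the same address are merged into one memory operation carrying the sum of the increments, and each requester then receives the value it would have seen had the requests been performed one after another.

```python
def combine_fetch_and_add(memory, addr, increments):
    """Combine several fetch-and-add requests to one address.

    Hypothetical model: memory is a dict, increments is the list of
    addends from the individual requesters, in some serialization order.
    Only one read-modify-write reaches memory.
    """
    old = memory[addr]
    memory[addr] = old + sum(increments)  # the single combined access

    # De-combine the reply: requester i gets old plus the sum of the
    # increments serialized ahead of it.
    replies = []
    running = old
    for inc in increments:
        replies.append(running)
        running += inc
    return replies


memory = {0: 10}
replies = combine_fetch_and_add(memory, 0, [1, 2, 3])
print(replies, memory[0])
```

The replies are the same as if the three fetch-and-adds had executed serially, which is what makes the combining invisible to the requesters.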

A single P6 processor only did one store per cycle, so a buffer based implementation that performed [[write combining]] between stores
at different times was the only possibility. Or at least the most useful.
Combining stores from different processors was not done (at least, not inside the processor, and could not legally be done to all UC stores).
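
The buffer based form can be sketched as a toy model. The class below is an assumption-laden illustration, not the P6 implementation: it models a single buffer holding one line, the name `WriteCombiningBuffer` and its methods are made up for the example, and the 32-byte line size and the "flush when a store hits a different line" policy are simplifications. Stores arrive one at a time, as on a processor doing one store per cycle, and are merged in the buffer until the line is written out as one transaction.

```python
class WriteCombiningBuffer:
    """Toy single-line write combining buffer (hypothetical model)."""

    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes  # assumed line size for the sketch
        self.line = None              # line currently held, if any
        self.data = {}                # byte offset within line -> value
        self.flushed = []             # one entry per transaction sent out

    def store(self, addr, value):
        """Accept one byte store; merge it if it hits the buffered line."""
        line, offset = divmod(addr, self.line_bytes)
        if self.line is not None and line != self.line:
            self.flush()              # different line: evict the old one
        self.line = line
        self.data[offset] = value     # a later store to same byte overwrites

    def flush(self):
        """Write the buffered line out as a single transaction."""
        if self.line is not None:
            self.flushed.append((self.line, dict(self.data)))
        self.line = None
        self.data = {}


wcb = WriteCombiningBuffer()
for i in range(32):                   # 32 byte stores at different times
    wcb.store(0x2000 + i, i)
wcb.store(0x3000, 0xFF)               # a store to another line evicts the first
wcb.flush()
print(len(wcb.flushed))               # transactions actually sent out
```

In this model 33 separate stores leave the buffer as two transactions; the combining happens entirely across time, never between simultaneous stores.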

The NYU Ultracomputer performed this optimization in a switch for multiple processors,
so combining both simultaneous operations and operations performed at different times
was a possibility.

GPUs do many, many stores at the same time, in a [[data memory coherent]] manner.
This creates a great opportunity for optimizing simultaneous stores.
That said, I would be surprised and disappointed to learn that
GPUs did not also combine or coalesce
(a) stores from different cycles in the typically 4 cycle wavefront or warp, and
(b) stores from different SIMD engines, if they encounter each other on the way to memory.

I conclude therefore that the difference between [[write combining]] and [[write coalescing]] is really one of emphasis.
Indeed, this may be yet another example where my
(Glew's) predilection is to [[create new terms by using adjectives]],
e.g. [[write combining buffer]] or [[buffer-based write combining]]
versus [[simultaneous write combining]] (or the [[AFAIK]] hypothetical special case [[snoop based write combining]]),
rather than creating gratuitous new terminology,
such as [[write combining]] (implicitly restricted to buffer based)
versus [[write coalescing]] (simultaneous, + ...).

