The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, January 28, 2012

[[Write coalescing]] is the term some GPUs, notably AMD/ATI and Nvidia, use to describe how they, umm, combine or coalesce writes from N different SIMD threads into a single access, or at least fewer than N accesses. There is also [[read coalescing]], and one can imagine other forms of coalescing, such as atomic fetch-and-op coalescing.

At AFDS11 I (Glew) asked an AMD/ATI GPU architect
"What is the difference between [[write coalescing]] and [[write combining]]?"

He replied that [[write combining]] was an x86 CPU feature that used a [[write combining buffer]],
whereas [[write coalescing]] was a GPU feature that performed the optimization between multiple writes that were occurring simultaneously, not in a buffer.


Since I (Glew) had a lot to do with x86 write combining
- arguably I invented it on P6, although I was inspired by a long line of work in this area,
most notably the [[NYU Ultracomputer]] [[fetch-and-op]] [[combining network]]
- I am not sure that this distinction is fundamental.

Or, rather, it _is_ useful to distinguish between buffer based implementations and implementations that look at simultaneous accesses.

However, in the original NYU terminology, [[combining]] referred to both:
operations received at the same time by a switch in the [[combining network]],
and operations received at a later time that match an operation buffered in the switch,
awaiting either to be forwarded on,
or a reply.
(I'm not sure which was in the Ultracomputer.)

A single P6 processor only did one store per cycle, so a buffer based implementation that performed [[write combining]] between stores
at different times was the only possibility. Or at least the most useful.
Combining stores from different processors was not done (at least, not inside the processor, and could not legally be done to all UC stores).

The NYU Ultracomputer performed this optimization in a switch for multiple processors,
so combining both simultaneous operations and operations performed at different times
was a possibility.

GPUs do many, many stores at the same time, in a [[data memory coherent]] manner.
This creates a great opportunity for optimizing simultaneous stores.
I would be surprised and disappointed to learn that
GPUs did not combine or coalesce
(a) stores from different cycles in the typically 4 cycle wavefront or warp, and
(b) stores from different SIMD engines, if they encounter each other on the way to memory.

I conclude therefore that the difference between [[write combining]] and [[write coalescing]] is really one of emphasis.
Indeed, this may be yet another example where my
(Glew's) predilection is to [[create new terms by using adjectives]],
e.g. [[write combining buffer]] or [[buffer-based write combining]]
versus [[simultaneous write combining]] (or the [[AFAIK]] hypothetical special case [[snoop based write combining]]),
rather than creating gratuitous new terminology,
such as [[write combining]] (implicitly restricted to buffer based)
versus [[write coalescing]] (simultaneous, + ...).

= See Also =

This discussion prompts me to create

* [[a vocabulary of terms for memory operation combining]]

Project structure: code, tests, external dependencies

Say you have a project foo, which I will call .../foo (by which I mean foo, but written with the full context .../ and .../foo/ to emphasize that there is probably a project directory somewhere in the namespace).
Where do you put the tests?  It's nice to put them in .../foo/tests, so that when you check out the project, you get the tests as well.

It's also good to minimize the external dependencies of .../foo.

But what if the tests have more external dependencies than the non-test part of the project?  Should you increase the external dependencies of the non-test part just to have the tests?

Conversely, should you make it harder to write the tests by forbidding external dependencies in them?  Tests are hard enough to write, and they often depend on extra libraries that the source code per se does not.

More specifically: if you check out .../foo and .../foo/tests comes along for the ride, should you pick up stuff with increased external dependencies?

If you don't want the extra dependencies in .../foo/tests to come along with .../foo, you might structure them as separate modules, possibly separate in the file space:

    .../foo+tests/foo
    .../foo+tests/tests

This works, but it creates extra levels of such "meta-modules": .../foo+tests, .../foo+interactive_tests, .../foo+debugging_tools+tests, etc.

You might structure it as a series of optional modules, all within a single metamodule (names illustrative):

    .../foo-meta/foo
    .../foo-meta/tests  -- optional

or, rather

    .../foo-meta/foo
    .../foo-meta/tests  -- optional
    .../foo-meta/stress-tests  -- optional

and so on.

It's still annoying to have an extra level of indirection, but the purity of the bodily fluids of .../foo is maintained.
Now, of course, it is highly likely that you will have some tests within foo and some that are associated with foo, but that foo won't let in.

This can be confusing, but apart from changing names it may be unavoidable: if a country has immigration control, you often get refugee camps on the border.  The country may ignore them, but the UNHCR does not.

Some systems allow this "stuff" to be overlays within foo:

    .../foo/tests -- optional
    .../foo/stress-tests  -- optional