The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, June 15, 2011

Write combining


Many processors have [[write combining]] support. The [[WC buffers]] and [[USWC memory type]] that I (Andy Glew) added to the Intel P6 are by no means the first, or the last, such feature, although they have arguably been pretty successful.

= The basic idea of write combining =

Say that you have a memory bus of width N, e.g. N=64b.

But say that software, for its own nefarious reasons, is writing only in size M, M < N. E.g. say that software is writing a byte at a time.

Simplistically, each such byte write would be wasting 7/8 of the 64b bus.

The basic idea of a write combining buffer is to have a buffer at least N bits wide, with subset validity bits
- e.g. 64 bits, with 1 bit per byte indicating that a byte has been written.

With the valid bits initially empty, as you write to the WC buffer you place data in the corresponding byte, and set the corresponding byte valid bit.

At some later point in time, you may evict the WC buffer. If all of the bytes have been written, you use a single efficient N=64b write. If not, you do some sort of [[partial eviction]].

If the silly software completely overwrites the WC buffer, you have used 8/8 of the bus bandwidth, rather than only 1/8.
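
A minimal sketch of the idea above, as a toy software model rather than a description of any real hardware. The class name `WCBuffer`, its methods, and the choice of one partial write per valid byte on a partial eviction are all assumptions made for illustration.

```python
class WCBuffer:
    """Toy model of one write combining buffer with per-byte valid bits."""

    def __init__(self, width_bytes=8):
        self.width = width_bytes
        self.data = bytearray(width_bytes)
        self.valid = [False] * width_bytes  # valid bits start out empty

    def write(self, offset, payload):
        """Place the payload bytes in the buffer and mark them valid."""
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def evict(self):
        """Return the bus transactions needed to drain the buffer, then clear it."""
        if all(self.valid):
            txns = [("full", 0, bytes(self.data))]
        else:
            # Simplest possible partial eviction: one write per valid byte.
            txns = [("partial", i, bytes(self.data[i:i + 1]))
                    for i in range(self.width) if self.valid[i]]
        self.valid = [False] * self.width
        return txns


buf = WCBuffer()
for i in range(8):
    buf.write(i, bytes([0x10 + i]))  # software writes a byte at a time
full = buf.evict()                   # every byte valid: one full-width write

buf.write(2, b"\xaa")
buf.write(3, b"\xbb")
partial = buf.evict()                # only two bytes valid: partial eviction
```

Here `full` holds a single full-width transaction, while `partial` holds two single-byte transactions, which is the case where write combining has bought nothing.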

== A more realistic example ==

There may be several WC buffers,
each a full cache line (e.g. 64B - Bytes, not bits) in length.

The [[bus]] is optimized for 8-chunk bursts.
The bus may permit full utilization for 64B burst transfers, but smaller transfers are less efficient
- and, in fact, in some systems occupy exactly the same number of cycles.
Let's say 8 cycles - 8x8B.

The processor may be clocked faster than the bus. E.g. 8GHz, versus a 1GHz memory bus.

The processor does 64b stores to memory that is uncached, but is not memory mapped I/O.
Each creates a [[bus write partial line]] command,
occupying 8 cycles on the bus in our example.

If, instead, we use a write combining buffer,
the first 64b/8B store may allocate the [[WC buffer]],
storing data in it
and setting the byte valid bits for the 8 bytes it writes.

The second 64b/8B store may hit the WC buffer, and set the byte valid bits for the 8 bytes it writes.

And so on.

When the [[write combining buffer]] is full, and when it is eventually evicted,
then a [[bus write full line]] transaction would be done.
In our contrived example, this occupies 8 cycles - the same as each one of the bus write partial line transactions that the stores would otherwise have needed separately.
Bottom line: 8X speedup.
1/8 the bus occupancy - freeing up the bus for other use.

The point of this exercise was to show that write combining does not need to work with tiny 64b buffers.
Full cache line buffers are possible, and, in fact, are what the Intel P6 and most subsequent x86 machines have.
Also, clocking the processor faster than the bus may motivate WC.
Also, it's not just a question of silly software doing 8b writes: even software doing 64b writes may benefit.

Real world example: on the Intel P6 family, a 64B full cache line transfer ([[BWL (Bus Write Line)]]) occupied circa 5 cycles.
A [[BWP (Bus Write Partial)]] transaction occupied 3 cycles.
So the speedup is not quite so extreme as in the contrived example, but is still significant.
Even if the partial writes had been optimized, integer code could only do 32 bit stores, and the bus was 64b wide.
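
The arithmetic behind these two examples can be sketched as follows. The function `bus_cycles` is a hypothetical helper, and it assumes that every store becomes its own partial-write transaction and that transactions do not overlap on the bus. The P6 figures are the approximate cycle counts quoted above.

```python
def bus_cycles(line_bytes, store_bytes, partial_cycles, full_line_cycles):
    """Bus cycles to write one full line, without and with write combining."""
    stores_per_line = line_bytes // store_bytes
    without_wc = stores_per_line * partial_cycles  # one partial write per store
    with_wc = full_line_cycles                     # one full line write
    return without_wc, with_wc


# Contrived example: 8B stores, partial and full line writes both take 8 cycles.
contrived = bus_cycles(64, 8, partial_cycles=8, full_line_cycles=8)

# P6-like figures: 8B stores, BWP ~3 cycles, BWL ~5 cycles.
p6_64b = bus_cycles(64, 8, partial_cycles=3, full_line_cycles=5)

for name, (without_wc, with_wc) in [("contrived", contrived), ("P6, 64b stores", p6_64b)]:
    print(f"{name}: {without_wc} vs {with_wc} bus cycles, {without_wc / with_wc:.1f}x")
```

With these inputs the contrived case is 64 cycles versus 8, and the P6-like case is 24 cycles versus 5.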

== Write Combining versus Wide Instructions ==

Some may consider that write combining is a poor substitute for having wider instructions.
I.e. instead of building a complicated write combining mechanism, why not create full cache line wide instructions such as
* Load/Store 64B (512b) between vector registers and memory
* [[Load/store-multiple-registers instruction]]

Now, I am nearly always an advocate of explicit software control, even here.
I am quite in favor of vector instructions,
and less so, but still positive, on load/store-multiple-registers.

But history shows: Intel x86 had only 32b integer registers
and worked quite successfully with write combining for more than 5 years before [[x86-64]], 1996-2002;
and even at the time I am writing this does not have 512b/64B full cache line registers.
(Proposed in LRBNI, but not yet implemented.)
I.e. history shows that you can be successful for quite a long time without explicit instruction set support.

Explicit instruction set support may have advantages in other ways.
But it also has costs, e.g. context switching the vector registers.

== Write Combining versus Writeback, DMA, ... ==

Many other arguments have been made against write combining.

E.g. Why not just use [[writeback (WB)]] memory everywhere?
A: although there is more and more cache coherent I/O, at the time I am writing this uncached memory for framebuffers, etc.,
is still very common.

E.g. Why not use DMA engines? Or smart I/O devices such as GPUs?
A: not a complete answer, but basically these do not always exist.
Moreover, I am writing this at a GPU conference,
where the designers of AMD's Llano APU (CPU/GPU hybrid) memory subsystem
expressed, in conversation, the desire for better uncacheable / write combining performance.

Bottom line: the need for write combining is not going away.

= WC policies =

== P6 USWC ==

Intel P6 family USWC memory manages write-combining buffers
in a weakly ordered manner.
E.g. you can write A0, B0, A1, B1, etc. to two different cache lines,
and two different WC buffers will be allocated.
No account is taken, for USWC memory, of write ordering.
E.g. line B may be evicted first, so that B1 is observed before A0.
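
A toy model of this weak ordering, as a sketch only. `USWCModel`, its fields, and the addresses `A` and `B` are hypothetical. The eviction order is chosen explicitly by the caller to stand in for whatever the hardware happens to do.

```python
LINE_BYTES = 64


class USWCModel:
    """Toy model: one WC buffer per cache line, evictable in any order."""

    def __init__(self):
        self.buffers = {}  # line address -> {byte offset: value}
        self.bus_log = []  # (line address, sorted offsets), in eviction order

    def store(self, addr, value):
        line, offset = addr - addr % LINE_BYTES, addr % LINE_BYTES
        self.buffers.setdefault(line, {})[offset] = value

    def evict(self, line):
        contents = self.buffers.pop(line)
        self.bus_log.append((line, sorted(contents)))


A, B = 0x1000, 0x2000
m = USWCModel()
for addr in (A + 0, B + 0, A + 1, B + 1):  # program order: A0, B0, A1, B1
    m.store(addr, 0xFF)

m.evict(B)  # the hardware happens to pick line B first
m.evict(A)
```

After this sequence `m.bus_log` lists line B, carrying both B0 and B1, ahead of line A, so an observer sees B1 before A0 even though A0 was written first.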

Eviction is almost random - from the point of view of a programmer.
Certain events are defined to cause evictions, including:
* interrupts
* I/O instructions
* uncacheable memory accesses (to [[UC]] memory) (which are possibly memory mapped I/O)
* possibly certain fence and flush operations (although x86 fence instructions were added later).

The motivation for this USWC WC buffer eviction policy
was mainly to try to make it look transparent when coordinating with a GPU,
sending commands via MMIO or I/O.

Note: there was no guarantee of timeliness, no guarantee that USWC writes would eventually be flushed out,
e.g. to the framebuffer. Except that most systems at the time had regular timer interrupts.
Now, circa 2011, [[tickless OS]]es are common.
It might be advisable to have hardware periodically flush USWC.

Carl Amdahl observed that USWC was a cache - a small, non-coherent, cache - not a buffer.

== [[Left to right write combining]] ==

An alternative to P6 USWC's WC policies
was [[left to right write combining]].

This would typically only allocate a WC buffer for a write to byte 0 of a line.
(Although starting in the middle can be imagined.)

It would perform write combining so long as the write was adjacent to, and to the right, of bytes already written in the WC buffer.

If a write was performed that was not adjacent and to the right, the WC buffer would be evicted immediately.
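
The policy just described can be sketched as a small model. `LeftToRightWC` is a hypothetical name. The sketch assumes a single buffer, stores that never cross a line boundary, and that a write which does not start at byte 0 of a line goes to the bus uncombined.

```python
class LeftToRightWC:
    """Toy model of a single left to right write combining buffer."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.line = None          # line address of the open buffer, or None
        self.next = 0             # next byte offset that may be combined
        self.data = bytearray()
        self.bus_log = []         # (address, bytes) transactions, in bus order

    def _evict(self):
        if self.line is not None and self.data:
            self.bus_log.append((self.line, bytes(self.data)))
        self.line, self.next, self.data = None, 0, bytearray()

    def store(self, addr, payload):
        line, offset = addr - addr % self.line_bytes, addr % self.line_bytes
        if not (self.line == line and offset == self.next):
            self._evict()  # not adjacent and to the right: evict immediately
            if offset != 0:
                # Assumption: only a write to byte 0 allocates a buffer.
                self.bus_log.append((addr, bytes(payload)))
                return
            self.line = line
        self.data += payload
        self.next = offset + len(payload)
        if self.next == self.line_bytes:
            self._evict()  # buffer is full


wc = LeftToRightWC()
wc.store(0x1000, b"\x01\x02\x03\x04")  # allocates at byte 0
wc.store(0x1004, b"\x05\x06\x07\x08")  # adjacent, to the right: combines
wc.store(0x1010, b"\x09")              # gap: evict, then write alone
```

The bus sees one combined 8-byte write to 0x1000 followed by the single-byte write to 0x1010, in the same order the program issued them.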

[[Left to right write combining]] has the putative advantage that it works with some forms of memory mapped I/O - devices for which the registers are designed so that they are written from low address to high. Typically, parameters, with the last location written triggering a side effect.
This is actually a very attractive feature, since it allows efficient MMIO for all devices designed in this way.
Unfortunately, it is not compatible with all MMIO devices - some MMIO devices actually look at the size of the bus transaction, interpreting that as an aspect of the MMIO command. (You might consider this stupid, but... )

[[Left to right write combining]] also has the advantage that it preserves write ordering, so long as evictions are done left to right.

In an ideal world, there would be two write-combining memory types: left to right, and USWC.
Unfortunately, in P6's [[feature diets]] we had to choose only one, and USWC was it.
USWC is better for certain types of framebuffer workloads.
[[Left to right write combining]] is better for well behaved MMIO.

== WC for WB ==

The previous two topics, USWC and [[Left to right write combining]],
deal with write combining with what is fundamentally uncached memory.

Write combining can also be used for [[WB (Writeback)]] memory.

If weakly ordered, this is straightforward.
Similarly, it is straightforward if used to optimize writethrough traffic,
e.g. from a WT L1$ to a WB L2$.
So long as exclusive ownership has already been obtained.

However, write combining for a store-ordered memory system like [[TSO]] or [[processor consistency]]
is more of a challenge.

TBD: seek the patents on this. Public info.

= Eviction Mechanism =
At some point it is necessary to evict a write combining buffer.

If the buffer is completely written, then evicting it,
on USWC memory, is straightforward:
* use a [[BWL (Burst Write Line)]] bus transaction that writes the entire line
:: (Caveat: some systems require obtaining ownership before writing like this. But not P6.)

If the buffer is not completely written, you can use any of the following
* read the missing bytes using a [[BRL (Burst Read Line)]], merge, and then write using [[BWL]].
** works but potentially uses MORE memory traffic than not write combining
* using the most efficient sequence of [[partial writes]] possible
** e.g. using 64b [[write bytes under mask bus transaction]]s - eliding empty chunks
** some systems do not have [[write bytes under mask bus transaction]], but only support 8b, 16b, 32b, etc. partial writes. A state machine can emit a sequence of these
*** note: this probably violates any pretense of the original writes being atomic, particularly if not aligned
* ideally, use an efficient [[BWLM (Burst Write Line under Mask)]] bus transaction.
** this might consist of 4 or 8 data chunks, along with a data chunk that contains a 64 bit mask for the operation.
*** Issue: is the mask attached to address or data? Possibly both. Possibly there is no distinction.
*** Aside: in [[BS: Bitmask Coherency for Writeback Caches]] I discuss the possibility of having two bitmasks, dirty and clean, on such a bus transaction. Possibly with a read version.
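
Two of the partial eviction options above can be sketched in software. Both functions are hypothetical illustrations that take the per-byte valid bits of a line as a list of booleans. The first emits one write-under-mask per 8-byte chunk and elides empty chunks. The second assumes the bus only supports naturally aligned 1, 2, 4 and 8 byte writes and greedily picks the largest one that fits.

```python
def masked_chunk_writes(valid, chunk=8):
    """One write-under-mask per chunk holding any valid byte; empty chunks are elided."""
    writes = []
    for base in range(0, len(valid), chunk):
        mask = valid[base:base + chunk]
        if any(mask):
            writes.append((base, tuple(mask)))
    return writes


def aligned_partial_writes(valid, sizes=(8, 4, 2, 1)):
    """Cover exactly the valid bytes with naturally aligned power-of-two writes."""
    writes, i = [], 0
    while i < len(valid):
        if not valid[i]:
            i += 1
            continue
        for size in sizes:  # try the largest write first; size 1 always fits
            if i % size == 0 and i + size <= len(valid) and all(valid[i:i + size]):
                writes.append((i, size))
                i += size
                break
    return writes


valid = [False] * 64
for i in list(range(3, 13)) + [40]:  # bytes 3..12 and byte 40 were written
    valid[i] = True

chunks = masked_chunk_writes(valid)     # chunks starting at bytes 0, 8 and 40
pieces = aligned_partial_writes(valid)  # (offset, size) pairs
```

For this mask the masked version needs 3 transactions and the aligned version needs 5, which also shows how a single original store can be split into several bus writes.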

= TBD - fatigue =

I'm too tired to finish this. Let me just list some topics:

* [[WC buffer as a store buffer extension]] - constraining eviction order if you want to maintain processor consistency.

* The [[difference between write combining and write coalescing]]

= Writethrough with write allocate versus no write allocate =


Write-through caches can be write allocate or no-write allocate.

A no-write allocate write-through cache can only put data in the cache on a read.
A write that misses is written through, to the next cache level or memory, but not stored in the cache.

A write-through with write-allocate cache, if it does not read clean data,
must necessarily have valid bits for bytes or words within the cache line.
Unless the writes can only be cache line sized.
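
A minimal sketch of why the valid bits are needed, assuming a single cache line and a next level modelled as a plain `bytearray`. The class `WTLine` and its methods are hypothetical.

```python
class WTLine:
    """One line of a write-through, write-allocate cache that never reads clean data."""

    def __init__(self, line_bytes=64):
        self.data = bytearray(line_bytes)
        self.valid = [False] * line_bytes  # per-byte valid bits

    def write(self, offset, payload, next_level):
        end = offset + len(payload)
        # Write allocate: keep the written bytes locally and mark them valid ...
        self.data[offset:end] = payload
        for i in range(offset, end):
            self.valid[i] = True
        # ... and write through to the next cache level or memory.
        next_level[offset:end] = payload

    def read(self, offset, size):
        """Return the bytes on a hit, or None if any requested byte is not valid."""
        if all(self.valid[offset:offset + size]):
            return bytes(self.data[offset:offset + size])
        return None


memory = bytearray(64)
line = WTLine()
line.write(8, b"\x11\x22\x33\x44", memory)

hit = line.read(8, 4)   # all four bytes were written locally
miss = line.read(8, 8)  # bytes 12..15 were never written or fetched
```

Without the valid bits the second read would return whatever happened to be in the line's unwritten bytes.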

Actually, we can imagine a write-through with write allocate cache, that reads clean data,
and thereby avoids the need to have byte valid bits.
I was about to say that this somewhat misses the point...
and it certainly would on "write and forget" systems.
But, some systems must obtain ownership even on write-through caches.
E.g. the IBM z-series (descendants of the System 360)
must ensure that all other copies of a cache line are invalidated before
"performing" the write.
If you have to do that, you might almost as well obtain the clean data,
and not have to maintain byte dirty bits.

* P6 family: the write-through invalidates other caches, but the data can be read from the local cache before the remote invalidations. Do not need to get a reply from the invalidation.
* IBM family: must invalidate before forwarding locally; i.e. must wait for the invalidation to be complete.