The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, June 29, 2011

Notions of Time in Computer Architecture


Advanced computer architecture admits several different notions of time.

= Real World Time =

First, there is [[real time]] or [[wall clock time]]. Time in the real world (forgetting Einsteinian relativistic issues for the moment).

Unfortunately, the term [[real time]], which would be the best term for this, is often overloaded with the additional meaning
of pertaining to a system that must respond in real time - e.g. "Drop the rods in the nuclear power plant quickly or the nuclear fuel will melt down."
Or "adjust the aileron pitch with a few milliseconds, or the plane will stall and crash."
These applications are often called [[hard real time]], and there are also [[soft real time]] applications that have less stringent timing requirements.

Because of the use of real time in phrases such as [[real time operating system]] or [[real time response]], terms such as [[wall clock time]] may be used.

== Absolute versus Relative ==

Real time (and, indeed, other times) may be absolute or relative.

Absolute time is wall clock time, or calendar time.

Relative time is the amount of elapsed time since some other event. E.g. one might start a timer on entering a function, and read it on exit.

Obviously, you can calculate relative timer from absolute time samples.
But, it can be considerably easier to build the hardware to measure relative time that it is to measure ab solute time.
Timers used in relative timing may be in isolation, whereas clocks measuring absolute time may need to be kept in synchronization
with other clocks in other systems.

== [[Timers versus Timestamps]] ==

Oftentimes appplications do not need actual time.
Instead, what they really need is timestamps that indicate relative position.
The application may care who came first or second, but not how much time elapsed between arrivals.

Therefore, some timer architectures provide absolute time in uppper bits,
but simply indicate a unique counter in the lower bits of a timestamp.

= Von Neuman Time =

In out-of-order and speculative processors,
even when single threaded, it is often convenient to talk about the
[[Von Neumann sequence number]] or [[Von Neumann time]].

On a single threaded program, this is the sequence number at which every and any [[dynamic instruction sequence]] executes.

The [[Von Neumann time]] or sequence number for an instruction is NOT the [[wall clock time]]
at which an instruction executes.
Even on simple in-order processors, where instructions may have differing [[instruction latency]],
the VN time is not linearly related to the wall clock time.
On complex out-of-order processors instructions that are later in the Von Neumman instruction sequence
may execute earlier in wall clock time (that's what makes them out-of-order).

The [[Von Neumann time]] or sequence number may be a convenient book-keeping tool.
* it is often made available in a [[simulator]] user interface
* I have proposed using it it [[hardware data structures]] such as store buffers and schedulers.
** When I talk about an "age based scheduler", I usually mean a [[Von Neumann age]] based scheduler.
** Of course, real hardware must implement a finite number of bits, typically wrapping

== Multithreading ==

The [[Von Neumann time]] or sequence number is unambiguous for a single threaded program.

For multithreaded programs that interact, it is necessary to establish correspondences - VN1 time of thread A corresponds to VN time of thread B,
at different points if interaction.

It should be seen that this multithreaded time is a partial order. Ideally one from which a total order can be constructed.
(Although certain microarchitectural interactions may imply cycles in such a time graph.)

I know of no standard term for this [[multithreaded time]]. [[Multithreaded Von Neuman time]].

I think that it might be nice to call it [[Lamport time]],
since Leslie Lamport was a pioneer in this area.

= Pipeline Time(s) =

Any given [[dynamic instruction instance]] desn't execute at a single point in time.,
Rather, it executes in different ways at many different points in tiime:
* [[instruction fetch time]]
* [[decode time]]
* [[schedule time]]
* [[operand ready time]]
* [[execution time]]
* [[register read time]]
* [[writeback time]]
* re-execution time]] or [[replay time]]
* [[commit time]]
* [[retirement time]]

Plus of course there is the
* [[Von Neuman time]] or [[Von Neumann sequence number]] of a [[dynamic instruction instance]].

For that matter
* the [[Von Neumann time]] relative to a given thread of execution
may differe from the
* [[Von Neuman time]] relative to the processor it is executed on
because of [[timesharing]] or [[multitasking]] between threads or processors.

= Simulator Times =

* Real time of the machine on which the simulator runs: [[simulator real time]]
* [[Simulated real time]] of the workload running on the simulator: [[simulatee real time]]
* [[Von Neumann time]] or sequence number for the workload running on the simulator.

This can be fully recursive. A simulator may be ruunning on top of a simulator on top of ...

= Conclusion =

Be careful what notion of time you are talking about.
It may not be the notion of time for the person you are talking to.

[[Microarchitecture]] is concerned with [[real time]].
[[Macroarchitecture]] is concerned more with abstract time, such as [[Von Neumann time]].
Systems that have different microarchitectures and hence different real time behavior,
may execute the same [[macroarchitecture]].
But, of course, [[macroarchitecture]] features such as [[ISA]] greatly influence [[real time]] performance.

Thursday, June 23, 2011

cygwin startxwin problem with multiple displays

I use cygwin on my Windows PCs. Especially my tablet PC. AFAIK there is no equally good UNIX or Linux or Apple equivalent for a Microsoft Windows tablet PC. Tell me if there is. (Yes, I write and draw with the pen.))

Unfortunately, my cygwin setup has decayed - things stopped working on various upgrades, and I do not always fix the problem immediately. Here's one:

For a while, I would use startxwin to start up X Windows in Cygwin's multiwindow mode. At some point, it worked, but the xterm started by default was off obably havescreen, could not be seen. If I maximized I could find it, but when non-maximized, not seen.

Turns out that startxwin's default was to start "xterm -geometry +1+1 ...".

And I have been using multiple displays - e.g. currently have 4. Irregularly arranged. So that there is no display at +1+1 in the bounding box of all displays.

FIX: provide my own ~/.startxwinrc, with no location specified for the xterm.


Meta-issue: almost want a sanity check, to make sure that no windows are raised offscreen, invisibly.

Overall, there are probably a lot of such issues relating to assumptions about screen geometry being regular and rectangular, that no longer hold in the world of multiple displays.


Another meta-issue: took a while to find out where the xterm was coming from.

It would have been nice to have an audit trail that allowed me to find where the xterm came from.

Similarly, to find out where its geometry parameter came from.

As it was, I just guessed "startxwin". But I am embarassed to say how long it took...

Sunday, June 19, 2011

A vocabulary of terms for memory operation combining

There are many different terms for the common optimization of memory operation combining or coalescing.
See the discussions of the [[difference between write combining and write coalescing]],
as well as [[write combining]] and [[write coalescing]],
and [[NYU Ultracoomputer]] [[fetch-and-op]] [[combining networks]].

Here is a by no means complete list of such terms:

* [[combining]]
** [[write combining]]
*** [[write combining buffer]]s
*** [[write combining cache]]
** [[read combining]]
** [[fetch-and-op combining]]
*** [[NYU Ultracomputer]] [[fetch-and-op]] [[combining networks]]

* [[coalescing]]
:: on GPUs...
** [[write coalescing]]
** [[read coalescing]]

I have not heard anyone talk about fetch-and-op or atomic RMW coalescing
as distinct from the historic
[[NYU Ultracomputer]] [[fetch-and-op combining]].
But I suppose that inevitably this will arise.

* [[squashing]]
:: mainly refers to finding that an operation is unnecessary, and cancelling it - or at least the unnecessary part
** [[load squashing]]
::: On P6, referred to a cache miss finding that there was already a cache miss in flight for the same cache line. No need to issue a new bus request, but the squashed request was not completely cancelled - arrangement was made so that data return from the squashing cache miss would
** [[store squashing]]
::: I am not aware of this being done, but it could refer to comparing store addresses in a [[store buffer]], and determining that an older store is unnecessary since a later store completely overwrites it (and there would be no intervening operations that make it visible according to the [[memory ordering model]]). (Actually, I am not sure that there could be such, but I am leaving the possibility as a placeholder.)
::: [[Write combining]] accomplishes much the same thing, although [[store squashing]] in the store buffer gets it done earlier.
::: Note that this is similar to [[store buffer combining]] - combining entries in the store buffer, separate from a [[write combining buffer]].

Again, I have not heard of fetch-and-op squashing, although I am sure that it could be done, e.g. for lossy operations such as AND and OR (a later fetch-and-OR with a bitmask that completely includes an earlier...).

* [[snarfing]]
:: I have usually seen this term used in combination with a [[snoopy bus]], although it can also be used with other interconnects. [[Read snarfing]] means that a pending request from P1 that has not yet received the bus in arbitration observes the data for the same cachelines from a different processor P0, and "snarfs" the data as it goes by.
Depending on the cache protocol, it may be necessary to assert a signal to put data into [[shared (S) state]] rather than [[exclusive (E) state]].

I am not sure what [[write snarfing]] would look like.
An [[update cache]] is somewhat like, but it updates an existing cache line, not an existing request.
I.e. an update cache protocol snarfs write data from the bus or interconnect to update a cache line.
Whereas a pending load can snarf data either from read replies or write data transactions on the bus or interconnect.
I.e. a read can snarf from a read or a write.
But does a write "snarf"?
More like a write may combine with another write -
[[snoopy bus based write combining]],
as distinct from [[buffer based write combining]].

Difference between write combining and write coalescing

[[Write coalescing]] is the term some GPUs, notably AMD/ATI and Nvidia, use to describe how they, umm, combine or coalesce writes from different N different SIMD threads into a single, or at least fewer than N, accesses. There is also [[read coalescing]], and one can imagine other forms of coalescing, such as atomic fetch-and-op coalescing.

At AFDS11 I (Glew) asked an AMD/ATI GPU architect
"What is the difference between [[write coalescing]] and [[write combining]]?"

He replied that [[write combining]] was an x86 CPU feature that used a [[write combining buffer]],
whereas [[write coalescing]] was a GPU feature that performed the optimization between multiple writes that were occurring simultaneously, not in a buffer.


Since I (Glew) had a lot to do with x86 write combining
- arguably I invented it on P6, although I was inspired by a long line of work in this area,
most notably the [[NYU Ultracomputer]] [[fetch-and-op]] [[combining network]]
- I am not sure that this distinction is fundamental.

Or, rather, it _is_ useful to distinguish between buffer based implementations and implementations that look at simultaneous accesses.

However, in the original NYU terminology, [[combining]] referred to both:
operations received at the same time by a switch in the [[combining network]],
and operations received at a later time that match an operation buffered in the switch,
awaiting either to be forwarded on,
or a reply.
(I'm not sure which was in the Ultracomputer.)

A single P6 processor only did one store per cycle, so a buffer based implementation that performed [[write combining]] between stores
at different times was the only possibility. Or at least the most useful.
Combining stores from different processors was not done (at least, not inside the processor, and could not legally be done to all UC stores).

The NYU Ultracomputer performed this optimization in a switch for multiple processors,
so combining both simultaneous operations and operations performed at different times
was a possibility.

GPUs do many, many, stores at the same time, in a [[data memory coherent]] manner.
This creates a great opportunity for optimizing simultaneous stores.
Although I would be surprised and disappointed to learn that
GPUs did not combine or coalesce
(a) stores from different cycles in the typically 4 cycle wavefront or warp,
(b) stores from different SIMD engines, if they encounter each other on the way to memory.

I conclude therefore that the difference between [[write combining]] and [[write coalescing]] is really one of emphasis.
Indeed, this may be yet another example where my
(Glew's) predilection is to [[create new terms by using adjectives]],
e.g. [[write combining buffer]] or [[buffer-based write combining]]
versus [[simultaneous write combining]] (or the [[AFAIK]] hypiothetical special case [[snoop based write combining]]),
rather than creating gratuitous new terminology,
such as [[write combining]] (implicitly restricted to buffer based)
versus [[write coalescing]] (simultaneous, + ...).

= See Also =

This discussion prompts me to create

* [[a vocabulary of terms for memory operation combining]]

Wednesday, June 15, 2011

Write combining


Many processors have [[write combining]] support. The [[WC buffers]] and [[USWC memory type]] that I (Andy Glew) added to the Intel P6 are probably by no means the first, or last, such feature. Although arguably pretty successful.

= The basic idea of write combining =

Say that you have a memory bus of width N, e.g. N=64b.

But say that software, for its own nefarious reasons, is writing only in size M, M E.g. say that software is writing a byte at a time.

Simplistically, each such byte write would be wasting 7/8 of the 64b bus.

The basic idea of a write combining buffer is to have a buffer at least N bits wide, with subset validity bits
- e.g. 64 bits, with 1 bit per byte indicating that a byte has been written.

With the valid bits initially empty, as you write to the WC buffer you place data in the corresponding byte, and set the corresponding byte valid bit.

At some later point in time, you may evict the WC buffer. If all of the bytes have been written, you use a single efficient N=64b write. If not, you do some sort of [[partial eviction]].

If the silly software completely overwrites the WC buffer, you have used 8/8 of the bus bandwidth, rather than only 1/8.

== A more realistic example ==

There may be several WC buffers,
each a full cache line (e.g. 64B (not bits, but Bytes) in length.

The [[bus]] is optimized for 8-chunk bursts.
The bus may permit full utilization if 64B burst transfers, but smaller transfers are less efficient es
- and, in fact, in some systems occupy exactly the same number of cycles.
Let's say 8 cycles - 8x8B.

The processor may be clocked faster than the bus. E.g. 8GHz, versus a 1GHz memory bus.

Processor does 64b stores to uncached bit not memory mapped I/O.
Each creates a [[bus write partial line]] command,
occupying 8 cycles on the bus in our example.

If, instead, we use a write combining buffer,
the first 64b/8B store may allocate the [[WC buffer]],
storing data in it
and and set the byte valid bits to

The second 64b/8B store may hit the WC buffer, and set the byte valid bits to

And so on..

When the [[wrte combining buffer]] is full, and when it is eventually evicted,
then a [[bus write full line]] transaction would be done.
In our contrived example, occupying the same 8 cycles as the bus write partial line transcation for each of the stores separately.
Bottom line: 8X speedup.
1/8 the bus occupancy - freeing up the bus for other use.

The point of this exercise was to show that write combining does not need to work with tiny 64b buffers.
Full cache line buffers are possible, and, in fact, are what the Intel P6 and most subsequent x86 machines have.
Also, clocking the processor faster than the bus may motivate WC.
Also, it's not just a question of silly software doing 8b writes: even software doing 64b writes may benefit.

Real world example: on the Intel P6 family, a 64B full cache line transfer ([[BWL (Bus Write Line)]] occupied circa 5 cycles.
A [[BWP (Bus Write Partial)]] transaction occupied 3 cycles.
So the speedup is not quite so extreme as in the contrived example, but is still significant.
Even if the partial writes had been optimized, integer code could only do 32 bit stores, and the bus was 64b wide.

== Write Combining versus Wide Instructions ==

Some may consider that write combining is a poor substitute for having wider instructions.
I.e. instead of building a complicated write combining mechanism, why not create full cache line wide instructions such as
* Load/Store 64B (512b) between vector registers and memory
* [[Load/store-multiple-registers instruction]]

Now, I am nearly always an advocate of explicit software control, even here.
I am quite in favor of vector instructions,
and less so, but still positive, on load/store-multiple-registers.

But history shows: Intel x86 had only 32b integer registers
and worked quite successfully with write combining for more than 5 years before [[x86-64]], 1996-2002;
and even at the time I am writing this does not have 512b/64B full cache line registers.
(Proposed in LRBNI, but not yet implemented.)
I.e. history shows that you can be successful for quite a long time without explicit instruction set support.

Explicit instruction set support may have advantages in other ways.
But it also has costs, e.g. context switching the vector registers.

== Write Combining versus Writeback, DMA, ... ==

Many other arguments have been made against write combining.

E.g. Why not just use [[writeback (WB)]] memory everywhere?
A: although there is more and more cache coherent I/O, at the time I am writing this uncached memory for framebuffers, etc.,
is still very common.

E.g. Why not use DMA engines? Or smart I/O devices such as GPUs?
A: not a complete answer, but basically these do not always exist.
Moreover, I am writing this at a GPU conference,
where the designers of AMD's Llano APU (CPU/GPU hybrid) memory subsystem
expressed, in conversation, the desire for better uncacheable / write combining performance.

Bottom line: the need for write combining is not going away.

= WC policies =

== P6 USWC ==

Intel P6 family USWC memory manages write-combining buffers
in a weakly ordered manner.
E.g. you can write A0, B0, A1, B1, etc. to two different cache lines,
and two different WC buffers will be allocated.
No account is taken, for USWC memory, of write ordering.
E.g. line B may be evicted first, so that B1 is observed before A0.

Eviction is almost random - from the point of view of a programmer.
Certain events are defined to cause evictions, including:
* interrupts
* I/O instructions
* uncacheable memory accesses (to [[UC]] memory) (which are possibly memory mapped I/O)
* possibly certain fence and flushes operations (althohgh x86 fence instructions were added later).

The motivation for these USWC WC buffer eviction policy
was mainly to try to make it look transparent when coordinating with a GPU,
sending commands via MMIO or I/O.

Note: there was no guarantee of timeliness, no guarantee that USWC writes would eventually be flushed out,
e.g. to the framebuffer. Except that mt systems at the time had regular timer interrupts.
Now, circa 2011, [[tickless OS]]es are common.
It might be advisable to have hardware periodically flush USWC.

Carl Amdahl observed that USWC was a cache - a small, non-coherent, cache - not a buffer.

== [[Left to right write combining]] ==

An alternative to P6 USWC's WC policies
was [[left to right write combining]].

This would typically only allocate a WC buffer that wrote to byte 0 of a line.
(Although starting in the middle can be imagined.)

It would perform write combining so long as the write was adjacent to, and to the right, of bytes already written in the WC buffer.

If a write was performed that was not adjacent and to the write, the WC buffer would be evicted immediately.

[[Left to right write combining]] has the putative advantage that it works with some forms of memory mapped I/O - devices for which the registers are designed so that they are written from low address to high. Typically, parameters, with the last location written triggering a side effect.
This is actually a very attractive feature, since it allows efficient MMIO for all devices designed in this way.
Unfortunately, it is not compatible with all MMIO devices - some MMIO devices actually look at the size of the bus transaction, interpretinbg that as an aspect of the MMIO command. (You might consider this stupid, but... )

[[Left to right write combining]] also has the advantage that it preserves write ordering So long as evictions are done left to write.

In an ideal world, there would be two write-combining memory types: left to right, and USWC.
Unfortunately, in P6's [[feature diets]] we had to choose only one, and USWC was it.
USWC is better for certain types of framebuffer worklolads.
[[Left to right write combining]] for well behaved MMIO.

== WC for WB ==

The previous two topics, USWC and [[Left to right write combining]],
deal with write combining with what is fundamentally uncached memory.

Write combining can also be used for [[WB (Writeback)]] memory.

If weakly ordered, this is straightforward.
Similarly, it is straightforward if used to optimize writethrough traffic,
e.g. from a WT L1$ to a WB L2$.
So long as exclusive owne
rship has already been obtained.

However, write combining for a store-ordered memory system like [[TSO]] or [[processor consistency]]
is more of a challenge.

TBD: seek the patents on this. Public info.

= Eviction Mechanism =
At some point it is necessary to evict a write combining buffer.

If the buffer is completely written, then evicting it,
on USWC memory, is straightforward:
* use a [[BWL (Burst Write Line)]] bus transaction that writes the entire line
:: (Caveat: some systems require obtaining ownership before writing like this. But nt P6.)

If the buffer is not completely written, you can use any of the following
* read the missing bytes using a [[BRL (Burst Read Line)]], merge, and then write using [[BWL]].
** works but potentially uses MORE memory traffic than not write combining
* using the most efficient sequence of [[partial writes]] possible
** e.g. using 64b [[write bytes under mask bus transaction]]s - eliding empty chunks
** some systems do not have [[write bytes under mask bus transaction]], but only support 8b, 16b, 32b, etc. partial writes. A state machine can emit a sequence of these
*** note: this probably violates any pretense of the original writes being atomic, particularly if not aligned
* ideally, use an efficient [[BWLM (Burst Write Line under Mask)]] bus transaction.
** this might consist of 4 or 8 data chunks, along with a data chunk that contains a 64 bit mask for the operation.
*** Issue: is the mask attached to address or data? Possibly both. Possibly there is no distinction.
*** Aside: in [[BS: Bitmask Coherency for Writeback Caches]] I discuss the possibility of having two bitmasks, dirty and clean, on such a bus transaction. Possibly with a read version.

= TBD - fatigue =

I'm too tired to finish this. Let me just list some topics:

* [[WC buffer as a store buffer extension]] - constraining eviction order if you want to maintain processor conssistency.

* The [[difference between write combining and write coalescing]]

Writethrough with write allocate versus no write allocate


Write-through caches can be write allocate or no-write allocate.

A no-write allocate write-through cache can only put data in the cache on a read.
A write that misses is written through, to the next cache level or memory, but not stored in the cache.

A write-through with write-allocate cache, if it does not read clean data,
must necessarily have valid bits for bytes or words within the cache line.
Unless the writes can only be cache line sized.

Actually, we can imagine a write-through with write allocate cache, that reads clean data,
and thereby avoids the need to have byte valid bits.
I was about to say that this somewhat misses the point...
and it certainly would on "write and forget" systems.
But, some systems must obtain ownership even on write-through caches.
E.g. the IBM z-series (descendants of the System 360)
must ensure that all other copies of a cache line are invalidated before
"performing" the write.
If you have to do that, you might almost as well obtain the clean data,
and not have to maintain byte dirty bits.

* P6 family: the write-through invalidates other caches, but the data can be read from the local cache before the remote invalidations. Do not need to get a reply from the invalidation.
* IBM family: must invalidate before forwarding locally; i.e. must wait for the invalidation to be complete.