The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, November 28, 2010



= Monitor/Mwait =

Monitor/Mwait are a pair of instructions introduced as follows:

 What: Monitor/Mwait
 Where: Intel x86 PNI SSE3
 When: 2004 Prescott

Brief description from [[Instruction_Set_Extensions,_a_Rough_History#Monitor-Mwait]]

    Originally intended to be used for fast thread synchronization, to implement "wait until a memory location has changed" (or "may have changed", since it might exit the wait when the cache line is evicted but the variable is unchanged). Really, what people want is a cheap way to "wait until a condition on a memory variable is met", but we implement it by spinning and re-evaluating the condition, blocking as efficiently as possible to eliminate as many spins as possible: "hardware event driven". The Monitor part is necessary to handle race conditions in such loops. Unfortunately, Monitor/Mwait had problems that resulted in them not being used outside the OS.  MWAIT became a power management instruction, saying to enter a power management state like C3 or C4.

Let's discuss in more detail - not just monitor/mwait, but the class of synchronization operations of which these are an example.

== The Ideal: Monitor a Datastructure in Memory ==

What people, at least naive programmers, want to do is to be able to monitor arbitrary datastructures in memory:
* arbitrary datastructures,
* involving arbitrary numbers of memory locations (some of which are contingent, e.g. linked lists),
* evaluating arbitrary conditions.

I.e. people want to do something like

   wait until( arbitrary condition on arbitrary datastructure(s) )

or

   when( arbitrary condition on arbitrary datastructure(s) ) call handler

Note the two forms: the first synchronous, the second interrupt or event handler based.

E.g. something like
   there are 500 elements on the run queue

   the B tree is unbalanced by more than 2X

; Transactions

Note a first issue:  monitoring an arbitrary datastructure in memory may be hard to do, if the datastructure is constantly being updated, relinked, etc.
The programmer really wants to monitor this transactionally.
Later we will return to this issue inspired by hardware or software transactional memory techniques.
To start off with, however, we will retreat, past multiple memory locations known a priori (non-contingent),
all the way to monitoring a single memory location.

== Incremental Monitoring ==

First, let's get the easy part out of the way:  many such monitoring conditions can be handled by having any code that changes the datastructure evaluate the condition,
and invoke the handler.

   insert_onto_queue( Q, el )
       if( Q.size > 500 ) call handler_for_queue_gt_500

This works for single threaded code.

It even works for multithreaded code - the handler on one thread may set a shared variable or signal other threads. In many ways, monitor/mwait devolves into how to perform such cross thread notification easily.

It doesn't work when you can't modify or extend the existing code.

E.g. a debugger attempting to evaluate such a condition.  However, a debugger is a rather special case.

Note that it is usually not necessary to evaluate the whole of a complicated condition on a datastructure.  Depending on what has recently happened in the modifying code, the evaluation can often be specialized, or take an early out.

So, let's just get this out of the way, and assume that we still want to do monitor/mwait or the equivalent, between threads, at least on a single variable.
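The cross-thread notification just described can be sketched with POSIX condition variables; the names here (`queue_t`, `insert_onto_queue`, the 500-element threshold) are hypothetical, following the earlier pseudocode. This is the software-only baseline that monitor/mwait-style hardware support tries to beat:

```c
/* Hypothetical sketch: cross-thread notification of "queue > 500 elements"
 * using POSIX threads.  Names follow the pseudocode in the text. */
#include <pthread.h>

typedef struct {
    int size;
    pthread_mutex_t lock;
    pthread_cond_t  cond;   /* signaled when the condition becomes true */
} queue_t;

static void insert_onto_queue(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    q->size++;
    if (q->size > 500)
        pthread_cond_broadcast(&q->cond);  /* the "handler": wake waiters */
    pthread_mutex_unlock(&q->lock);
}

static void wait_for_queue_gt_500(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->size <= 500)                 /* re-evaluate after each wakeup */
        pthread_cond_wait(&q->cond, &q->lock);
    pthread_mutex_unlock(&q->lock);
}
```

Note that the waiter re-evaluates the condition after every wakeup - exactly the discipline that monitor/mwait also demands.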

== Monitor/Mwait ==

Restricting ourselves to a single memory location,
we want to evaluate an arbitrary condition on that single memory location.


       tmpReg := load( memLoc )
       if( arbitrary_condition(tmpReg) ) then exit loop

E.g. specific:

       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop

    The example condition is chosen to be one that is unlikely to be hardwired. In reality, the conditions are usually simpler, such as tmpReg == 0 or -1, etc., which raises the possibility of hardwiring the condition - with the usual issue of what hardwired conditions should be supported.

Such a loop is semantically "good enough" (so long as the programmer understands that the condition may have been changed by another thread immediately thereafter, e.g.
even before the loop exits; and so long as there is no need for an operation, such as "if zero then set to 1", that needs to be atomic).
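As a concrete (if hypothetical) rendering, here is the spin loop above in portable C11 atomics - semantically sufficient, but burning power and coherence bandwidth the whole time it spins:

```c
/* The spin loop from the text, in C11 atomics.  Pure spinning: no blocking,
 * so it is "good enough" semantically but wasteful in power and bus traffic. */
#include <stdatomic.h>

static int spin_until_42_or_minus_21(_Atomic int *memLoc)
{
    for (;;) {
        int tmp = atomic_load_explicit(memLoc, memory_order_acquire);
        if (tmp == 42 || tmp == -21)
            return tmp;                 /* "exit loop" */
        /* otherwise: spin and re-evaluate, as in the pseudocode */
    }
}
```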

However, such a spin loop can cause unnecessary bus traffic, waste power, etc.
So, what we are looking for with monitor/mwait is a way of (a) saving power, by blocking the spinloop, and (b) possibly allowing the operation to be exported.
The former, (a), is the main story behind monitor/mwait.
The latter, (b), may be considered an advanced topic.

Simply adding a pause inside the loop is considered unacceptable, since it adds latency to detecting the change:

       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop
       sleep for time T

What we want is a way of monitoring the memory location to see if anything has changed.
In a MESI cache protocol, it is sufficient to obtain exclusive ownership of a cache line.
If the cache line is written by another processor, the local processor loses ownership,
and the condition can be re-evaluated:

       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop
       MWAIT( memLoc ) - wait until another processor may have written the cache line

However, because the condition may initially be true, etc., the mwait is accompanied by a monitor outside the loop:

       MONITOR( memLoc )
       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop
       MWAIT( memLoc ) - wait until another processor may have written the cache line

Note  that I say "may have",
in "wait until another processor may have written the cache line".
The MWAIT typically unblocks when the cache line in question has been written,
or at least ownership has been transferred so that it may have been written.
But it will also unblock when the cacheline has been thrashed out (e.g. by another thread, if the monitor mechanism is implemented in the cache lines, and not via a dedicated monitor register),
or after a context switch to another processor,
or after going into a low power mode for a while;
... essentially, whenever you can no longer guarantee that the value has not changed.

I.e. it is not monitoring changes; it is monitoring for lack of change, and blocking while lack of change is guaranteed.

== Monitor/Mwait and LL/SC ==

Monitor/Mwait is similar to load-linked/store-conditional
- the so-called RISC form of atomic synchronization -
which only works in systems with caches, at the processor,
and does not support [[exporting synchronization operations]].

My tone of voice indicates my attitude:
M/MW, like LL/SC, is limited.
It will not work in all circumstances, neither at the low end nor at the upper end.
However, it may be a point-in-time solution.

For the most future-proof instruction set,
just as I advocate wrapping LL/SC inside instructions such as "atomic increment memory location",
I advocate using M/MW as an implementation mechanism for possible instructions such as
* Wait until memory location is nonzero
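The LL/SC half of this advice is already common practice: C11's atomic_fetch_add is exactly such a wrapper. On LL/SC machines (classic ARM, POWER, RISC-V) compilers typically expand it to a load-linked/store-conditional retry loop; on x86 it typically becomes LOCK XADD. The caller sees "atomic increment", not the implementation mechanism:

```c
/* Wrapping LL/SC inside a named operation: the caller asks for an atomic
 * increment; the compiler picks LL/SC retry loops or locked RMW instructions
 * as the implementation mechanism for the target machine. */
#include <stdatomic.h>

static int atomic_increment(_Atomic int *loc)
{
    /* returns the new value, like "atomic increment memory location" */
    return atomic_fetch_add_explicit(loc, 1, memory_order_seq_cst) + 1;
}
```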

Indeed, IBM's virtual machine experience has shown us that it is desirable to identify all sorts
of spinloops to the hardware or VMM, for virtual machines or otherwise.
(This warrants a discussion of its own - [[encapsulating or advising spinloops]].)

== Climbing Back Up ==

Now that we have single location monitor/mwait spinloops:

       MONITOR( memLoc )
       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop
       MWAIT( memLoc ) - wait until another processor may have written the cache line

it is natural to think about extending it to multiple locations.

       tmpReg := load( memLoc )
       if( tmpReg == 42 || tmpReg == -21 ) then exit loop
       MWAIT( * ) - wait until another processor may have written any of the monitored locations

You can even imagine re-evaluating the MONITOR'ed locations, e.g. to handle insertions and deletions on a list being monitored.

Note that you can probably get away with having multiple monitors, and a single MWAIT,
in much the same way as people have proposed multiple load-linked with a single store-conditional:

    loop:
        tmp1 := load-linked(memLoc1)
        tmp2 := load-linked(memLoc2)
        if( condition(tmp1,tmp2) ) then store-conditional( someMemLoc )
        if failure goto loop

Although this starts us on the slippery slope towards transactions.
First, with an explicit store commit phase

    loop:
        tmp1 := load-linked(memLoc1)
        tmp2 := load-linked(memLoc2)
        if_unchanged_then_commit_stores_atomically {
            ... the grouped stores commit here ...
        } else {
            goto loop
        }
Then relaxing the requirement of users grouping the stores.

The problem with all of these is that the limits are arbitrary.
How many memory locations can you access?  1? 2? 4?  An L1 cache in size?

Sunday, November 21, 2010

Flowchart Connectors: Coming Soon - Google Docs Help

I am very happy to learn that proper diagram drawing - connectors that stay connected - is coming soon to Google draw.


This will put me in a quandary: this makes Google drawings almost exactly what I need for my comp-arch.net wiki.

Except that the wiki aspects of Google docs and Google sites are pretty darn poor.

The URLs are unreadable. And there doesn't seem to be a way to create a link to a page that does not yet exist, that you can click on to create the page, like [[New Idea To Be Filled In Later]] - which IMHO is the essence of wiki.

Moving programming languages beyond ASCII text

Just read Poul-Henning Kamp's CACM article "Sir, Please Step Away from the ASR-33!"


Constraining ourselves to use ASCII has led to less readable programming languages and programs, and probably has also led to bugs and security breakins, as Kamp suggests with C's & and &&.

Kamp advocates using unicode, with "the entire gamut of Greek letters, mathematical and technical symbols, brackets, brockets, sprockets, and weird and wonderful glyphs..."

I am not so sure that I would go all the way to full unicode, if for no other reason than that I don't know how I would edit a program that used "...Dentistry symbol light down and horizontal with wave" (0x23c7), as Kamp finishes the previous quote. People complain about APL being write-only! But certainly a subset of unicode - subscripted letters in the major alphabets, and ideograms.

Let us end not just the tyranny of ascii, but the tyranny of English. Variable names with accents for the rest of the Europeans; variable names in Chinese and Japanese ideograms. Internationalization systems for program variable names and keywords!!!!

Okay, maybe that goes a bit far - but nevertheless I feel the pain of non-English native speakers.

Kamp goes on to talk about using wide screens, e.g. "Why can't we pull minor scopes and subroutines out in that right-hand space...?"

I'm not sure about that, but I am a big user of tables. Truth tables, etc., that make it easy to see when you have handled all of the possible cases. Arrange the code in a table; heck, click through to zoom in and see details if the code in a table cell gets too hard to read even on our wide screens.

And Kamp goes on to talk about using color for syntax. Again, amen... although that raises the possibility of red/green colorblindness causing new bugs.

Myself, I learned to program from the Algol 60 primer, with boldfaced keywords.

I think, however, most of this is presentation. We need to start using something like XML as the basis of our languages. It can express all of the above. It can be presented using color, fonts, etc, on wide and narrow screens. It can be customized to your local display devices using CSS.

I know: many people hate XML.  I suggest it mainly because it exists, and is sufficient, and there are many existing tools that can be leveraged.

And did I mention language extensions via domain-specific languages? Many conference speakers and famous California professors advocate them.

Me, I am mixed about domain specific languages: I want to express things as is natural in my domain, but I want all of my existing tools, my emacs modes, my interface generators, etc., to have a chance of handling them. More important than anything, the language should be parseable by language processors that do not necessarily know the syntax of the domain specific language in use.

Think of how hard it was to extend C to C++: whenever they added a new keyword, existing programs using that keyword as a variable broke. This will not happen if your keywords are distinguished in XML using

< keyword > if < / keyword >
< keyword ascii="if" / >
< if type="keyword" / >  
I don't terribly care which representation is used, so long as there are standard tools to convert back and forth to a form that my other tools can expect to receive.

(Although the multiple tries it took me to get Google's blogging software to type  XML < literals > - I had to add spaces - shows how much tool change is needed.)

Tuesday, November 16, 2010


I keep plodding along:


wikied, USEnet comp.arch posted, and blogged

= Array Basics =

The classic array is a 2D crosspoint array of memory cells.

In an SRAM based array, the memory cells are cross-linked inverters (or a degenerate version thereof).
In a DRAM based array, the memory cells are capacitors.
In other memory technologies, the memory cells may be a floating gate (which is really a capacitor),
a bit of glass that can be melted, magnetic material, etc., etc.

>2D arrays are possible - but 2D is the basic form.
(Actually, 1D memories have been created - e.g. delay lines, magnetic bubble racetrack memory, etc. - but that is not in the scope of this topic.)

Conventionally, one of the two dimensions, let's say X, horizontal, carries word lines, or decoded address lines.
The other, Y, vertical, carries bit lines or data lines.

    (I don't want to say just "bit lines", since multivalued logic is a possibility.)

One of the word lines is activated, and it selects all of the memory cells along that word, and causes them to connect to the bit lines. Either the value is read out, or a new value is written in.

    Usually only one word line at a time is activated. Although you could activate more than one, e.g. to OR or AND the several bits selected for read. Conversely, you could write a value to multiple words at once. However, this is uncommon; it often pushes the edges of circuitry; and some systems will, in fact, have logic to prevent this from ever happening.

= A Single Port =

Sometimes there is a single bitline; sometimes there is a pair of bitlines, typically Q and Q-bar.

Sometimes there is a single wordline, and another wire, often parallel to the wordline, that indicates read or write.

Sometimes there are separate wordlines for read and write.

== Minor Tweaks ==

Read/write is a "diffuse" signal: while there must be a word line per, well, word, for a single ported array we might be able to have a global signal that says "write". Such a global signal could be, well, diffuse - signaled in the ground plane, etc. Possibly by a light shining from the 3rd dimension. Or you could route the write enables vertically, or diagonally. Usually these are not good ideas, costing more than they save in area or power.

However, you can share read/write lines between pairs of wordlines. Or even larger sets of wordlines, at the cost of routing to get the signal to where it would need to go.

See [[byte write enables]] or [[subword write enables]]

= True Multiport Arrays =

The simplest possible array might have a single bitline, a single wordline, and a read/write signal.
Typically two horizontal wires, and one vertical.
This would implement a single port that can be used for both read and write.

The next step in complexity would be to have
* a read port, with a wordline and a bitline
* a write port, with a dedicated wordline and a bitline

I call these true ports because almost arbitrary words can be simultaneously accessed in the same array.

    Almost arbitrary... often there are constraints, such as:
    * never writing to the same bitcells (words) simultaneously
    * never reading and writing the same bitcells simultaneously
    * sometimes never reading from more than M ports at a time; sometimes the read limit is M=1

In general, each true port costs 1 X and 1 Y wire.
This implies that true ports have an area cost of O(N^2).

    At least. Many technologies require dual bitlines, or, two bitlines to write, 1 to read. Note the possibility of a hybrid port with three wordlines and two bitlines, using bit/bit-bar to write, and reading on each bitline separately.

Simple arrays with small numbers of ports may not be wire limited: they may be limited by the bitcells.
But, as the number of ports grows, eventually the O(N^2) effect kicks in.

== Enumerate Ports when designing Arrays ==

Because of the possibilities of odd tradeoffs, such as combined read/write ports, or 1-write 2-read ports, it is desirable to enumerate all of the ports you need in your [[abstract array design]], and also to think about which ports may not need to be simultaneously active.

= Array Replication for Read Multiporting =

One way to mitigate the O(N^2) effect of ports is to replicate the array.
E.g. a 2-read port, single write port array
may be implemented as two 1R 1W arrays,
with separate read ports,
but the write ports tied together.

Similarly, 2R-1W has area 9c, whereas 2x1R-1W has area 2(4c) = 8c.
The effect gets bigger as the number of read ports increases.
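The arithmetic can be made explicit. Assuming, as in the text, that each true port costs one X and one Y wire, so an N-ported bitcell occupies roughly N^2 units of some grid cost "c":

```c
/* Port-area arithmetic from the text: an N-ported bitcell costs ~N*N wire
 * grid units ("c").  Compare one multiported array against replicated
 * 1R-1W copies with their write ports tied together. */
static int bitcell_area(int ports) { return ports * ports; }

/* R read ports + W write ports in a single array: */
static int single_array_area(int r, int w) { return bitcell_area(r + w); }

/* R replicated copies, each 1R-1W (2 ports), write ports tied: */
static int replicated_area(int r) { return r * bitcell_area(1 + 1); }
```

This reproduces the 9c-vs-8c comparison above, and the gap widens with more read ports (3R-1W: 16c vs 12c).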

However, there is often a cost to distributing the wires to the arrays.
Simplistically, if all data flows along a single datapath, there is O(N^2) cost to "turning" the wires to go to the separate array.
However, circumstances may vary: it may not be necessary to turn all the wires back into the same datapath, etc.

    ;Bluesky (at the time of writing, 2010): If you can work with the third dimension, the costs of such array replication to get extra read ports are lowered: whereas in 2D routing to a separate array costs O(N^2), in 3D routing to a separate array placed on a layer above the first is O(N).

= Time Multiplexing =

Another way to get multiple ports is by time multiplexing.

You might, for example, "double pump" an array to get effectively twice the ports.
This can even be done with wave pipelining.

Similarly, you might know by construction that reads can only occur in 2 out of every 3 clock cycles, while writes may occur only every third clock cycle.

If you cannot make such guarantees, it is necessary to have a mechanism for handling conflicts.
If you can make average rate guarantees but not instantaneous rate guarantees,
a few buffers may suffice.

= Banks =

Banks are a form of spatial multiplexing.
Similar to array replication, but without tying together the write ports.

Address bits are used to assign requests to particular banks.
This weighs heavily in favor of power-of-two sized banks,
although [[non-power-of-two-sized indexing]] is possible,
especially for arrays relatively far out in the memory subsystem, such as DRAM (e.g. Matrox had 17-way interleaved memory)
or outer level caches.

Bank selection bits may be chosen so as to interleave - successive words in the logical array may be assigned to different banks. Or they may be coarse grained - e.g. the lowest half of the address space is assigned to the first bank, the upper half to a second bank.

The two may be intermixed - interleaving, and coarse grain banking.
Indeed, there may be multiple levels of interleaving and coarse grain banking.
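The two selection schemes can be sketched as index functions; the parameters here (4 banks, 8-byte words) are hypothetical:

```c
/* Hypothetical bank-selection functions for a 4-bank array of 8-byte words. */
#include <stdint.h>

#define NBANKS     4
#define WORD_SHIFT 3   /* log2(8-byte word) */

/* Fine-grained interleaving: successive words go to successive banks. */
static unsigned bank_interleaved(uint64_t addr)
{
    return (unsigned)((addr >> WORD_SHIFT) % NBANKS);
}

/* Coarse-grained banking: the address space is cut into NBANKS big slices. */
static unsigned bank_coarse(uint64_t addr, uint64_t mem_size)
{
    return (unsigned)(addr / (mem_size / NBANKS));
}
```

Mixing the two - interleaving within coarse slices, or vice versa - is just composition of these functions on different address bit ranges.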

Before we leave this topic, I want to mention that cache lines may possibly be subject to both
intra-cacheline-banking, and inter-cacheline-banking.
E.g. a 64B (512b) cache line may be divided into 8B (64b) banks within the cacheline,
while the cachelines are interleaved between coarser scale banks.

= Terminology: Banks, Ranks, etc. =

Terminology may vary.

I (Andy Glew) tend to follow my standard practice of using [[standard terms, with distinguishing adjectives]].
E.g. I say any division of an array is a bank, and the process of distributing addresses across the banks is an interleaving pattern.

Other people invent different terms:
e.g. banks within a DDR DRAM;
DRAM chips mounted on DIMMS, operated as separate ranks;
memory may be interleaved across different DRAM channels.

= Bank Conflicts and Other Port Conflicts =

Whenever there are not enough ports to meet all needs the possibility of conflicting accesses arises.

More terminology: I sometimes call banking pseudo-multiporting.

With power-of-2 bitfield extraction used to create the bank selection and indexing functions, there often arise "resonances" on common power of two boundaries. Such as 64KiB, 1MiB, etc.

Simple hashing, XORing a few upper bits into the indexing and selection functions, sometimes helps.
Non-power-of-two indexing can also help.
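That XOR hash can be sketched for a hypothetical 64-set, 64-byte-line array: two addresses 64 KiB apart hit the same set under plain bitfield extraction, but usually not once upper bits are folded in.

```c
/* Hypothetical index functions for a 64-set array of 64-byte lines,
 * illustrating the "XOR a few upper bits into the index" hash. */
#include <stdint.h>

static unsigned plain_index(uint64_t addr)
{
    return (unsigned)((addr >> 6) & 63);          /* bits [11:6] */
}

static unsigned hashed_index(uint64_t addr)
{
    /* fold upper address bits into the index to break up resonances */
    return (unsigned)(((addr >> 6) ^ (addr >> 16)) & 63);
}
```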

Mechanisms to handle bank conflicts can lead to great complexity.

The simplest case would be to stall one of the requests.
Simple - unless you are working on a statically scheduled machine without pipeline interlocks.

Simple - but of course immediately stalling is impossible in many high speed designs.

Replay or recycling the request is another way of handling bank conflicts:
a fixed pipeline's worth of requests drains out,
and is routed back to the pipeline's or array's inputs to be retried.
This is a good workable idea - but it can cause positive feedback leading to unstable behavior.
See [[Replay Pipelines]].

I don't know of any other alternatives to stall or replay - although there are many flavors of each.
However, the microarchitecture can arrange things so as to mitigate the cost of bank conflicts,
(or other port conflicts):

E.g. P6 had a pseudo-dual-ported, banked, L1 cache array, with 1 read and 1 write port. The read port always took priority, since handling such conflicts on a read is more complex because dependent operations may be scheduled;
whereas write port conflicts, while affecting bandwidth, do not have dependent operations.

Similarly, what I call a [[2 and a half ported array]] is another nice design point:
true dual ports used for two read ports,
with a write port using pseudo-multiport banking.
Bank conflicts on the dual ported reads are avoided, while on the writes bank conflicts can sometimes be tolerated,
or scheduled around.

= Related =
* [[Bank scheduling]]
* [[Read and Write Combining]]

Saturday, November 13, 2010

Content Based Query Install

Throwing documents into my document archive, I realize one of the forms of semantic exposure of deduplication that I want:

When I say "put(file)", if the file is already in the archive, I want to get told "Hey, you, the file is already in the archive!!!"

Perhaps not under the name I want to put it in.

put(Archive,nameInArchive,file) => warn if already there

put(Archive,nameInArchive,file,force=1) => force install, even if already there

Interactively, I want the default to be "don't install if already there".

Editing History

Messing around with Mercurial to keep my stuff in synch, while working on ulfs.

While I am reasonably happy with my scheme for synchronizing partial checkins and checkouts - not merging, but making candidate for merge - it's not yet ready for prime time. Hence my use of Mercurial.

Can't say continued use of Mercurial, because I have only lately switched to using Mercurial from Git.

Anyway: historically, the main reason for wanting to be able to edit history has been legal. E.g. if a court settlement or a sale meant that you were not supposed to have versions of a file that is determined not to belong to you.

Today, another reason: Mercurial on cygwin kept dying. Looks like it was related to large files. I really wanted to remove those large files from the history, but the "Thou shalt not edit history" attitude got in the way. Used hg strip, but that was too much; I really just wanted to delete a few entries. Not enter a counteracting entry, but delete the original.

Really, I am using DVCS not just for revision control of a single project, but for history tracking of an archive of largely unrelated files.

This should not be that hard. Mercurial docs make noise about names incorporating all content checksums. ... Well, while that might be a good integrity check, it suggests that such content based naming is not so great an idea. If it makes history editing hard.

Realistically, all versions of all objects should have names separate from content hashes.

Scratch that: s/name/key/

All versions of all objects should have (a) content hash names, and (b) non-content hash names.

This applies just as much to the "indexes" that list all file versions in a contour, as to individual files.

The history data structure is the interesting issue: should version-sets point to parent version sets, and, if so, by what name? If V3->V2->V1, what happens if V2 gets deleted? How do we convert V3->...->V1 to V3->V1? Especially if V3->V2 lives outside the current repository?

This suggests we may want to keep version-hashes of objects deleted, if only to act as placeholders, and to prevent them from getting recreated.