= Monitor/Mwait =
Monitor/Mwait are a pair of instructions introduced as follows:
What: Monitor/Mwait
Where: Intel x86 PNI SSE3
When: 2004 Prescott
Brief description from [[Instruction_Set_Extensions,_a_Rough_History#Monitor-Mwait]]:
- Originally intended to be used for fast thread synchronization, to implement "wait until a memory location has changed" (or "may have changed", since the wait may exit when the cache line is evicted even though the variable is unchanged). Really, what people want is a cheap way to "wait until a condition on a memory variable is met", but we implement that by spinning and re-evaluating the condition, blocking as efficiently as possible to eliminate as many spins as possible: "hardware event driven". The Monitor part is necessary to handle race conditions in such loops. Unfortunately, Monitor/Mwait had problems that resulted in them not being used outside the OS. MWAIT became a power management instruction, saying to enter a power management state such as C3 or C4.
Let's discuss in more detail - not just Monitor/Mwait, but the class of synchronization operations of which these are an example.
== The Ideal: Monitor a Datastructure in Memory ==
What people, at least naive programmers, want to do is to be able to monitor arbitrary datastructures in memory
* arbitrary datastructures,
* involving arbitrary numbers of memory locations (some of which are contingent, e.g. linked lists)
* evaluating arbitrary conditions.
I.e. people want to do something like
 WAIT UNTIL
     arbitrary condition on arbitrary datastructure(s)
or
 SIGNAL EVENT WHEN
     arbitrary condition on arbitrary datastructure(s)
Note the two forms: the first synchronous, the second interrupt or event handler based.
E.g. something like
 WAIT UNTIL
     there are 500 elements on the run queue
or
 WAIT UNTIL
     the B tree is unbalanced by more than 2X
; Transactions
Note a first issue: monitoring an arbitrary datastructure in memory may be hard to do, if the datastructure is constantly being updated, relinked, etc.
The programmer really wants to monitor this transactionally.
Later we will return to this issue inspired by hardware or software transactional memory techniques.
To start off with, however, we will retreat, past multiple memory locations known a priori (non-contingent),
all the way down to monitoring a single memory location.
== Incremental Monitoring ==
First, let's get the easy part out of the way: many such monitoring conditions can be handled by having any code that changes the datastructure evaluate the condition and invoke the handler.
E.g.
 insert onto queue( Q, el )
     ...
     if( Q.size > 500 ) call handler_for_queue_gt_500
This works for single-threaded code.
It even works for multithreaded code - the handler on one thread may set a shared variable or signal other threads. In many ways, monitor/mwait devolves into how to perform such cross-thread notification easily.
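For concreteness, here is a minimal sketch (mine, not from any particular codebase) of this incremental approach in C with pthreads, where the cross-thread notification is done with a condition variable; the names queue_t, insert_onto_queue, and QUEUE_THRESHOLD are made up for illustration:
 /* Hypothetical sketch: incremental monitoring of "queue has more than 500
    elements", with cross-thread notification via a condition variable. */
 #include <pthread.h>

 #define QUEUE_THRESHOLD 500

 typedef struct {
     int             size;
     pthread_mutex_t lock;
     pthread_cond_t  big_enough;   /* signalled when size crosses the threshold */
 } queue_t;

 void insert_onto_queue(queue_t *q /*, element */)
 {
     pthread_mutex_lock(&q->lock);
     /* ... link the element into the queue ... */
     q->size++;
     if (q->size > QUEUE_THRESHOLD)
         pthread_cond_broadcast(&q->big_enough);   /* the "handler": wake any waiters */
     pthread_mutex_unlock(&q->lock);
 }

 void wait_until_queue_gt_500(queue_t *q)
 {
     pthread_mutex_lock(&q->lock);
     while (q->size <= QUEUE_THRESHOLD)
         pthread_cond_wait(&q->big_enough, &q->lock);   /* blocks, no spinning */
     pthread_mutex_unlock(&q->lock);
 }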
This approach doesn't work when you can't modify or extend the existing code -
e.g. a debugger attempting to evaluate such a condition. However, a debugger is a rather special case.
Note that it is usually not necessary to evaluate the whole of a complicated condition on a datastructure: depending on what has recently happened in the modifying code, the condition can often be specialized or cut short with an early out.
So, let's just get this out of the way, and assume that we still want to do monitor/mwait or the equivalent, between threads, at least on a single variable.
== Monitor/Mwait ==
Restricting ourselves to a single memory location,
we want to evaluate an arbitrary condition on that single memory location.
E.g.
 LOOP
     tmpReg := load( memLoc )
     if( arbitrary_condition(tmpReg) ) then exit loop
 END LOOP
E.g., a specific instance:
 LOOP
     tmpReg := load( memLoc )
     if( tmpReg == 42 || tmpReg == -21 ) then exit loop
 END LOOP
- The example condition is chosen to be one that is unlikely to be hardwired. In reality, the conditions are usually simpler, such as tmpReg == 0 or tmpReg == -1, which raises the possibility of hardwiring the condition, with the usual issue of which hardwired conditions should be supported.
Such a loop is semantically "good enough", so long as the programmer understands that the condition may be changed by another thread immediately thereafter (e.g. even before the loop exits),
and so long as there is no need for an atomic read-modify-write operation such as "if zero then set to 1".
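For concreteness, a minimal C11 sketch of such a spin loop (mem_loc here is a hypothetical shared atomic variable written by some other thread):
 #include <stdatomic.h>

 extern _Atomic int mem_loc;   /* hypothetical shared variable */

 void spin_until_condition(void)
 {
     for (;;) {
         int tmp = atomic_load_explicit(&mem_loc, memory_order_acquire);
         if (tmp == 42 || tmp == -21)
             break;            /* arbitrary condition met */
         /* otherwise spin, immediately re-reading the location */
     }
 }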
However, such a spin loop can cause unnecessary bus traffic, waste power, etc.
So, what we are looking for with monitor/mwait is a way of (a) saving power, by blocking the spinloop, and (b) possibly allowing the operation to be exported.
The former, (a), is the main story behind monitor/mwait.
The latter, (b), may be considered an advanced topic.
Simply adding a pause inside the loop is considered unacceptable, since it adds latency to detecting the change:
 LOOP
     tmpReg := load( memLoc )
     if( tmpReg == 42 || tmpReg == -21 ) then exit loop
     sleep for time T
 END LOOP
What we want is a way of monitoring the memory location to see if anything has changed.
In a MESI cache protocol, it is sufficient to obtain exclusive ownership of a cache line.
If the cache line is written by another processor, the local processor loses ownership,
and the condition can be re-evaluated:
 LOOP
     tmpReg := load( memLoc )
     if( tmpReg == 42 || tmpReg == -21 ) then exit loop
     MWAIT( memLoc ) - wait until another processor may have written the cache line
 END LOOP
However, the wait must be armed before the condition is evaluated, so that a write landing between the load and the MWAIT is not missed (and so that a condition that is already true on entry is handled correctly). Hence the MWAIT is accompanied by a MONITOR that arms monitoring of the location:
 MONITOR(memLoc)
 LOOP
     tmpReg := load( memLoc )
     if( tmpReg == 42 || tmpReg == -21 ) then exit loop
     MWAIT( memLoc ) - wait until another processor may have written the cache line
 END LOOP
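On real x86 hardware this corresponds to the _mm_monitor/_mm_mwait intrinsics from <pmmintrin.h>. A sketch follows; note that MONITOR/MWAIT are ring-0-only instructions, so this only makes sense inside an OS idle or wait primitive, and that on real hardware the monitor must be re-armed before each MWAIT, so the MONITOR sits inside the loop:
 #include <pmmintrin.h>

 /* Ring-0-only sketch; mem_loc is a hypothetical shared location. */
 void mwait_until_condition(volatile int *mem_loc)
 {
     for (;;) {
         /* Arm the monitor before testing, so that a write landing between
            the load and the MWAIT is still noticed. */
         _mm_monitor((const void *)mem_loc, 0, 0);
         int tmp = *mem_loc;
         if (tmp == 42 || tmp == -21)
             break;
         _mm_mwait(0, 0);      /* block until the line may have been written */
     }
 }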
Note that I say "may have",
in "wait until another processor may have written the cache line".
The MWAIT typically unblocks when the cache line in question has been written,
or at least ownership has been transferred so that it may have been written.
But it will also unblock when the cache line has been thrashed out (e.g. by another thread, if the monitor mechanism is implemented in the cache lines, and not via a dedicated monitor register),
or after a context switch to another processor,
or after going into a low power mode for a while;
... essentially, whenever it can no longer be guaranteed that the value is unchanged.
I.e. it is not monitoring changes; it is monitoring for lack of change, and blocking while lack of change is guaranteed.
== Monitor/Mwait and LL/SC ==
Monitor/Mwait is similar to load-linked/store-conditional
- the so-called RISC form of atomic synchronization,
which only works in systems with caches, at the processor,
and which does not support [[exporting synchronization operations]].
My tone of voice indicates my attitude:
M/MW, like LL/SC, is limited.
It will not work in all circumstances, neither at the low end nor at the upper end.
However, it may be a point-in-time solution.
For the most future-proof instruction set,
just as I advocate wrapping LL/SC inside instructions such as "atomic increment memory location",
I advocate using M/MW as an implementation mechanism for possible instructions such as
* Wait until memory location is nonzero
Indeed, IBM's virtual machine experience has shown us that it is desirable to identify all sorts
of spinloops to the hardware or VMM, for virtual machines or otherwise.
(This warrants a discussion of its own - [[encapsulating or advising spinloops]].)
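As a sketch of what such a wrapped primitive might look like to software (my own illustration, not an existing API): a single function that encapsulates the spinloop, so the waiting is identified in one place and can be retargeted at MWAIT, a user-level wait instruction, a hypervisor call, or a plain PAUSE-based spin as the platform allows:
 #include <stdatomic.h>
 #include <immintrin.h>        /* _mm_pause() */

 /* Hypothetical primitive: wait until a memory location is nonzero.
    This portable fallback just spins with PAUSE; a privileged or
    HW-assisted build could implement the same interface with
    MONITOR/MWAIT instead. */
 int wait_until_nonzero(const _Atomic int *mem_loc)
 {
     int v;
     while ((v = atomic_load_explicit(mem_loc, memory_order_acquire)) == 0)
         _mm_pause();          /* hint to the CPU that this is a spin-wait loop */
     return v;
 }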
== Climbing Back Up ==
Now that we have single location monitor/mwait spinloops:
 MONITOR(memLoc)
 LOOP
     tmpReg := load( memLoc )
     if( tmpReg == 42 || tmpReg == -21 ) then exit loop
     MWAIT( memLoc ) - wait until another processor may have written the cache line
 END LOOP
it is natural to think about extending it to multiple locations.
 MONITOR(memLoc1)
 MONITOR(memLoc2)
 LOOP
     tmpReg1 := load( memLoc1 )
     tmpReg2 := load( memLoc2 )
     if( condition(tmpReg1, tmpReg2) ) then exit loop
     MWAIT( * ) - wait until another processor may have written any of the monitored locations
 END LOOP
You can even imagine re-evaluating the MONITOR'ed locations, e.g. to handle insertions and deletions on a list being monitored.
Note that you can probably get away with having multiple MONITORs and a single MWAIT,
in much the same way as people have proposed multiple load-linkeds with a single store-conditional:
 loop:
     tmp1 := load-linked(memLoc1)
     tmp2 := load-linked(memLoc2)
     if( condition(tmp1,tmp2) ) then store-conditional( someMemLoc )
     if failure goto loop
Although this starts us on the slippery slope towards transactions.
First, with an explicit store commit phase:
 loop:
     tmp1 := load-linked(memLoc1)
     tmp2 := load-linked(memLoc2)
     if_unchanged_then_commit_stores_atomically {
         store(memLoc1,...)
         store(memLoc2,...)
     }
     else {
         goto loop
     }
Then relaxing the requirement of users grouping the stores.
The problem with all of these is that the limits are arbitrary.
How many memory locations can you access? 1? 2? 4? As many as fit in an L1 cache?
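Hardware transactional memory as it eventually shipped illustrates exactly this arbitrariness: with Intel's RTM, for example, the set of monitored locations is bounded by cache capacity, and the transaction simply aborts when that limit is exceeded, so a fallback path is mandatory. A rough sketch (fallback_with_lock is a hypothetical lock-based path):
 #include <immintrin.h>        /* RTM intrinsics; compile with -mrtm */

 extern void fallback_with_lock(int *a, int *b);   /* hypothetical */

 void update_two_locations(int *a, int *b)
 {
     if (_xbegin() == _XBEGIN_STARTED) {
         *a += 1;              /* both stores commit atomically, or neither does */
         *b += 1;
         _xend();
     } else {
         /* aborted: conflict, capacity overflow, interrupt, ... */
         fallback_with_lock(a, b);
     }
 }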