Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, November 28, 2010

Monitor/Mwait

http://semipublic.comp-arch.net/wiki/Monitor-Mwait


= Monitor/Mwait =

Monitor/Mwait are a pair of instructions introduced as follows:

 What: Monitor/Mwait
 Where: Intel x86 PNI SSE3
 When: 2004 Prescott

Brief description from [[Instruction_Set_Extensions,_a_Rough_History#Monitor-Mwait]]


    Originally intended to be used for fast thread synchronization, to implement "Wait until a memory location has changed" (or "may have" changed, since might exit the wait when cache line evicted but variable unchanged). Really, what people want is a cheap way to "wait until condition on memvar is met",  but we implement by spinning and reevaluating the condition, blocking as efficiently as possible to reduce as many spins as possible. "Hardware Event Driven" The Monitor part necessary to handle race conditions in such loops. Unfortunately, Monitor/Mwait had problems that resulted in them not being used outside the OS.  MWAIT became a power management instruction, saying to enter a power management state like C3 or C4.


Let's discuss in more detail - not just monitor/mwait, but the class of synchronization operations that these are an exampleof.

== The Ideal: Monitor a Datastructure in Memory ==

What people, at least naive programmers, want to do is to be able to monitor arbitrary datastructures in memory
* arbitrary datastructures,
* involving arbitrary numbers of memory locations 9some of which are contingent, e.g. linked lists)
* evaluating arbitrary conditions.

I.e. people want to do something like

 WAIT UNTIL
   arbitrary condition on arbitrary datastructure(s)

or

 SIGNAL EVENT WHEN
   arbitrary condition on arbitrary datastructure(s)

Note the two forms: the first synchronous, the second interrupt or event handler based.

E.g. something like
 WAIT UNTIL
   there are 500 elements on the run queue

or
 WAIT UNTIL
   the B tree is unbalanced by more than 2X

; Transactions

Note a first issue:  monitoring an arbitrary datastructure in memory may be hard to do, if the datastructure is constantly being updated, relinked, etc.
The programmer really wants to monitor this transactionally.
Later we will return to this issue inspired by hardware or software transactional memory techniques.
To start off with, however, we will retreat, past multiple memory locations known a-priori (non contingent),
all the way to monitoring a single memory location.

== Incremental Monitoring ==

First, let's get the easy part out of the way:  many such monitoring conditions can be handled by having any code that changes the datastructure evaluate the condition,
and invoke the handler.

E.g.
   insert onto queue( Q, el )
       ...
       if( Q.size > 500 ) call handler_for_queue_gt_500

This works for single threaded code.

It even works for multithreaded code - the handler on one thread may set a shared variable or signal other threads. In many ways, monitor/mwait devolves into how to perform such cross thread notification easily.

It doesn't work when you can't modify or extend the existing code.

E.g. a debugger attempting to evaluate such a condition.  However, a debugger is such a special case.

Note that it is usually not necessary to evaluate the whole of a complicated condition on a datastructure.  Depending on what has recently happeneed in the modifying code, specialization or early out.

So, let's just get this out of the way, and assume that we still want to do monitor/mwait or the equivalent, between threads, at least on a single variable.

== Monitor/Mwait ==

Restricting ourselves to a single memory location,
we want to evaluate an arbitrary condition on that single memory location.

E.g.

    LOOP
       tmpReg := load( memLoc )
       if( arbitrary_condition(tmpReg) then exit loop
    END LOOP

E.g. specific:

    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
    END LOOP


    The example condition is chosen to be one that is unlikely to be hardwired. In reality, the conditions are usually simpler, such as tmpReg == 0 or -1, etc., which raises the possibility odf hardwiring the condition. With the usual issue of what hardwired conditions should be supported.


Such a loop is semantically "good enough" (so long as the programmer understands that the condition may have been changed by another thread immediately thereafter, e.g.
even before the loop exits; and so long as there is no need for an operation such as if zero then set to 1 that needs to be atomic.

However, such a spin loop can cause unnecessary bus traffic, waste power, etc.
so, what we are looking for with monitor/mwait is a way of (a) saving power, by blocking the spinloop, and (b) possibly allowing the operation to be exported.
The former, (a), is the main story behind monitor/mwait.
The latter, (b), may be considered an advanced topic.

Simply adding a pause inside the loop is considered unacceptable, since it adds latency to detecting the change:

    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
       sleep for time T
    END LOOP

What we want is a way of monitoring the memory location to see if anything has changed.
In a MESI cache protocol, it is sufficient to obtain exclusive ownership of a cache line.
If the cache line is written by another processor, the local processor loses ownership,
and the condition can be re-evaluated:


    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
       mwait( memLoc ) - wait until another processor may have written the cache line
    END LOOP

However, because the condition may initially be true, etc., the mwait is accompanied by a monitor outside the loop:

    MONITOR(memLoc)
    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
       MWAIT( memLoc ) - wait until another processor may have written the cache line
    END LOOP

Note  that I say "may have",
in "wait until another processor may have written the cache line".
The MWAIT typically unblocks when the cache line in question has been written,
or at least ownership has been transferred so that it may have been written.
But it will also unblock when the cacheline has been thrashed out (e.g. by another thread, if the monitpr mechanism is implemennted in the cache linees, and not via a dedicated monitor reghister),
or after a context switch to another processor,
or after going into a low power mode for a while;
... essntially, whenever you can no longer guarantee that the value has changed.

I.e. it is not monitoring changes; it is monitoring for lack of change, and blocking while lack of change is guaranteed.

== Monitor/Mwait and LL/SC ==

Monitor/Mwait are similar to load-linked/store-conditional
- the so-called RISC form of atomic synchronization,
that only works in systems with caches, at the processor,
which does not [[exporting synchronization operations]].

My tone of voice indicates my attitude:
M/MW, like LL/SC, is limited.
It will not work in all circumstances, neither at the low end nor at the upper end.
however, it may be a point of time solution.

For the most future-proof instruction set
just as I advocate wrapping LL/SC inside instructions such as "atomic increment memory location",
I advocate using M/MW as an implementation mechanism for possible instructions such as
* Wait until memory location is nonzero

Indeed, IBM's virtual machie experience has shown us that it is desirable to identify all sorts
of spinloops to the hardware or VMM, for virtual machines or otherwise.
(This warrants a discussion of it's own - [[encapsulating or advising spinloops]].)

== Climbing Back Up ==

Now that we have single location monitor/mwait spinloops:

    MONITOR(memLoc)
    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
       MWAIT( memLoc ) - wait until another processor may have written the cache line
    END LOOP

it is natural to think about extending it to multiple locations.

    MONITOR(memLoc1)
    MONITOR(memLoc2)
    LOOP
       tmpReg := load( memLoc )
       if( tmpReg = 42 || tmpReg == -21 ) then exit loop
       MWAIT( * ) - wait until another processor may have written any of the monitored locations
    END LOOP

You can even imagine re-evaluating the MONITOR'ed locations, e.g. to handle insertion and deletions on a list being monitored.

Note that you can probably get away with having multiple monitors, and a single MWAIT.
In much the same way as people have proposed multiple load-linked with a single store conditional

  loop:
    tmp1 := load-linked(memLoc1)
    tmp2 := load-linked(memLoc2)
    if( condition(tmp1,tmp2) then store-conditional( someMemLoc )
    if failure goto loop

Although this starts us on the slippery slope towards transactions.
First, with an explicit store commit phase

  loop:
    tmp1 := load-linked(memLoc1)
    tmp2 := load-linked(memLoc2)
    if_unchanged_then_commit_stores_atomically {
        store(memLoc1,...)
        store(memLoc1,...)
    }
    else {
        goto loop
    }

Then relaxing the requirement of users grouping the stores.

The problem with all of these is that the limits are arbitrary.
How many memory locations can you access?  1? 2? 4?  An L1 cache in size?