Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, October 01, 2011

Load-linked/store-conditional (LL/SC)

http://semipublic.comp-arch.net/wiki/Load-linked/store-conditional_(LL/SC)

The load-linked/store-conditional instruction pair provide a [[RISC]] flavor of [[atomic RMW]] [[synchronization]], emphasizing primitives which can be flexibly composed. It can be viewed as a minimal form of [[transactional memory]].

Part of the [[RISC philosophy]] was to espouse [[load/store architecture]] - instruction sets that separated load and store memory operations
from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems
for multiprocessor synchronization.

== Pre-LL/SC ==

After years of evolution, prior to the so-called [[RISC revolution]]
multiprocessor synchronization instruction sets were converging on simple [[atomic RMW]] instructions such as
[[compare-and-swap]], [[atomic increment memory]] or [[fetch-and-add]] and other [[fetch-and-op]]s.
Now, these atomic RMWs can be seen as composed of fairly simple primitives, for example:

[[locked increment memory]]
begin-atomic
tmp := load(MEM)
tmp := tmp+1
store( MEM, tmp )
end-atomic

However, the implementation of begin-atomic/end-atomic is not necessarily simple.

The atomicity can be provided in a simple way:

[[locked increment memory]]
tmp := [[load-locked]](MEM)
tmp := tmp+1
[[store-unlock]]( MEM, tmp )

Where [[load-locked]] may be
* implemented at the memory module
** locking the entire module
** or a limited number of memory locations at the module
** or potentially an arbitrary number of memory locations, using per-location lock-bit [[memory metadata]], e.g. [[stolen from ECC]]

or
* implemented via a bus lock

or
* implemented via a cache protocol
** e.g. an [[address lock]]: acquiring ownership of the cache line, and then refusing to respond to [[snoops or probes]] until the [[address lock]] was released
** or, more primitive: acquiring ownership of the cache line, and then acquiring a [[cache lock]], locking the entire cache or a fraction thereof

Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.

So far, so good: [[load-locked]] and [[store-unlocked]] are somewhat RISCy.

But they have a problem: [[load-locked]] and [[store-unlocked]] as separate instructions
raise certain security, performance, and reliability problems in some (but not necessarily all) implementations.

E.g. what happens if a user uses load-locked to lock the bus,
and never unlocks it?
That *might* be interpreted as giving one user exclusive access to the bus.

Obviously, one can eliminate these problems by architecture and microarchitecture.
But doing so is a source of complexity.
Many, probably most, implementations have found it easier to package the atomic RMW
so that the primitive operations are not exposed to the user


== [[Load-linked/store-conditional]] ==

The basic idea behind [[LL/SC]] is that, instead of guaranteeing forward progress with a [[load-locked]] that always succeeds,
but which may deadlock,
[[load-linked]] would employ [[optimistic concurrency control]].
Software would assume that it had worked, perform the operation that you want to be atomic,
and then try to commit the operation using [[store-conditional]].

If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.

But if the operation were non-atomic, if somebody else had accessed the memory location, the store-conditional would fail.
And the user software would be required to handled that failure, e.g. by looping.

fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval

Typically LL/SC are implemented via a snoopy bus protocol:
a memory location is "linked" by putting its address into a snooper.
If another location writes to it, the link is broken so that SC will fail.

Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus.
Needless to say, this does not exhibit good performance on contended systems.

Non-snoopy implementations of LL/SC are possible.
E.g. implementing them at a memory module is possible. [[TBD]].

As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code.
This is solvable - through much the same mechanisms as one uses to implement a [[locked atomic RMW]] with [[load-locked]].

One way amounts to converting a [[load-linked]] into a [[load-locked]] after a certain number of failures.


== Synthesizing Complex RMWs ==

The nice thing about [[LL/SC]] is that they can be used to implement many different forms of synchronization. E.g. almost any form of [[fetch-and-op]].
Say you want [[floating point fetch-and-add]] .. you can build that with LL/SC.
Whereas if you don't have LL/SC, and just have a fixed repertoire of integer [[atomic RMW]]s, you may just be out of luck.

This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.

The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems
for LL/SC may be more complicated than they would be for non-LL/SC instructions.

See [[list of possible atomic RMW operations]].

== [[Remote Atomic RMWs]] ==

The "most natural" implementation of [[LL/SC]] is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.

This is also the most natural implementation for locked implementations of atomic RMWs:

[[LOCK INC mem]]
tmp := [[load-locked]] M[addr]
dstreg := tmp+1
[[store-unlock]] M[addr] := dstreg

I call this the "local implementation" of the atomic RMW.

However, many systems have found it useful to create "remote implementations" of the atomic RMW.

For example, special [[bus or interconnect]] transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.

For example, the [[NYU Ultracomputer]] allowed many [[fetch-and-add]] operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.

You can imagine such [[remote atomic RMW]]s as being performed by

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

This is the advantage of CISCy atomic RMW instructions: they can hide the implementation.
If the interconnect fabric supports only [[fetch-and-add]] but not [[fetch-and-OR]], then the two respective [[microoperation expansions]] might be:

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

[[LOCK FETCH-AND-OR mem]]
oldval := [[load-locked]] M[addr]
newtmp := oldval OR src1
[[store-unlock]] M[addr] := newtmp

For that matter, the CISC implementation migh well use [[LL/SC]] optimistic concurrency control within its [[microflow]].

It is more work to create such a [[remote atomic RMW]] with [[LL/SC]], since the very strength of LL/SC
- that the user can place almost arbitrarily complicated instruction sequences between the [[LL]] and [[SC]]
is also its weakness.
If you wanted to create a [[remote atomic RMW]] implementation that supported just, say, [[fetch-and-add]],
then you would have to create an [[idiom recognizer]] that recognized
code sequences such as:

fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval

Actually, you might not have to recognize the full sequence, complete with the retry loop.
You would only need to recognize the sequence

oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )

and convert that into whatever bus operations create the [[remote atomic RMW]].

Recognizing a 3-instruction idiom is not THAT hard.
But in general [[it is harder to combine separate instructions that to break up complex instructions in an instruction decoder]].

One might even imagine creating a fairly complete set of instructions executable on the interconnect.

=== The best of both worlds? ===

Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds wrt [[local and remote atomic RMWs]].

[[LOCK FETCH-AND-OP mem]]
oldval := [[load-locked/fetch-and-op]] M[addr]
newtmp := oldval OP src1
[[store-unlock/fetch-and-op]] M[addr] := newtmp
return oldval

If M[addr] is exclusive in the locak cache, then the [[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
are equivalent to ordinary [[load-locked]] and [[store-unlock]].

If M[addr]] is not present in the local cache, then [[load-locked/fetch-and-op]] is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The [[store-unlock/fetch-and-op]] may then become a NOP (although it may be used for bookkeeping purposes).

If M[addr] is present in the local cache in a shared state, then we *COULD* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be [[serializing]], then the local update cannot proceed without coordinating with the remote.


== Extending the semantics ==

[[PAC]]:

(Paul Clayton added this.)

Because existing LL/SC definitions either fail or have [[undefined semantics]] if other memory operations are performed, the semantics can be safely extended without sacrificing [[compatibility]]. (While it may be possible in some architectures ([[TBD]]: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly more extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.

A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.

Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See [[Semantic register numbers]]). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.

A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.

== See Also ==

* [[locked versus atomic]]
* [[Returning the old versus new value: fetch-and-add versus add-to-mem]]
* [[atomic RMW spinloops and CISCy instructions]]

No comments: