Comp-arch.net wiki on hold from October 17, 2011 - CompArch: 
'via Blog this'
= On Hold =
The comp-arch.net wiki is being put on hold - or at least to a very low boil back burner - 
as of Monday, October 17, 2011.
Reason: I (Andy Glew, the creator of and main contributor to comp-arch.net)
have decided to go back to work in industry as a computer architect at MIPS Technologies.
= (Pre)History of comp-arch.net =
I have long wanted to write a book that I threatened to call "The Art of Computer Architecture".
I would like it to be like Knuth's "The Art of Computer Programming",
except that I am no Don Knuth:  I am willing to compromise, not necessarily provide full academic references,
if in exchange I can document the "folklore" of computer architecture - the things that hardware engineers
know or used to know, but which never seem to get written down, and which get re-invented every decade or so.
The web, and wikis in particular, provided a forum and technology to allow me to write this in small pieces.
At most of the computer companies I have worked at, in particular AMD and Intel
(but also to a lesser extent at Gould and Motorola, prior to the creation of the web)
I have created wikis like this.
Whenever other engineers asked me a question
for which the answer was known, in the literature and/or folklore,
but not necessarily easily accessible,
I would write up a white paper or wiki page explaining the topic.
Bonus: often in so doing I would realize missing alternatives in a taxonomy,
which would lead to new inventions.
These company-internal web and wiki sites proved popular.
Several times, former co-workers who had left industry to become academics 
asked me if they were accessible outside the company.
I always had to say "No".
Intel and AMD never allowed me to create public versions of such websites.
Perhaps they would have given extensive legal review - but I found the process of getting such approval a large disincentive.
The document or wiki or website would be stale before approved.
In September 2009 I left Intel to join Intellectual Ventures.
One of the conditions for my joining IV was that I be allowed to work on such a public website,
http://comp-arch.net.
(I actually created the website during the brief period between leaving Intel and starting work at IV.)
I am incredibly grateful to IV for giving me that opportunity.
Progress on comp-arch.net was slow, and probably not steady, but at least visible.
In two years I had created 300 public pages on comp-arch.net.
In addition, I created a certain number of private wiki pages, sometimes on topics that I thought might be worth patenting,
sometimes because I was afraid that disclosing the topics I was writing about might create conflict with IV.
Even though my employment agreement might give me the right to work on something in public,
I would not want to get in the way of my employer's business or marketing strategy.
Such conflicts would have loomed very large for Intel
- I would have had trouble writing honestly about Itanium at the time when Itanium was Intel's main emphasis
- and were much less of a problem for IV,
but still FUD and self-censorship were an impediment
to work on comp-arch.net.
However, I say again: I am immensely grateful to Intellectual Ventures for giving me the chance to start working on comp-arch.net.
THANK YOU, IV!!!!
If I was confident I could stay state-of-the-art as a computer architect while continuing to work on comp-arch.net
and with Intellectual Ventures, I would keep doing so.
= Present of comp-arch.net =
For reasons such as these I left Intellectual Ventures to return to work in industry as a computer architect.
On October 17, 2011, I joined MIPS Technologies.
At MIPS I do not expect to be able to write pages on comp-arch.net and post them in real time.
I will continue to try to work on comp-arch.net in private,
and occasionally seek approval to make stuff public.
= Working on the Wiki =
I will also take this opportunity to work on the technology of comp-arch.net.
In 2009 I started comp-arch.net using mediawiki.
I have long wanted a better wiki technology.  My wiki wishlist is documented elsewhere on this wiki, and other places.
It includes better support for drawings,
and better support for melding of public and private content
- since it appears that such restrictions will be something I have to live with for the foreseeable future.
The computer hardware industry is not "open" as the wiki philosophy would have it,
except possibly within companies.
= Future of comp-arch.net =
As mentioned above, I hope to continue working on comp-arch.net,
making stuff public occasionally, when permitted.
Plus, if ever I retire, I hope to continue this labor of love.
= It's not just about me! =
Although I have been the main contributor to comp-arch.net,
there have been other contributors.
For example, Paul Aaron Clayton has made some contributions.
I hope that others can continue to work on comp-arch.net
during this time when I must leave it on hold.
If you are interested, please contact me, and I will arrange for access.
(I may need to do a bit of wiki work to separate the old stuff from new stuff.)
= Au revoir, et à bientôt =
I hope to see myself working on comp-arch.net again.
But for the moment, I am excited to be working towards actual shipping product again at MIPS.
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other  companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
Monday, October 17, 2011
Sunday, October 16, 2011
Friday, October 14, 2011
Mint to Quicken 2011
Mint to Quicken 2011: 
'via Blog this'
I would be posting this on a Mint.com forum, but I don't want to bother with registering, so I will post it here.
I am yet another person who would like to synch Quicken from Mint, and possibly vice versa.
Although just being able to take transactions from Mint into Quicken would be wonderful. A backup.
In my case, although I am a past Quicken user, I am not currently. Not even a Mint user. I use Yodlee. I would consider switching to any product that allowed me to keep both online and offline accounts.
'via Blog this'
I would be posting this on a Mint.com forum, but I don't want to bother with registering, so I will post it here.
I am yet another person who would like to synch Quicken from Mint, and possibly vice versa.
Although just being able to take transactions from Mint into Quicken would be wonderful. A backup.
In my case, although I am a past Quicken user, I am not currently. Not even a Mint user. I use Yodlee. I would consider switching to any product that allowed me to keep both online and offline accounts.
Wednesday, October 12, 2011
Google Plus Traffic Drops, 1269% Gains Erased
Google Plus Traffic Drops, 1269% Gains Erased: 
'via Blog this'
Google+ is declining.
But I am still using Google+.
For one big reason: the "circles" provide better access control than LinkedIn or FaceBook.
Why do you want access control? Well, it often becomes obvious, looking at LinkedIn or FaceBook updates, when a person from company 1 is interviewing with company 2. Now, you can often turn off the automatic notification - but I know former coworkers/managers who proactively searched LinkedIn and FaceBook, to see if people they knew had suddenly become, e.g. linked directly to people in other companies, or even rivals within the same company.
I.e. YOU SHOULD NOT ENTER JOB INTERVIEW CONTACTS IN LINKEDIN OR FACEBOOK!!!
(Not if there is a possibility that your present employer might retaliate.)
(Yes, there are limited fixes for this in LinkedIn and FaceBook. None satisfactory.)
Now, Google+ circles may help here. Except, as is typical with so many Google apps, they don't scale, in the user interface.
E.g. I have separate Google+ circles for all companies I used to work for, as well as old schools, etc. But when you reach more than a dozen circles, they become a pain to deal with. I currently have 24 circles - that's too many for Google's user interface. It needs some sort of hierarchy - e.g. a meta-circle of Companies containing several sub-circles of particular coompanies, and so on.
As is typical with so many Google apps, the user interface doesn't scale. People repeat, over and over and over again, the mistake of providing a flat set of categories (Google+ circles, Gmail labels), rather than providing structure and nesting. I can just imagine the "data driven" conversation at Google: "prove that people need more than 8 circles", "prove that people want nested circles and labels".
'via Blog this'
Google+ is declining.
But I am still using Google+.
For one big reason: the "circles" provide better access control than LinkedIn or FaceBook.
Why do you want access control? Well, it often becomes obvious, looking at LinkedIn or FaceBook updates, when a person from company 1 is interviewing with company 2. Now, you can often turn off the automatic notification - but I know former coworkers/managers who proactively searched LinkedIn and FaceBook, to see if people they knew had suddenly become, e.g. linked directly to people in other companies, or even rivals within the same company.
I.e. YOU SHOULD NOT ENTER JOB INTERVIEW CONTACTS IN LINKEDIN OR FACEBOOK!!!
(Not if there is a possibility that your present employer might retaliate.)
(Yes, there are limited fixes for this in LinkedIn and FaceBook. None satisfactory.)
Now, Google+ circles may help here. Except, as is typical with so many Google apps, they don't scale, in the user interface.
E.g. I have separate Google+ circles for all companies I used to work for, as well as old schools, etc. But when you reach more than a dozen circles, they become a pain to deal with. I currently have 24 circles - that's too many for Google's user interface. It needs some sort of hierarchy - e.g. a meta-circle of Companies containing several sub-circles of particular coompanies, and so on.
As is typical with so many Google apps, the user interface doesn't scale. People repeat, over and over and over again, the mistake of providing a flat set of categories (Google+ circles, Gmail labels), rather than providing structure and nesting. I can just imagine the "data driven" conversation at Google: "prove that people need more than 8 circles", "prove that people want nested circles and labels".
Tuesday, October 11, 2011
Colorado College | Block Plan
I learned about the Colorado College Block Plan from my daughter's school counselor. Instead off juggling 8 classes at a time, take one class at a time, intensely.  
I find this interesting and attractive - this is almost, for example, how I got into computers, via a summer session Concordia University organized for high school students in the Montreal region.
The school itself has the usual issues with small liberal arts colleges versus universities.
I find this interesting and attractive - this is almost, for example, how I got into computers, via a summer session Concordia University organized for high school students in the Montreal region.
The school itself has the usual issues with small liberal arts colleges versus universities.
Saturday, October 01, 2011
Load-linked/store-conditional (LL/SC)
http://semipublic.comp-arch.net/wiki/Load-linked/store-conditional_(LL/SC)
The load-linked/store-conditional instruction pair provide a [[RISC]] flavor of [[atomic RMW]] [[synchronization]], emphasizing primitives which can be flexibly composed. It can be viewed as a minimal form of [[transactional memory]].
Part of the [[RISC philosophy]] was to espouse [[load/store architecture]] - instruction sets that separated load and store memory operations
from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems
for multiprocessor synchronization.
== Pre-LL/SC ==
After years of evolution, prior to the so-called [[RISC revolution]]
multiprocessor synchronization instruction sets were converging on simple [[atomic RMW]] instructions such as
[[compare-and-swap]], [[atomic increment memory]] or [[fetch-and-add]] and other [[fetch-and-op]]s.
Now, these atomic RMWs can be seen as composed of fairly simple primitives, for example:
[[locked increment memory]]
begin-atomic
tmp := load(MEM)
tmp := tmp+1
store( MEM, tmp )
end-atomic
However, the implementation of begin-atomic/end-atomic is not necessarily simple.
The atomicity can be provided in a simple way:
[[locked increment memory]]
tmp := [[load-locked]](MEM)
tmp := tmp+1
[[store-unlock]]( MEM, tmp )
        
Where [[load-locked]] may be
* implemented at the memory module
** locking the entire module
** or a limited number of memory locations at the module
** or potentially an arbitrary number of memory locations, using per-location lock-bit [[memory metadata]], e.g. [[stolen from ECC]]
or
* implemented via a bus lock
or
* implemented via a cache protocol
** e.g. an [[address lock]]: acquiring ownership of the cache line, and then refusing to respond to [[snoops or probes]] until the [[address lock]] was released
** or, more primitive: acquiring ownership of the cache line, and then acquiring a [[cache lock]], locking the entire cache or a fraction thereof
Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.
So far, so good: [[load-locked]] and [[store-unlocked]] are somewhat RISCy.
But they have a problem: [[load-locked]] and [[store-unlocked]] as separate instructions
raise certain security, performance, and reliability problems in some (but not necessarily all) implementations.
E.g. what happens if a user uses load-locked to lock the bus,
and never unlocks it?
That *might* be interpreted as giving one user exclusive access to the bus.
Obviously, one can eliminate these problems by architecture and microarchitecture.
But doing so is a source of complexity.
Many, probably most, implementations have found it easier to package the atomic RMW
so that the primitive operations are not exposed to the user
== [[Load-linked/store-conditional]] ==
The basic idea behind [[LL/SC]] is that, instead of guaranteeing forward progress with a [[load-locked]] that always succeeds,
but which may deadlock,
[[load-linked]] would employ [[optimistic concurrency control]].
Software would assume that it had worked, perform the operation that you want to be atomic,
and then try to commit the operation using [[store-conditional]].
If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.
But if the operation were non-atomic, if somebody else had accessed the memory location, the store-conditional would fail.
And the user software would be required to handled that failure, e.g. by looping.
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Typically LL/SC are implemented via a snoopy bus protocol:
a memory location is "linked" by putting its address into a snooper.
If another location writes to it, the link is broken so that SC will fail.
Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus.
Needless to say, this does not exhibit good performance on contended systems.
Non-snoopy implementations of LL/SC are possible.
E.g. implementing them at a memory module is possible. [[TBD]].
As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code.
This is solvable - through much the same mechanisms as one uses to implement a [[locked atomic RMW]] with [[load-locked]].
One way amounts to converting a [[load-linked]] into a [[load-locked]] after a certain number of failures.
== Synthesizing Complex RMWs ==
The nice thing about [[LL/SC]] is that they can be used to implement many different forms of synchronization. E.g. almost any form of [[fetch-and-op]].
Say you want [[floating point fetch-and-add]] .. you can build that with LL/SC.
Whereas if you don't have LL/SC, and just have a fixed repertoire of integer [[atomic RMW]]s, you may just be out of luck.
This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.
The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems
for LL/SC may be more complicated than they would be for non-LL/SC instructions.
See [[list of possible atomic RMW operations]].
== [[Remote Atomic RMWs]] ==
The "most natural" implementation of [[LL/SC]] is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.
This is also the most natural implementation for locked implementations of atomic RMWs:
[[LOCK INC mem]]
tmp := [[load-locked]] M[addr]
dstreg := tmp+1
[[store-unlock]] M[addr] := dstreg
I call this the "local implementation" of the atomic RMW.
However, many systems have found it useful to create "remote implementations" of the atomic RMW.
For example, special [[bus or interconnect]] transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.
For example, the [[NYU Ultracomputer]] allowed many [[fetch-and-add]] operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.
You can imagine such [[remote atomic RMW]]s as being performed by
[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system
This is the advantage of CISCy atomic RMW instructions: they can hide the implementation.
If the interconnect fabric supports only [[fetch-and-add]] but not [[fetch-and-OR]], then the two respective [[microoperation expansions]] might be:
[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system
[[LOCK FETCH-AND-OR mem]]
oldval := [[load-locked]] M[addr]
newtmp := oldval OR src1
[[store-unlock]] M[addr] := newtmp
For that matter, the CISC implementation migh well use [[LL/SC]] optimistic concurrency control within its [[microflow]].
It is more work to create such a [[remote atomic RMW]] with [[LL/SC]], since the very strength of LL/SC
- that the user can place almost arbitrarily complicated instruction sequences between the [[LL]] and [[SC]]
is also its weakness.
If you wanted to create a [[remote atomic RMW]] implementation that supported just, say, [[fetch-and-add]],
then you would have to create an [[idiom recognizer]] that recognized
code sequences such as:
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Actually, you might not have to recognize the full sequence, complete with the retry loop.
You would only need to recognize the sequence
oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
and convert that into whatever bus operations create the [[remote atomic RMW]].
Recognizing a 3-instruction idiom is not THAT hard.
But in general [[it is harder to combine separate instructions that to break up complex instructions in an instruction decoder]].
One might even imagine creating a fairly complete set of instructions executable on the interconnect.
=== The best of both worlds? ===
Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds wrt [[local and remote atomic RMWs]].
[[LOCK FETCH-AND-OP mem]]
oldval := [[load-locked/fetch-and-op]] M[addr]
newtmp := oldval OP src1
[[store-unlock/fetch-and-op]] M[addr] := newtmp
return oldval
If M[addr] is exclusive in the locak cache, then the [[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
are equivalent to ordinary [[load-locked]] and [[store-unlock]].
If M[addr]] is not present in the local cache, then [[load-locked/fetch-and-op]] is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The [[store-unlock/fetch-and-op]] may then become a NOP (although it may be used for bookkeeping purposes).
If M[addr] is present in the local cache in a shared state, then we *COULD* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be [[serializing]], then the local update cannot proceed without coordinating with the remote.
== Extending the semantics ==
[[PAC]]:
(Paul Clayton added this.)
Because existing LL/SC definitions either fail or have [[undefined semantics]] if other memory operations are performed, the semantics can be safely extended without sacrificing [[compatibility]]. (While it may be possible in some architectures ([[TBD]]: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly more extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.
A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.
Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See [[Semantic register numbers]]). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.
A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.
== See Also ==
* [[locked versus atomic]]
* [[Returning the old versus new value: fetch-and-add versus add-to-mem]]
* [[atomic RMW spinloops and CISCy instructions]]
The load-linked/store-conditional instruction pair provide a [[RISC]] flavor of [[atomic RMW]] [[synchronization]], emphasizing primitives which can be flexibly composed. It can be viewed as a minimal form of [[transactional memory]].
Part of the [[RISC philosophy]] was to espouse [[load/store architecture]] - instruction sets that separated load and store memory operations
from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems
for multiprocessor synchronization.
== Pre-LL/SC ==
After years of evolution, prior to the so-called [[RISC revolution]]
multiprocessor synchronization instruction sets were converging on simple [[atomic RMW]] instructions such as
[[compare-and-swap]], [[atomic increment memory]] or [[fetch-and-add]] and other [[fetch-and-op]]s.
Now, these atomic RMWs can be seen as composed of fairly simple primitives, for example:
[[locked increment memory]]
begin-atomic
tmp := load(MEM)
tmp := tmp+1
store( MEM, tmp )
end-atomic
However, the implementation of begin-atomic/end-atomic is not necessarily simple.
The atomicity can be provided in a simple way:
[[locked increment memory]]
tmp := [[load-locked]](MEM)
tmp := tmp+1
[[store-unlock]]( MEM, tmp )
Where [[load-locked]] may be
* implemented at the memory module
** locking the entire module
** or a limited number of memory locations at the module
** or potentially an arbitrary number of memory locations, using per-location lock-bit [[memory metadata]], e.g. [[stolen from ECC]]
or
* implemented via a bus lock
or
* implemented via a cache protocol
** e.g. an [[address lock]]: acquiring ownership of the cache line, and then refusing to respond to [[snoops or probes]] until the [[address lock]] was released
** or, more primitive: acquiring ownership of the cache line, and then acquiring a [[cache lock]], locking the entire cache or a fraction thereof
Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.
So far, so good: [[load-locked]] and [[store-unlocked]] are somewhat RISCy.
But they have a problem: [[load-locked]] and [[store-unlocked]] as separate instructions
raise certain security, performance, and reliability problems in some (but not necessarily all) implementations.
E.g. what happens if a user uses load-locked to lock the bus,
and never unlocks it?
That *might* be interpreted as giving one user exclusive access to the bus.
Obviously, one can eliminate these problems by architecture and microarchitecture.
But doing so is a source of complexity.
Many, probably most, implementations have found it easier to package the atomic RMW
so that the primitive operations are not exposed to the user
== [[Load-linked/store-conditional]] ==
The basic idea behind [[LL/SC]] is that, instead of guaranteeing forward progress with a [[load-locked]] that always succeeds,
but which may deadlock,
[[load-linked]] would employ [[optimistic concurrency control]].
Software would assume that it had worked, perform the operation that you want to be atomic,
and then try to commit the operation using [[store-conditional]].
If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.
But if the operation were non-atomic, if somebody else had accessed the memory location, the store-conditional would fail.
And the user software would be required to handled that failure, e.g. by looping.
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Typically LL/SC are implemented via a snoopy bus protocol:
a memory location is "linked" by putting its address into a snooper.
If another location writes to it, the link is broken so that SC will fail.
Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus.
Needless to say, this does not exhibit good performance on contended systems.
Non-snoopy implementations of LL/SC are possible.
E.g. implementing them at a memory module is possible. [[TBD]].
As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code.
This is solvable - through much the same mechanisms as one uses to implement a [[locked atomic RMW]] with [[load-locked]].
One way amounts to converting a [[load-linked]] into a [[load-locked]] after a certain number of failures.
== Synthesizing Complex RMWs ==
The nice thing about [[LL/SC]] is that they can be used to implement many different forms of synchronization. E.g. almost any form of [[fetch-and-op]].
Say you want [[floating point fetch-and-add]] .. you can build that with LL/SC.
Whereas if you don't have LL/SC, and just have a fixed repertoire of integer [[atomic RMW]]s, you may just be out of luck.
This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.
The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems
for LL/SC may be more complicated than they would be for non-LL/SC instructions.
See [[list of possible atomic RMW operations]].
== [[Remote Atomic RMWs]] ==
The "most natural" implementation of [[LL/SC]] is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.
This is also the most natural implementation for locked implementations of atomic RMWs:
[[LOCK INC mem]]
tmp := [[load-locked]] M[addr]
dstreg := tmp+1
[[store-unlock]] M[addr] := dstreg
I call this the "local implementation" of the atomic RMW.
However, many systems have found it useful to create "remote implementations" of the atomic RMW.
For example, special [[bus or interconnect]] transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.
For example, the [[NYU Ultracomputer]] allowed many [[fetch-and-add]] operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.
You can imagine such [[remote atomic RMW]]s as being performed by
[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system
This is the advantage of CISCy atomic RMW instructions: they can hide the implementation.
If the interconnect fabric supports only [[fetch-and-add]] but not [[fetch-and-OR]], then the two respective [[microoperation expansions]] might be:
[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system
[[LOCK FETCH-AND-OR mem]]
oldval := [[load-locked]] M[addr]
newtmp := oldval OR src1
[[store-unlock]] M[addr] := newtmp
For that matter, the CISC implementation migh well use [[LL/SC]] optimistic concurrency control within its [[microflow]].
It is more work to create such a [[remote atomic RMW]] with [[LL/SC]], since the very strength of LL/SC
- that the user can place almost arbitrarily complicated instruction sequences between the [[LL]] and [[SC]]
is also its weakness.
If you wanted to create a [[remote atomic RMW]] implementation that supported just, say, [[fetch-and-add]],
then you would have to create an [[idiom recognizer]] that recognized
code sequences such as:
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Actually, you might not have to recognize the full sequence, complete with the retry loop.
You would only need to recognize the sequence
oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
and convert that into whatever bus operations create the [[remote atomic RMW]].
Recognizing a 3-instruction idiom is not THAT hard.
But in general [[it is harder to combine separate instructions that to break up complex instructions in an instruction decoder]].
One might even imagine creating a fairly complete set of instructions executable on the interconnect.
=== The best of both worlds? ===
Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds wrt [[local and remote atomic RMWs]].
[[LOCK FETCH-AND-OP mem]]
oldval := [[load-locked/fetch-and-op]] M[addr]
newtmp := oldval OP src1
[[store-unlock/fetch-and-op]] M[addr] := newtmp
return oldval
If M[addr] is exclusive in the locak cache, then the [[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
are equivalent to ordinary [[load-locked]] and [[store-unlock]].
If M[addr]] is not present in the local cache, then [[load-locked/fetch-and-op]] is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The [[store-unlock/fetch-and-op]] may then become a NOP (although it may be used for bookkeeping purposes).
If M[addr] is present in the local cache in a shared state, then we *COULD* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be [[serializing]], then the local update cannot proceed without coordinating with the remote.
== Extending the semantics ==
[[PAC]]:
(Paul Clayton added this.)
Because existing LL/SC definitions either fail or have [[undefined semantics]] if other memory operations are performed, the semantics can be safely extended without sacrificing [[compatibility]]. (While it may be possible in some architectures ([[TBD]]: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly more extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.
A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.
Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See [[Semantic register numbers]]). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.
A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.
== See Also ==
* [[locked versus atomic]]
* [[Returning the old versus new value: fetch-and-add versus add-to-mem]]
* [[atomic RMW spinloops and CISCy instructions]]
Subscribe to:
Comments (Atom)
 
