The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, December 01, 2009

Links to MLP, Coherent Threading, Multistar

Urgh. Let me just add some links to the blog, from my Google docs "website" root:

Other Stuff

* MLP Yes! ILP No!
o presentation I gave at ASPLOS 98 WACI session
o preserved for more than 10 years by the session organizer at http://www.cs.berkeley.edu/~kubitron/asplos98/final.html,
+ specifically http://www.cs.berkeley.edu/~kubitron/asplos98/slides/andrew_glew.pdf
o a copy kept on Google docs: http://docs.google.com/fileview?id=F.cb345d6b-c4ac-40c6-9e71-bf5d4d18af55
+ it is unclear if Google docs allows anyone to read this - i.e. it is unclear if one can "publish" an uploaded presentation to the world

* Multistar:
o The Story Behind Multistar: http://docs.google.com/View?id=dcxddbtr_40czbtrtf2
o Multistar PDF (2004): http://docs.google.com/fileview?id=0B5qTWL2s3LcQZDIyZDVmN2EtYjY4MC00YjU2LWE4ZGMtYzk2MmU4M2U2NDQ5&hl=en

* Berkeley ParLab talk on Coherent Threading: (2009)
o Coherent Threading
Coherent Vector Lane Threading (SIMT, DIMT, NIMT)
Microarchitectures Intermediate Between SISD and MIMD, Scalar and SIMD parallel vector
o http://docs.google.com/fileview?id=0B5qTWL2s3LcQNGE3NWI4NzQtNTBhNS00YjgyLTljZGMtNTA0YjJmMGIzNDEw&hl=en

The Story Behind Multistar


I've been exploring ideas for large out-of-order machines, such as MultiClusterMultiThreading (MCMT), Multilevel Instruction Windows, and Multilevel Branch Predictors, for years - actually since before I joined Intel for P6 (which was a single level OOO machine). I worked on them especially after P6, when I attended the University of Wisconsin, where I did NOT get my PhD, and did NOT get anything published, especially NOT multilevel branch prediction, but where I gelled my MCMT ideas.

I took these ideas to Intel when I returned in 2000, and then to AMD in 2002.  Of course, I took only my UWisc ideas that were public to AMD, nothing from Intel.  I can't talk about what I did at either Intel or AMD, and it probably won't ever see the light of day. I am happy to see that AMD has announced that Bulldozer in 2011 will be an MCMT machine, even though they switched my definition of cores and clusters around. Even if AMD patents my ideas, they probably won't give me credit.

But anyway...  I left AMD in June 2004, and rejoined Intel in August 2004.  In between was one of the few periods in my career when my work was not immediately assigned to an employer like Intel or AMD.

So, I spent the summer surf kayaking at Oceanside, Oregon.  And, at the last minute, writing up these "MultiStar" ideas.   My goals were three-fold: 

(1) As usual, I just plain love computer architecture.

(2) I wanted to have something that I could start work on immediately if I decided to quit Intel and finish my Ph.D.   (The biggest pain about working at AMD was that I left behind 10 years of ideas that I had created at Intel, that I could not use at AMD.)

(3) Lastly, the idea of getting patents outside of a big company was attractive.  I have almost 100 patents through my employer; why not a few on my own?  Heck, if only I had patented the aspects of the P6 microarchitecture I invented at UIUC in my MSEE, such as my form of register renaming, HaRRM (Hardware Register Renaming Mechanism)...

So I wrote up MultiStar.  I had to go beyond any microarchitecture I had done at Intel or AMD.  I did not use any ideas that belonged to Intel or AMD.  I could only use ideas that were already public, or which I created new and fresh in the summer of 2004.  I had to invent new ways to do things that I had already invented once or twice before, at Intel or AMD.  I had to leave a few parts of the machine unfinished, because I had not invented new ways to replace what I had done earlier.

I called it "MultiStar" because I arbitrarily decided to make it an out-of-order microarchitecture with multi-level everything.  Multilevel branch prediction, I$ (easy), decoder, microcode, renamer, scheduler, register file, instruction window, retirement, data cache.  Multiple clusters. Everything.  I don't necessarily recommend multilevel everything as a way to build a machine, but, surprisingly, the ideas fit together remarkably well.  I think it could be built.

I was especially happy that I invented new ways of building a multilevel instruction scheduler and register file / operand bypass mechanism - solving problems that I had been trying to solve for years at Intel and AMD.  This solution achieves the sort of pleasing elegance that makes you feel confident you have it right. The time and place I invented this sticks in my memory (above the waves in Oceanside), like the time and place I invented the form of register renaming used in P6 (UIUC, Hwu's classroom, winter, pipes banging), and the time and place I invented Intel MMX (driving back from Princeton with Bob Dreyer, after the i750 was cancelled).

I wrote up multistar.  Emailed copies to Hwu and Patt, and a few others.  Joined Intel, disclosing multistar, all umpteen pages of it, as the "Intellectual Property Preceding Employment".

And, oh, yes, assigned multistar to an invention company to apply for patents. Using exactly the same disclosure as I provided to Intel. You can see the patent applications at the USPTO website, since they become public a short while after application.  Unfortunately, I was not able to work on the patent applications after I rejoined Intel, since I did not want to risk contaminating them.

At the time, I thought that multistar was more than 10 years ahead of what Intel or AMD would consider building. Time will tell.


By the way:  I am quite pissed off by all of the people who say that single-threaded CPU microarchitecture has run into a power wall.  Yes, power is hard, and yes, performance does not go up linearly with number of devices.  Performance only seems to go up as the square root of the number of devices, so-called Pollack's Law.  But performance still goes up.  And power need not be linear in the number of devices.

As I am wont to say, the square root of an exponential is still an exponential.
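
To put rough numbers on that quip, here is a minimal sketch. It assumes the device count doubles at a fixed period (the 2-year default is purely an illustrative assumption, not a figure from this post) and that performance follows Pollack's Law, i.e. scales as the square root of the device count. The function names are made up for this example. Since sqrt(2^(t/T)) = 2^(t/(2T)), performance is still exponential in time; it just doubles every 2T instead of every T.

```python
import math


def device_count(years, n0=1.0, doubling_period=2.0):
    """Relative device count after `years`, assuming exponential growth.

    The 2-year doubling period is an illustrative assumption.
    """
    return n0 * 2.0 ** (years / doubling_period)


def pollack_performance(devices):
    """Relative performance under Pollack's Law: proportional to sqrt(devices)."""
    return math.sqrt(devices)


def performance_doubling_period(doubling_period=2.0):
    """sqrt(2**(t/T)) == 2**(t/(2*T)), so performance doubles every 2*T."""
    return 2.0 * doubling_period


if __name__ == "__main__":
    # Print years, relative devices, and relative performance side by side.
    for years in range(0, 17, 4):
        n = device_count(years)
        print(f"{years:2d} years: devices x{n:g}, performance x{pollack_performance(n):g}")
```

Under these assumptions the performance curve has the same shape as the device curve, only stretched along the time axis by a factor of two.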

I am slightly pissed off that saying this seems to put me in the camp of single thread OOO microarchitecture bigots.  I've been working on multiprocessor and multithreaded microarchitectures for years, again since undergraduate. Sure, I like using them to build SpMT, but I am also quite eager to use them to build multithreaded systems. If you have parallel workloads.  And I have ideas how to make writing parallel code easier.  I like working on exascale supercomputer architectures with millions of processors and billions of threads.  I have been a loud advocate of highly parallel GPU-style SIMT Coherent Threaded microarchitectures.

I am NOT just a single thread OOO bigot.  I know how to make BOTH single threads and multiple threads run faster.

Single thread OOO microarchitecture ran into the power wall because Willamette was a stupid microarchitecture.  Emphasizing high frequency because it was a marketing gimmick, and because the Willamette microarchitects were not confident about how to build more advanced OOO. Single thread OOO microarchitecture ran into the power wall because the guys building Nhm were weaned on Willamette. And because Intel and AMD became reluctant to do anything that was not incremental.

Willamette had some good ideas.  Even replay, the cause of so much instability, can be used effectively, e.g. with transitive cancellation to prevent replay tornadoes.  But Willamette gave them such a bad reputation that ideas like replay may not be looked at again for 10 years.  (It's already been almost five.)

Actually, multistar is really quite incremental.  It applies a well known technique, multiple layers, to several microarchitecture data structures.  Working out the details of how to do so is not necessarily obvious.
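
For readers who have not met the general "multiple layers" technique, here is a generic sketch of it: a small, fast first level backed by a larger second level, with demotion on overflow and promotion on a second-level hit. To be clear, this is NOT the MultiStar mechanism, which is not described in this post; the class name, the LRU policy, and the capacities are all assumptions made up for illustration.

```python
from collections import OrderedDict


class TwoLevelTable:
    """Generic two-level structure: small level 1 backed by a larger level 2.

    Hypothetical illustration only; the LRU replacement policy is an assumption.
    """

    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = OrderedDict()  # least recently used entry comes first
        self.l2 = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def insert(self, key, value):
        """Place an entry in level 1, demoting the LRU entry to level 2 if full."""
        self.l2.pop(key, None)  # keep each key in at most one level
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            old_key, old_value = self.l1.popitem(last=False)
            self.l2[old_key] = old_value
            if len(self.l2) > self.l2_capacity:
                self.l2.popitem(last=False)  # drop the oldest level-2 entry

    def lookup(self, key):
        """Return (value, level) on a hit, or (None, None) on a miss."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key], 1
        if key in self.l2:
            value = self.l2.pop(key)
            self.insert(key, value)  # promote on a level-2 hit
            return value, 2
        return None, None
```

The non-obvious details alluded to above are exactly the parts this toy skips: what gets promoted or demoted, when, and at what cost in latency and power.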

Multistar was one of the best ways I knew to build large OOO machines in 2004, building on ideas in the public domain, plus a few weeks of new ideas.  It isn't even the best way I know how, although it does have some ideas that were new at the time.

Of course, my ideas continue to evolve.

Minor updates to my Google Docs website

Added multistar microarchitecture thoughts from 2004 (that are NOT owned by Intel or AMD).

Added a link to my presentation on Coherent Threading GPU architectures, given at UC Berkeley ParLab in August 2009.