Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, June 21, 2009

Blogging from ISCA: WDDD: Is TM easier than locks?

Is TM really easier than locks? Programmers must still write critical sections.

Experiment: measure on a class of inexperienced programmers in a UT OS class.

The problem: sync gallery: a rogue's gallery of synchronization. Rogues shoot paint balls in lanes. Cleaners. Etc.

9 different variations: {single lane, two lane, cleaner thread} X {coarse, fine, TM}

TM support: Year1 DSTM2 (Herlihy 06). Year2 JDASTM (Ramadan 09). Library, not langage support.

DSTM syntax baroque. "thread.doit(method)" == thread.execute_method_as_transaction(method).

JDASTM: txbegin, txend.

Survey included experience, hours, rankings

147 students

Hours reported:

single lane: coarse grained > tm > fine grain. They did coarse grained first.


Best syntax: coarse > (TM|Coarse) > fine > (fine|conditions)

Easiest to think about: coarse > fine|conditions > tm

Defects tested using condor.

Errors taxonomy:
  • lock ordering - 8%
  • lock cond - 7%
  • lock forgot - 15%
  • cv exotic - 8%
  • cv use - 11%
  • tm exotic - 3%
  • tm forgot ..
  • tm order - not really a bug ..
TM had far fewer errors.

---

Wants support for atomic blocks.

---

I asked if there were programmer visible abort actions - if there was contention. There was not.
AFG opinion: abort actions will be a source of errors. Hard to test. I like transactions, but I hate increasing non-determinism.

Audience Q: learning curve. Why not teach TM first, and then translate into locks?

I was surprised at syntax preference for coarse grained locks. IMHO txstart/txend is almost equivalent to lock/unlock coarse grained locks. The presenter was similarly puzzled.




Blogging from ISCA: WDDD: Emmett Witchel: Mondriaan-like

Wrote paper: lots of people need metadata. Mondriaan. Pixie dust.

Mondriaan is a general solution, but must be engineered to fit a specific problem.

Mukti-paper projects. Emphasize novelty. Features disappear because they didn't work; lack of space; lack of novelty.

Mondriaan gave metadata for each 32-bit word. No alignment restrictions. No warts.

MMP hardware: like page table. Perm Table Base. Perm Table. Protection Lookaside Buffer.

Mondriaan tradeoffs:

Space:

PLB refill costs: minimize refills / maximize reach.

SW overheads writing table entries: infrequent updates and/or simple table updates.

Mondriaan stored permissions data, but can handle arbitrary metadata.

Talk is something like a design rationale for Mondriaan. 2 bit permissions for each word. (Byte?) Run length encoded. Zones <= 4 for 16 words. Fallback to pointer to a bitmap entry.

(AFG: I'm not interested in zones. Bad security model. But the metadata mechanisms may be interesting.)

RLE made writes slow.

Mondrix wrote permissions table entries a lot. RLE made it slow. Fell back to using bitmap entries. Coarser grained protection. Page granularity.

Emmett walks through examples of other work using Mondriaan-like metadata.

Colorama. Luis Ceze HPCA 2007. Explicitly uses Mondriaan. Colors data structures.
Measured dynamic memory allocations (malloc) every 1,900 instructions.

Loki. OSDI 2008. All data has 32 bit security tag. Tags per page, tags per word.

Concusion:

Balance space, PLB reach, write overhead.

PLB reach neglected.

Measuring write overhead maybe too hard for research project: suggests counting, multiplying by a cost estimate.

Audience Q: Dave Christie. PLB coherency? A: Mondrix did not deal with.

Dave Christie of AMD (also of my high school) asked about single threaded coherency and multithreaded. It sounds like he has some familiarity with the issue.
I think that it is also interesting that David chose this talk to attend.

Colorama: Gabriel Loh, Luis Ceze, Emmett: not totally practical, but interesting. Luis: had some deadlocks, might need TM support as well.

Blogging from ISCA: CARD: Panel Session, Multicore Programming

Arvind, David August, Keshav Pingali

Phrased as a debate between A and P

The motion: How should the potential of multicore processors be extracted by applications programmers?


Arvind: Intro



Implicit vs. Explicit

...composability...

Too much detail => programming to hard. But is there an appropriate level of detail.

(AFG: infinite fine grain CPUs. Expose memory, cache, line size.)

Is expressing non-determinism essential to ||?

Speculation? Or not?

Debaters: August for (implicit), Pingali against (implicit, for explicit).

The motion: How should the potential of multicore processors be extracted by applications programmers.

August



NOT parallel programming

NOT parallel compilers

FOR implicitly parallel programming, + parallelizing compilers.

What is implicitly parallel programming.

E.g. "inline" directive on function. - explicit. Implicit inlining done automatically by compiler. Often better than human.

Tim Mattson's list of >150 explicit parallel programming languages/systems.

Q: Is explicit parallel programminginherently more difficult than implicit parallel programming?

Quotes Pingali PLDI, quoting Tim Sweeney, who said that Unrel || tripled SW development.

SQL: ... implicit parallel. Another Pingali quote.


Example: commutative annotation. e.g. indicating orderof rand calls between loop iterations. "It's okay if I get my random numbers out of order."

SPECint2000, modified ~50 lines of code of 2M. "Restored the trend" ...

Is it important to be able to express non-dererminism: YES.

Speculative Execution? YES

Is explicitly || needed by anyone other than compiler writers and systems programmers? NO

Only explicit indications needed are of the form function/anti-function (ad-to-list/remove-from-list).

Allies: Church Turing Thesis

Full Employment for Compiler Writer's Thesis.

Pingali



For explicit.

Necessary evil for applications programmers writing irregular programs, e.g. linked data structures.

YES to non determinism. YES to speculation. (Data value dependent parallelism.)

Delaunay Mesh Refinement.

Don't care non-determinism.

Parallelism depends on run-time analysis. Compiler can't...

(AFG: can compiler generate the code to do the analysis?)


Galois model (non deterministic set iterators.)

Collections approach: libraries implemented parallel by experts,
used by higher level programmers. E.g. database B-trees.

Problems: no fixed set of datastructures.

Even generic datastructures need to be tuned.

In many applications the API functions do little computation. Little time in the datastructures. (AFG: but ||ing e.g. a map execute-over-all...)

August counterexamination of Pingali



A: Are there more libraries than programs?

P: Doesn't matter. App programmer still needs to optimize data structure for his app.

A: since in Q&A, I won't explain how you were wrong.

A: Are there patterns? Cliches? Compiler support for same.
In the past when compilers were a 15% solution, not enough money to develop.

(AFG: how many compilers handle even the Gang of 4 patterns? Let alone parallel programming.)

P: functional languages, compilers, haven't gone anywhere over the years.

A: analysis wrong crutch to lean on. E.g. perl (perlbmk). Parallelism in input set. No amount of analysis, neither compiler nor Perl programmer, will know ||ism.
Runtime analysis.

Audience



Mattav (sp?) UT prof: Parallelism not a problem. Implicit works. Explicit works. Tuning is the problem.

A: compiler needs to observe effects of transformation. Run-time feedback.

Mattav summarized P and A's response as "both of you are saying implicit for locality".

I asked my STL question.

To P: not using STL grounds for firing. Customizing grounds for firing. Maybe just fire them all / wait until ew generation hired?

P: STL doesn't parallelize. Is just the datastructure. (He didn't follow the map applicator). Joe programmers, doman specific. Stephanie programmers, ||. (He didn't say ninja/padawab.)

To A: people using iterators, but not map application. Can your compiler convert?

A responded: there exist tools that change datastructure, e.g. list to vector,
or to skiplist.

...

Audience Q: to A: where do you draw the line between explicit ad explicit.

A: if the annotation brings information that cannot be determined automatically, that's implicit. (AFG: copout. That's explicit. But nevertheless still appropriate.)

Arvind's close



If a course is important enough to teach to grad students, why not teach to freshmen?

Should || programming be taught to freshmen?

A: explicitly || programming taught to junior/senior (3rd/4th year)

P: Teach about parallelism in algorithms first. Later, implicit parallel at Joe level.
Later still, fully explicitly parallel.

Vote



Applications programmers do not need to write explicitly parallel programs.

The motion does not carry.

Blogging from ISCA: I fell behind

I fell behind. Laptop battery died, couldn't find a plug. Session hopped. I'll write more when I get a chance.

TBD:

FutMem tutorial

HP Kauffman datacenters.

Blogging from ISCA: ACLD: Chuck Thakker, Rethinking Data Centers

Chuck Thakker invented the PC and Ethernet.

Chuck is now a post-PC person. He has no laptop. He survives with a SmartPhone. (It wasn't clear if he showed a Windows Mobile SmartPhone or an iPhone.)

Datacenters... shipping containers full of servers.

Anecdote: shipping containers were designed as a system... they succeeded where others had failed because he made them into an ISO standard.

Anecdote: SF longshoremen fought containerization of Port of SF. Oakland did not. SF is no longer a commercial port.

Container advantages:

Side to side airflow not impeded by the server case. There is no case. The container is... Loop airflow. Big impeller fans.

Shock mounting at the server, not the rack.

Aggressive ideas:

Once through water: pump through datacenter, and then on to farms.

Power distribution: reduce conversion steps.

12VDC->1VDC point of load regulators ~90%

AC->12VDC 2 stage 85%. Can do better: combined power factor correction, etc.

AC transformers 98%

Final efficiency ~80%

UPC and backup generators aren't part of the picture until the grid fails.

Datacenter close to big hydro dam => reliability. Not cost.

Now 1 phase AC to rack. Direction 3 phase AC to rack. Balance, lower ripple. 12-20VAC? Select to maximize overall efficiency.

Commodity servers: HP, Rackable, others.

Thakker: "The PC ecosystem is awful. Terrible." IBM non-x86 servers not in PC ecosystem. "Design our own..."

Custom motherboards.

Commodity disks

Cables at front

Redesign power supply

Thakker: whats ECC / checking.

Use lower power processors, notebook rather than server. (Note, opposite HP Kauffman.)

Network switches expensve. Unreliable.

Data center network:

fixed top

limited nodes

no broadcast/multicast

security simpler

load balance

design goals:

get rid of large switches

push complexity to edge

standard link technology, but not standard protocols

he likes monsoon.

short copper, long optical

2 kinds of switch: middle of rack, boundary. Both can be implemnted using Xilinx Virtex 5 FPGAs. Prototypable...

rack

boundary: 10 Gb ports, 64 copper to rack, 64 optical outside container

rack 20 1+1 Gb to servers, 2 x10Gb to container boundary switches

AFG Q: what switches are outside the containers?

route controllers. (I am not sure if those are outside the containers)

likes monsoon valiant load balancing.

boundary switches are synchronous. source routed. less packet like. long lived connections.


Objections:

Commodity hardware ... network switches to connect commodity parts NOT commodity.

...

Conclusion:

Treating data centers as systems, and do full system optimization.


Questions:

Thakker likes ATM. ATM better than Ethernet cause it doesn't lose packets. Chuck allowed to say this, since he invented Ethernet. But IP community refused to cooperate with ATM.

I asked him a question about switches outside the containers. He said, none: the datacenter is nothing except server containers and the NOC (Network Operations Center).

I asked about containers of switches. Chuck had thought, but doesn't like reliability.

Audience Q: Server computers have a 3 year lifetime. The container itself - power, cooling - can be reused - send back to manufacturer, and upgrade computers.

Compare to servers with cooling integrated: then the cooling solution lifetime is reduced to the computer electronics lifetime.

Audience Q: free air cooling. Thakker: outside air variations. (AFG: Think, winter/summer in the Dalles.) Running electronics hotter bad, reliability.

Audience Q: is power density going to increase. Thakker: microchannels, etc., bad. 8KW/rack now. Thakker doesn't think we will see 24KW/rack.

Mark Horowitz: no more than 10 cores per chip.