The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, April 07, 2017

Blogging from ISCA: PESPMA: Erik Altman, Exploiting Hidden Parallelism

Erik Altman, IBM. Bio: http://domino.watson.ibm.com/comm/research_people.nsf/pages/ealtman.index.html

Cores/chip double every generation. But memory bandwidth does not.

=> even Adobe Photoshop, an obviously parallel app, will have problems on multicores with limited memory bandwidth.


Claim: to succeed, general-purpose products must perform all functions, including on price, nearly as well as standalone appliances.

E.g. all in one printer / scanner / fax

E.g. cabling - USB

E.g. cameras, video/still - general purpose a failure there, for now.

More examples: cooking, appliances, etc.

Successful computing appliances:
games, storage, routers

We can't rewrite all the software? But software doubles in size every 0.6-6 years.
Windows code doubling: 866 days. Linux: 2-3 year doubling.
BSD: 6 year doubling.
Browsers doubled every 216 days, early on.
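The doubling times above are easier to compare as annual growth factors. A small sketch of the conversion (the doubling times are from the talk; the compound-growth arithmetic and the 2.5-year midpoint for Linux are my own):

```python
# Convert code-size doubling times (in days) into annual growth factors.
# Doubling times quoted in the talk; conversion is compound-growth math.

def annual_growth(doubling_days):
    """Annual multiplicative growth factor for a given doubling time."""
    return 2 ** (365.0 / doubling_days)

for name, days in [("Windows", 866),
                   ("Linux", 2.5 * 365),   # midpoint of the 2-3 year range
                   ("BSD", 6 * 365),
                   ("Browsers (early)", 216)]:
    print(f"{name}: {annual_growth(days):.2f}x per year")
```

So early browsers grew by more than 3x a year, while BSD grew by only about 12% a year.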

Altman: software CAN be rewritten quickly, given reason.

Altman shows a slide, "Comparing Health", on how much code/DRAM is related to data vs. primitive bloat, tiny objects, glue, pointer bloat. Quote about DaCapo.

AFG: Altman thinks this is bad. I think that it is inevitable, given different organizations and abstractions.

Oracle Parallelism: 11 studies. Hundreds of instructions of ILP.

Natural structure => parallelism:

while !ended: do task i

where either task i is independent of task j within the same iteration,
or task i is independent across iterations.

E.g. gcc. Functions basically independent. Could parallelize gcc by function.
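The gcc-by-function idea above can be sketched directly: when the tasks in the loop are mutually independent, the sequential loop becomes a parallel map. `compile_function` here is a hypothetical stand-in for gcc's per-function work, not anything from the talk:

```python
# Sketch of the "while !ended, do task i" structure where tasks are
# independent: the loop becomes a parallel map over the tasks.
from concurrent.futures import ThreadPoolExecutor

def compile_function(name):
    # hypothetical placeholder for independent per-function work,
    # like compiling one function in gcc
    return f"compiled {name}"

functions = ["main", "parse", "optimize", "emit"]

# Sequential form: for f in functions: compile_function(f)
# Parallel form: since no task depends on another, map across workers.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(compile_function, functions))
```

`pool.map` preserves input order, so the results line up with the task list even though the work ran concurrently.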

More oracle parallelism slides...

Parallelism > 5000: matmul, eqntott, fpppp ...

500 - 5000 ...

50 - 500

< 50

Smoothability of parallelism onto finite hardware.

Lots of agreement that there exists parallelism.

Altman proposes forking the code after a function call to run speculatively - i.e. SpMT (but he does not use that name).
Says that old studies used too short a skip window.
Read/write set comparison to detect conflicts.
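The read/write-set comparison can be sketched as a simple set intersection. The conflict rule below (squash on read-after-write or write-after-write overlap) is my conservative assumption about how such a scheme would work, not Altman's exact mechanism:

```python
# Sketch: deciding whether a speculatively forked region conflicts with
# the code it ran ahead of, by intersecting read/write sets. Addresses
# are abstract integers here; real SpMT tracks them in hardware.

def conflicts(main_writes, spec_reads, spec_writes):
    """Squash the speculative region if it read a location the main
    thread wrote (RAW), or both wrote the same location (WAW)."""
    raw = main_writes & spec_reads
    waw = main_writes & spec_writes
    return bool(raw or waw)

# Speculative thread read 0x100, which the main thread wrote -> squash.
print(conflicts({0x100, 0x104}, {0x100, 0x200}, {0x300}))  # True
# Disjoint footprints -> speculation can commit.
print(conflicts({0x100}, {0x200}, {0x300}))                # False
```

With versioned memory a WAW overlap could in principle be tolerated; treating it as a conflict keeps the sketch simple and safe.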

Memory parallel regions.

AFG Q: do you have a good algorithm for computing memory parallel regions?

Example: 4K regions in Livermore Loops have too little independence; 64K regions give almost 100% coverage.

Go - 4K window size bad, 64K 70%, 2M almost 90%

Stanford integer 4K windows better than larger windows.
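One plausible way to answer my question above, and to produce coverage numbers like those quoted: chop a dynamic memory trace into fixed-size windows and count adjacent windows whose read/write sets don't conflict. The trace format and conflict rule here are my assumptions, not the talk's algorithm:

```python
# Sketch: estimate memory-parallel coverage for a given window size.
# A trace is a list of ("R"|"W", address) pairs in dynamic order.

def window_sets(trace, start, size):
    """Read and write sets of one window of the trace."""
    reads, writes = set(), set()
    for op, addr in trace[start:start + size]:
        (reads if op == "R" else writes).add(addr)
    return reads, writes

def parallel_coverage(trace, window):
    """Fraction of adjacent window pairs with no cross-window dependence."""
    n = len(trace) // window
    if n < 2:
        return 1.0
    independent = 0
    for i in range(n - 1):
        r1, w1 = window_sets(trace, i * window, window)
        r2, w2 = window_sets(trace, (i + 1) * window, window)
        # The later window must not touch anything the earlier wrote,
        # nor write anything the earlier read.
        if not (w1 & (r2 | w2)) and not (r1 & w2):
            independent += 1
    return independent / (n - 1)

# Two windows touching disjoint addresses are fully parallel:
trace = [("R", 0), ("W", 1), ("R", 100), ("W", 101)]
print(parallel_coverage(trace, 2))  # 1.0
```

Sweeping the window size over 4K, 64K, 2M on a real trace would reproduce the kind of coverage-vs-window curves Altman shows.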

Function ... at depth 3, no dependent loads in following 159,050 instructions.

Example of where 2 high-frequency cores make more sense than 8 low-frequency cores.
Going from 8 cores to 2 => 60% frequency gain for the 2-core chip. 2 cores also => more cache, etc.
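The 2-cores-at-60%-higher-frequency tradeoff can be worked through with Amdahl's law; the winner flips depending on the parallel fraction p. The 60% figure is from the talk; framing it via Amdahl is my own:

```python
# Sketch: 2 cores at 1.6x frequency vs. 8 cores at baseline frequency.
# Frequency speeds up everything; extra cores only the parallel fraction.

def speedup(cores, freq, p):
    """Amdahl-style relative throughput for a fraction p of parallel work."""
    return freq / ((1 - p) + p / cores)

for p in (0.5, 0.9, 0.99):
    two = speedup(2, 1.6, p)    # 2 cores, +60% frequency
    eight = speedup(8, 1.0, p)  # 8 cores, baseline frequency
    print(f"p={p}: 2-core {two:.2f}x vs 8-core {eight:.2f}x")
```

At p=0.5 the fast 2-core chip wins (2.13x vs 1.78x); by p=0.9 the 8-core chip pulls ahead, and that is before counting the 2-core chip's extra cache.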

Altman shows slides demonstrating that frequency has grown 2-3x faster than lithography.

Frequency vs. litho: 78x relative to 1971, 3x relative to 1978.

We may be overshooting, overemphasizing multicore.


multicore != CMP

more appliances

ok to rewrite software

parallelism exists


but freq still helps

multicore needs more than just finding concurrency


Talks about using copy-on-write (COW) page tables to hold speculative state.

Erik's parallelism numbers were memory parallelism only; he was ignoring register dependencies. This naturally filters out dependences on function success/failure results, which are typically returned in a register.
