Cores per chip double each generation. Memory bandwidth does not.
=> even Adobe Photoshop, an obviously parallel app, will have problems on multicores with limited memory bandwidth.
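Some back-of-the-envelope arithmetic on why this hurts. If chip bandwidth stays flat while cores double, each core's share halves every generation. The 32 GB/s and 2-core starting point below are made-up numbers for illustration, not figures from the talk.

```python
def bandwidth_per_core(base_bw_gbs: float, base_cores: int, generations: int) -> list[float]:
    """Per-core share of chip memory bandwidth for generations 0..generations.

    Assumes (hypothetically) that chip bandwidth stays fixed at base_bw_gbs
    while the core count doubles every generation.
    """
    shares = []
    for g in range(generations + 1):
        cores = base_cores * 2 ** g          # cores double each generation
        shares.append(base_bw_gbs / cores)   # bandwidth assumed not to grow
    return shares

if __name__ == "__main__":
    for g, share in enumerate(bandwidth_per_core(32.0, 2, 4)):
        print(f"gen {g}: {2 * 2 ** g} cores, {share:.1f} GB/s per core")
```

An embarrassingly parallel filter that streams an image through memory is limited by that shrinking share, not by the core count.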
...
Claim: to succeed, general-purpose products must implement all functions, including price, nearly as well as standalone appliances.
E.g. all in one printer / scanner / fax
E.g. cabling - USB
E.g. cameras, video/still: a general-purpose failure, for now.
More examples: cooking, appliances, etc.
Successful computing appliances:
games, storage, routers
We can't rewrite all the software? But software size doubles every 0.6-6 years.
Windows code doubles every 866 days. Linux: 2-3 year doubling.
BSD: 6 year doubling.
Browsers doubled every 216 days, early on.
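To put these doubling periods in perspective: size grows as size(0) * 2^(t/T), where T is the doubling period. A small sketch; the 2.5-year figure for Linux is an assumed midpoint of the 2-3 year range above, and the 5-year horizon is arbitrary.

```python
def growth_factor(years: float, doubling_days: float) -> float:
    """How many times larger code gets after `years`, given a doubling period in days."""
    return 2 ** (years * 365.0 / doubling_days)

if __name__ == "__main__":
    periods = {                      # days per doubling, from the notes
        "Windows": 866,
        "Linux (assumed 2.5 y)": 2.5 * 365,
        "BSD (6 y)": 6 * 365,
        "Browsers, early": 216,
    }
    for name, days in periods.items():
        print(f"{name}: {growth_factor(5, days):.1f}x in 5 years")
```

Windows' 866 days is about 2.4 years, so roughly 4x growth over 5 years; if that much code gets written anyway, rewriting for parallelism is not obviously out of reach.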
Altmann: software CAN be rewritten quickly, given a reason.
Altmann shows a slide, "Comparing Health", of how much code/DRAM goes to data, primitive bloat, tiny objects, glue, and pointer bloat. Quote about DaCapo.
AFG: Altmann thinks this is bad. I think that it is inevitable, given different organizations and abstractions.
Oracle Parallelism: 11 studies. Hundreds of instructions of ILP.
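For readers unfamiliar with oracle (dataflow-limit) studies, a minimal sketch of what they measure: with perfect branch prediction, an unlimited window, unlimited functional units, unit latency, and perfect renaming, ILP is just instruction count divided by critical-path length. The trace format here is hypothetical; real studies differ in latency models and in how they treat memory.

```python
from typing import Iterable, Optional, Sequence, Tuple

# One trace entry: (location written or None, locations read).
# Locations are strings such as "r1" or "mem:0x100"; this format is hypothetical.
Instr = Tuple[Optional[str], Sequence[str]]

def oracle_ilp(trace: Iterable[Instr]) -> float:
    """Dataflow-limit ILP: instructions divided by critical-path length.

    Only true (read-after-write) dependencies count; perfect renaming is
    assumed, so only the latest producer of each location matters.
    """
    ready_at: dict[str, int] = {}  # cycle in which each location's latest value is produced
    count = 0
    depth = 0
    for dest, srcs in trace:
        count += 1
        # Issue one cycle after the last input becomes available.
        cycle = 1 + max((ready_at.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            ready_at[dest] = cycle
        depth = max(depth, cycle)
    return count / depth if depth else 0.0

if __name__ == "__main__":
    trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r4", ["r3"]), ("r5", [])]
    print(oracle_ilp(trace))  # 5 instructions over a 3-cycle critical path
```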
Natural structure => parallelism:
while (!ended) { task1; ...; taskN; }
where either task i is independent of task j within the same iteration,
or task i in one iteration is independent of task i in other iterations.
E.g. gcc: functions are basically independent, so gcc could be parallelized by function.
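A minimal sketch of one iteration of that loop when the tasks are independent, using gcc-by-function as the example. `compile_function` is a hypothetical stand-in for per-function work that reads no state written by its siblings; it is not how gcc is actually structured.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_function(name: str) -> str:
    # Hypothetical per-function work: touches no state shared with other functions.
    return f"{name}.o"

def compile_unit(functions: list[str], workers: int = 4) -> list[str]:
    # Tasks within one iteration are independent, so they can run concurrently.
    # map() returns results in input order, so the output stays deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_function, functions))

if __name__ == "__main__":
    print(compile_unit(["main", "parse", "emit"]))
```

In CPython, threads only overlap I/O because of the GIL; for CPU-bound per-function work, `ProcessPoolExecutor` from the same module is the drop-in alternative.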
More oracle parallelism slides...
Parallelism > 5000: matmul, eqntott, fppp ...
500 - 5000 ...
50 - 500
< 50
Smoothability of parallelism onto finite hardware.
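One way to see the smoothability question: schedule a dependence graph on a machine that issues at most `width` ready instructions per cycle and compare against unlimited width. A minimal sketch, assuming unit latency, oldest-first priority, and a hypothetical DAG format (each instruction lists the indices of earlier instructions it depends on).

```python
def schedule_cycles(preds: list[list[int]], width: int) -> int:
    """Cycles to run a dependence DAG, issuing at most `width` ready instructions per cycle.

    preds[i] holds indices (< i) of the instructions that i depends on.
    Assumes unit latency and oldest-first priority.
    """
    n = len(preds)
    issued_in = [0] * n              # cycle in which each instruction issued (0 = not yet)
    remaining = set(range(n))
    cycle = 0
    while remaining:
        cycle += 1
        issued = 0
        for i in sorted(remaining):  # oldest first
            if issued == width:
                break
            # Ready only if every predecessor issued in an earlier cycle.
            if all(0 < issued_in[p] < cycle for p in preds[i]):
                issued_in[i] = cycle
                issued += 1
        remaining -= {i for i in remaining if issued_in[i] == cycle}
    return cycle

if __name__ == "__main__":
    # A burst of 6 independent instructions, then a short chain hanging off the last one.
    bursty = [[], [], [], [], [], [], [5], [6]]
    print(schedule_cycles(bursty, 2), schedule_cycles(bursty, 100))
```

Oldest-first is a deliberately naive priority: on the bursty example it delays the chain's head, whereas a critical-path-first scheduler would finish in 4 cycles at width 2. How well bursty oracle parallelism smooths onto finite hardware depends on that choice as much as on the raw numbers.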
Lots of agreement that there exists parallelism.
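Altmann proposes forking at function calls: a speculative thread runs the code after the call while the callee executes. This is SpMT (speculative multithreading), though he does not use that name.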
Says that old studies have too short a skip window.
Read/write set comparison.
Memory parallel regions.
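A minimal sketch of what read/write set comparison between two regions checks, assuming each region is summarized by the sets of addresses it reads and writes. This is the textbook dependence test, not necessarily Altmann's exact criterion.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    reads: set[int] = field(default_factory=set)
    writes: set[int] = field(default_factory=set)

def independent(earlier: Region, later: Region) -> bool:
    """True if `later` can run concurrently with `earlier` without memory conflicts."""
    raw = earlier.writes & later.reads    # true dependence: later reads what earlier writes
    war = earlier.reads & later.writes    # anti dependence: later overwrites what earlier reads
    waw = earlier.writes & later.writes   # output dependence: both write the same address
    return not (raw or war or waw)

if __name__ == "__main__":
    a = Region(reads={1, 2}, writes={3})
    print(independent(a, Region(reads={4}, writes={5})))  # no shared addresses
    print(independent(a, Region(reads={3}, writes={6})))  # reads what a writes
```

If speculative writes are buffered (e.g. in private copies), the WAR and WAW checks can be dropped, leaving only the true-dependence check.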
AFG Q: do you have a good algorithm for computing memory parallel regions?
Example: 4K regions in Livermore Loops have too little independence; 64K regions give almost 100% coverage.
Go: 4K window size is bad, 64K gives 70%, 2M almost 90%.
Stanford integer: 4K windows are better than larger windows.
Function ... at depth 3: no dependent loads in the following 159,050 instructions.
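Why window size matters can be seen with a toy metric. The `window_coverage` function below is hypothetical, an illustration rather than Altmann's methodology: chop a trace of `("R"|"W", address)` accesses into fixed windows, and count the fraction of windows whose upward-exposed reads (reads not preceded by a write in the same window) touch nothing written by the window just before. Dependencies on windows further back are ignored.

```python
def exposed_reads(window: list[tuple[str, int]]) -> set[int]:
    """Addresses read in the window before the window itself writes them."""
    written: set[int] = set()
    exposed: set[int] = set()
    for op, addr in window:
        if op == "R" and addr not in written:
            exposed.add(addr)
        elif op == "W":
            written.add(addr)
    return exposed

def window_coverage(trace: list[tuple[str, int]], size: int) -> float:
    """Fraction of windows (after the first) with no true dependence on the previous window."""
    windows = [trace[i:i + size] for i in range(0, len(trace), size)]
    if len(windows) < 2:
        return 0.0
    independent = 0
    for prev, cur in zip(windows, windows[1:]):
        prev_writes = {addr for op, addr in prev if op == "W"}
        if not (prev_writes & exposed_reads(cur)):
            independent += 1
    return independent / (len(windows) - 1)

if __name__ == "__main__":
    trace = [("W", 1), ("R", 1), ("W", 2), ("R", 3)]
    for size in (1, 2):
        print(size, window_coverage(trace, size))
```

In this toy trace a larger window swallows the W1/R1 producer-consumer pair, so coverage rises; with other access patterns, as in the Stanford integer result above, smaller windows can come out ahead.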
Example of where 2 high-frequency cores make more sense than 8 low-frequency cores.
8 -> 2 cores => 60% frequency gain for the 2-core design. 2 cores => more cache, etc.
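A worked comparison using Amdahl's law, assuming the parallel fraction p scales perfectly across cores and run time scales as 1/frequency. The 1.6x clock comes from the 60% gain above; the p values are illustrative.

```python
def speedup(p: float, cores: int, freq: float) -> float:
    """Amdahl speedup relative to one core at frequency 1.0.

    p is the parallelizable fraction; run time is assumed to scale as 1/freq.
    """
    return freq / ((1.0 - p) + p / cores)

if __name__ == "__main__":
    for p in (0.5, 2 / 3, 0.9):
        eight_slow = speedup(p, cores=8, freq=1.0)
        two_fast = speedup(p, cores=2, freq=1.6)  # 60% clock gain from the notes
        print(f"p={p:.2f}: 8 slow cores {eight_slow:.2f}x, 2 fast cores {two_fast:.2f}x")
```

Setting the two equal gives 1 - p/2 = 1.6(1 - 7p/8), i.e. p = 2/3: below that parallel fraction the two fast cores win under these assumptions. The extra cache the 2-core design can afford is not modeled, and would only move the crossover further in its favor.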
Altmann shows slides demonstrating that frequency has grown 2-3x faster than lithography.
Frequency vs. litho: 78x relative to 1971, 3x relative to 1978.
We may be overshooting, overemphasizing multicore.
Conclusions:
multicore != CMP
more appliances
ok to rewrite software
parallelism exists
tools
but freq still helps
multicore needs more than just finding concurrency
Questions:
Talks about using copy-on-write (COW) page tables to hold speculative state.
Erik's parallelism numbers were memory parallelism only. He was ignoring register dependencies. This naturally filters out function success/failure, typically returned in a register.
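Roughly how page-level copy-on-write can hold speculative state, as a toy model of fixed-size pages. A speculative thread shares committed pages until its first store to a page, which makes a private copy; commit publishes the copies, abort drops them. All names are hypothetical, and this is not the mechanism from the talk: a real SpMT system would also have to check the speculative thread's reads against other threads' writes before committing, which is omitted here.

```python
PAGE_SIZE = 4096

class SpeculativeView:
    """Hypothetical COW view of memory for one speculative thread."""

    def __init__(self, committed: dict[int, bytearray]):
        self.committed = committed                 # page number -> committed contents
        self.private: dict[int, bytearray] = {}    # COW copies made by this thread

    def load(self, addr: int) -> int:
        page, off = divmod(addr, PAGE_SIZE)
        src = self.private[page] if page in self.private else self.committed[page]
        return src[off]

    def store(self, addr: int, value: int) -> None:
        page, off = divmod(addr, PAGE_SIZE)
        if page not in self.private:               # first write to this page: copy it
            self.private[page] = bytearray(self.committed[page])
        self.private[page][off] = value

    def commit(self) -> None:
        self.committed.update(self.private)        # speculation succeeded: publish copies
        self.private.clear()

    def abort(self) -> None:
        self.private.clear()                       # speculation failed: discard copies
```

Abort is cheap, since nothing committed was touched; the cost is one page copy on the first speculative store to each page.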