Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, June 20, 2009

Blogging from ISCA: AMAS-BT: Pardo, Crusoe

Transmeta Crusoe

VLIW 5 wide

Generic simulation support

Shadowed registers, commit/abort

Gated store buffer - 32 entries x 32 bytes.

load-and-protect - like ALAT (hw trap)

x86 condition code support.

PC support - low memory steering I/O vs. DRAM. A20M. Crusoe had hardware for memory map.
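The notes above mention shadowed registers with commit/abort and a gated store buffer (32 entries x 32 bytes): speculative stores are held back ("gated") until a translation commits, and simply dropped on abort. A minimal sketch of that idea, with made-up structure and names (nothing here is Transmeta's actual design):

```python
# Hypothetical sketch of a gated store buffer: speculative stores are
# buffered until commit; an abort discards them, leaving memory untouched.
class GatedStoreBuffer:
    def __init__(self, capacity=32):
        self.capacity = capacity   # Crusoe's buffer held 32 entries
        self.pending = []          # (addr, value) pairs not yet visible

    def store(self, addr, value):
        if len(self.pending) >= self.capacity:
            raise RuntimeError("buffer full: must commit before continuing")
        self.pending.append((addr, value))

    def commit(self, memory):
        # "Open the gate": drain buffered stores into real memory.
        for addr, value in self.pending:
            memory[addr] = value
        self.pending.clear()

    def abort(self):
        # Roll back: the speculative stores simply vanish.
        self.pending.clear()
```

The shadowed registers play the same role for register state: commit copies working registers to the shadow, abort copies the shadow back.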

Crusoe: x86 ISA support, entirely software decode.
Pardo argues that x86 decode is big & power hungry.

LongRun, voltage scaling. E.g. leave CPU at 90%, mem at 100%

Shade: 100 inst/sim inst. Perf 3:1 int, 1:1 FP.

Crusoe translation: 10,000 inst/inst.

Scheduling for the VLIW target is harder than for Shade's RISC SPARC target.

x86 reuse rates low.
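The two numbers above - ~10,000 host instructions to translate one guest instruction for Crusoe, versus ~100 for Shade - only pay off if translated code is re-executed enough times. A back-of-envelope break-even calculation, with the per-execution costs being my own assumed illustrative numbers:

```python
# When does translating pay off versus interpreting?
# translate_cost = 10,000 inst/inst is from the talk; the per-execution
# costs below are assumed for illustration only.
def break_even_executions(translate_cost=10_000,
                          interp_cost_per_exec=50,      # assumed
                          translated_cost_per_exec=2):  # assumed
    # Translation wins once: translate_cost + n*translated < n*interp
    return translate_cost / (interp_cost_per_exec - translated_cost_per_exec)
```

With these assumed costs, code must run a couple of hundred times before the translator amortizes - which is why low x86 reuse rates hurt.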

Crusoe summary:

Reliability, x86ness - good

Cost: good - 1/2 of Intel/AMD.

Power: good - 1/3 of Intel/AMD.

Perf: umm...

Crusoe faster than low power parts, but slower than 15W laptop parts.

Compute bound, often faster at lower watts. But there aren't that many compute bound workloads.

Memory/cache traffic: slower.

Low reuse: translation overhead -> slower

Crusoe has system gotchas:

PCI graphics, not AGP.

Software DMA, not overlapped.


How to do a small project:

Automate, automate, automate.

Reference simulator. Must be fast enough to boot an OS.

Fast VLIW simulator (for host): 30 inst/sim inst. (30 I/I)

Never published?

Narrowing: Reverse execution. Cosimulation, compare. "Nexus" binary search for first divergence. => Bit 17 in register 5 is wrong in this context ... I.e. nexus = automatic bug narrowing.
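The "nexus" idea - cosimulate against the reference and binary-search for the first instruction at which the two diverge - can be sketched as follows. The function names and interface are my own invention, not Transmeta's tooling:

```python
# Hypothetical nexus-style narrowing: run the reference simulator and the
# fast simulator to instruction count n, compare architectural state, and
# binary-search for the first count at which they diverge.
def first_divergence(ref_state_at, fast_state_at, max_insts):
    # Invariant: states agree at lo, differ by hi.
    lo, hi = 0, max_insts
    assert ref_state_at(hi) != fast_state_at(hi), "no divergence in range"
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ref_state_at(mid) == fast_state_at(mid):
            lo = mid
        else:
            hi = mid
    return hi   # first instruction count where the states differ
```

Each probe re-runs (or reverse-executes to) instruction `mid`, so log2(max_insts) probes pinpoint the failing instruction automatically.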

Testing:

conventional, hand written

Random. Biased to interesting cases.

"Test" means "checkable". Crash => failure. Consistency check => suspicious.

AFG Q: MP non-determinism.

AFG: one nice aspect of SW DMA would be reproducibility.

Reverse execution in the HW VLIW debugger was started, not finished.

Single step through nested fault handling. A complete debugger is totally transparent.

Fast builds. Check in early and often.

On failure: binary search of checkins.

War stories...

CMS SW allowed working around many hardware bugs.

Not all reg resources were shadowed. Some bugs due to rolling back after non-shadowed state changed. Added a rules checker to catch future bugs.

What Pardo would do differently:

Hardware was the bottleneck. Changing the ISA was *different* for the software teams.

Big projects rules of thumb do not apply to small teams.

Better perf studies from the get-go. (Threw out the first 2 CMSes.)

More software inspection.

--

CMS written in C, gcc extension to provide HW access. More control.

Modest amount of assembly - including a modest amount of self-modifying VLIW assembly code.

Not MP.

Blogging from ISCA: BIC: MATLAB, CUDA

This is just a placeholder. Must take more notes for Pardo.

Blogging from ISCA: AMAS-BT: Pardo, SMC

Dave Keppel, Pardo, Google: Self Modifying Code

Everyone knows SMC is dead, but SMC is alive in the very tools that complain about it: dynamic optimizers, dynamic linkers, etc.

Pardo talks fast.

Detecting SMC via page protection. Slow if data and code in same page.

BitBLT - recompile every 10K instructions.

Debugger watchpoints. Change immediates in code.

Present in real commercial workloads.

Coherency events:

x86 - none.

Hardware instructions: ISCP "something changed". iflush addr. coherency(base,length).
Poor match between application and simulator/emulator. Need to detect what really changed.

Adaptive: default-write protect. Change strategy if too many faults. Fall back to default after a while.
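That adaptive policy - default to write protection, switch strategy for a page once it takes too many write faults, and eventually fall back to the default - can be sketched as a small state machine. The threshold and strategy names are made up for illustration:

```python
# Hypothetical adaptive SMC-detection policy: write-protect translated
# pages by default; after too many write faults on a page, switch that
# page to a self-checking strategy (cheaper when writes are frequent).
FAULT_THRESHOLD = 8   # assumed tuning knob

class SmcPolicy:
    def __init__(self):
        self.faults = {}     # page -> write-fault count
        self.strategy = {}   # page -> "write_protect" | "self_check"

    def strategy_for(self, page):
        return self.strategy.get(page, "write_protect")

    def on_write_fault(self, page):
        self.faults[page] = self.faults.get(page, 0) + 1
        if self.faults[page] >= FAULT_THRESHOLD:
            self.strategy[page] = "self_check"   # too many faults: adapt

    def decay(self):
        # "Fall back to default after a while", as the note says.
        self.faults.clear()
        self.strategy.clear()
```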

Self checking strategy: check current ibytes against saved copy of original ibytes.
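The self-checking strategy is simple enough to sketch directly: save a copy of the original instruction bytes at translation time, and compare before (re)using the translation. The class shape is my own, not CMS's:

```python
# Hypothetical self-checking translation: keep the original instruction
# bytes; before running the translation, compare current memory against
# the saved copy, and retranslate (or revalidate) on mismatch.
class Translation:
    def __init__(self, memory, start, length):
        self.start, self.length = start, length
        self.saved_ibytes = bytes(memory[start:start + length])

    def still_valid(self, memory):
        return bytes(memory[self.start:self.start + self.length]) == self.saved_ibytes
```

Note that `still_valid` can become true again after the code is patched back - which is exactly the "revalidation" observation below.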

Pardo noted that invalidated code often reappears - things like debugger watchpoints may change back to the original. "Revalidation". Another use for invalid cache entries.

Shade. SPARC. Iflush addr. But, there were some applications that did not use iflush, but which worked on real hardware.

Transmeta Crusoe. Subpage write protection.

"Fetch imediates" - translate code, but fetch immediates that might have been patched.

Crusoe: lots of retranslations when falling through the gears.

Deoptimized translators: fetch immediates. Translation calls interpreter.

Bad: BT leads to more implementations, more chances of bugs, reduced test coverage.

Performance stability, lack of. Consistent sometimes better than fast.

SMC/ISC. Q: what does ISC stand for?

Hardware support:

Crusoe: 2 write protect bits per page. Subpage WP cache.

Shade: 100 instructions to translate an instruction.

Gill51 - universal simulator.

Blogging from ISCA: AMAS-BT keynote

I'll be at ISCA the next few days.

Today: AMAS-BT workshop.

Antonio G. keynote.

Unfortunately, Antonio G and I share the same initials, AG. I will annotate him as Antonio, and me as me, or AFG.

Pollack's Rule: I suppose I should be gratified that one of my laws, perf = sqrt(power), is now widespread. I am somewhat chagrined that my old boss, Fred Pollack, has his name associated with it. He publicized it in some keynotes.

Somehow perf=sqrt(power) has also crept in. And Antonio is multiplying the effects. This was not part of my, or Fred's, formulation. Perf=sqrt(area). We often assume that power, at least leakage, is 1:1 with area, which would imply perf=sqrt(power). But I do not think that this needs to be the case. Leakage may be negligible, and active power also seems to be proportional to sqrt(area). Or even less. This implies that perf=sqrt(active power), or less.

Antonio says that multicore => 1:1 perf increases. 2 cores => 2x parallelism. Q: is this correct? The old rule of thumb is that MP, too, perf=sqrt(#processors).
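The disagreement in the last two paragraphs comes down to quick arithmetic, which may be worth spelling out (illustrative only):

```python
import math

# Pollack/Glew rule: perf = sqrt(area).
area = 2.0                                   # resources relative to a baseline core
big_core_perf = math.sqrt(area)              # one 2x-area core: ~1.41x perf
multicore_ideal = area                       # Antonio: 2 cores -> 2x, assuming perfect parallelism
multicore_rule_of_thumb = math.sqrt(area)    # old MP rule of thumb: perf = sqrt(#processors)
```

Whether 2 cores buy you 2x or only sqrt(2)x depends entirely on how parallel the workload is - which is exactly the question raised above.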

Antonio models EPI = Vdd**2 * Cdyn + Leakage. Handwaves leakage. Says Vdd cannot be lowered. (Me: is this true? Differential signalling?) So argues about Cdyn.
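Antonio's energy-per-instruction model is worth writing down explicitly, since the quadratic Vdd term is what carries the argument (numbers below are made up):

```python
# EPI model from the keynote: EPI = Vdd^2 * Cdyn + leakage.
def epi(vdd, cdyn, leakage_per_inst):
    return vdd ** 2 * cdyn + leakage_per_inst

# Halving Vdd cuts the dynamic term 4x - which is why the claim
# that "Vdd cannot be lowered" matters so much here.
```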

Guest ISA / Host ISA. Me: although part of the story, the real challenge is not BT from ISA to ISA. The real challenges are (a) coming up with a host uarch and ISA that makes sense - that would make sense if compatibility was not a requirement. (b) Minimizing the cost of dynamic instrumentation and optimization.

I.e. the basic host ISA and uarch must make sense, irrespective of the guest ISA.

With an exception: possibly the host ISA is big, with lots of hint bits. The guest ISA may be small and compact. In this case, perhaps the act of binary translation itself helps. The guest ISA may be considered to be a compact form of the host ISA, with just the semantics. The host ISA may be considered to be an expanded form. The host ISA may be considered to be a cache of performance annotations to the guest ISA.

Antonio: memory checkpointing. AFG comment: easier to BT single threaded or message passing programs, harder shared memory.

Antonio: adapting hardware. Resizing, power gating. AFG: hardware can do similar adaptation. Software dynamic adaptation must have larger time constants.

Antonio: BT advantages include compatibility, both over time and across different microarchitectures (which he calls scalability). He notes that forward compatibility is especially interesting. E.g. old binaries taking advantage of new hardware features, like longer vector registers.

Pardo asked about soft real time workloads. Variability introduced by dynamic systems.