Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, June 20, 2009

Blogging from ISCA: AMAS-BT: Pardo, Crusoe

Transmeta Crusoe

VLIW 5 wide

Generic simulation support

Shadowed registers, commit/abort

Gated store buffer - 32 entries x 32 bytes.
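A minimal Python sketch of how shadowed registers and a gated store buffer together give a translated region all-or-nothing semantics: commit copies the working registers into the shadow copy and releases the buffered stores to memory, while abort discards both. The class and method names, the register count, and the one-value-per-entry buffer are assumptions for illustration; only the 32-entry figure comes from the talk (real entries were 32 bytes wide).

```python
class SpeculativeCore:
    """Toy model: shadowed registers plus a gated store buffer.

    Hypothetical names and sizes; each buffer entry holds one value here,
    whereas the talk describes 32 entries of 32 bytes each.
    """

    def __init__(self, num_regs=8, sb_entries=32):
        self.working = [0] * num_regs   # registers written by translated code
        self.shadow = [0] * num_regs    # last committed, x86-visible values
        self.memory = {}                # committed memory: addr -> value
        self.store_buffer = []          # gated (uncommitted) stores: (addr, value)
        self.sb_entries = sb_entries

    def write_reg(self, reg, value):
        self.working[reg] = value

    def store(self, addr, value):
        # The gate stays closed: stores wait in the buffer until commit.
        if len(self.store_buffer) >= self.sb_entries:
            raise RuntimeError("store buffer full; region must commit first")
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A load sees the youngest buffered store to the same address first.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self):
        # Make the region's register and memory effects architectural.
        self.shadow = list(self.working)
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()

    def abort(self):
        # Roll back: restore registers and drop uncommitted stores.
        self.working = list(self.shadow)
        self.store_buffer.clear()


core = SpeculativeCore()
core.write_reg(0, 5)
core.store(0x100, 7)
core.commit()
core.write_reg(0, 99)
core.store(0x100, 8)
print(core.load(0x100))                     # 8: the region sees its own store
core.abort()
print(core.working[0], core.memory[0x100])  # 5 7: both rolled back
```

The point of the gate is that a fault in the middle of a region never leaves half of its stores visible; the runtime can simply abort and re-execute more conservatively.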

load-and-protect - like ALAT (hw trap)
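A sketch of the load-and-protect idea, assuming protection is tracked as byte ranges and that an overlapping store raises a trap which the runtime handles by rolling back and retranslating without the risky reordering. ProtectionTable, store_checked and AliasTrap are hypothetical names, not the Crusoe ISA, and memory is a toy dict keyed by address.

```python
class AliasTrap(Exception):
    """Raised when a store overlaps a range marked by load-and-protect."""


class ProtectionTable:
    """Toy load-and-protect tracker (hypothetical names and interface).

    A load hoisted above possibly-aliasing stores records its byte range;
    a later checked store that overlaps any recorded range traps.
    """

    def __init__(self):
        self.ranges = []  # half-open [start, end) byte ranges

    def load_and_protect(self, memory, addr, size):
        self.ranges.append((addr, addr + size))
        return memory.get(addr, 0)

    def store_checked(self, memory, addr, size, value):
        end = addr + size
        for start, stop in self.ranges:
            if addr < stop and start < end:  # the two ranges overlap
                raise AliasTrap(
                    f"store [{addr:#x}, {end:#x}) hits protected [{start:#x}, {stop:#x})"
                )
        memory[addr] = value

    def clear(self):
        # Assumed to be called at commit: protection only spans one region.
        self.ranges.clear()


mem = {0x10: 1}
pt = ProtectionTable()
print(pt.load_and_protect(mem, 0x10, 4))  # 1
pt.store_checked(mem, 0x20, 4, 9)         # no overlap, store proceeds
try:
    pt.store_checked(mem, 0x12, 2, 5)     # overlaps [0x10, 0x14)
except AliasTrap as trap:
    print("trap:", trap)
```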

x86 condition code support.

PC support - low memory steering I/O vs. DRAM. A20M. Crusoe had hardware for memory map.
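A simplified sketch of A20M masking plus low-memory steering. With A20M asserted, physical address bit 20 is forced to 0, which reproduces the 8086's wraparound at 1 MB. The steering part assumes a single legacy window, the VGA frame buffer at 0xA0000-0xBFFFF, goes to I/O and everything else to DRAM; a real PC memory map has more regions, so route() is illustrative only.

```python
A20_BIT = 1 << 20
# Simplification (assumption): only the legacy VGA frame buffer window is I/O.
LEGACY_IO_WINDOW = range(0xA0000, 0xC0000)


def route(addr, a20m):
    """Return (effective_address, target) for a physical address.

    Hypothetical helper: A20M clears address bit 20, then the address is
    steered to "io" or "dram" using the simplified map above.
    """
    if a20m:
        addr &= ~A20_BIT
    target = "io" if addr in LEGACY_IO_WINDOW else "dram"
    return addr, target


addr, target = route(0x1B8000, a20m=True)
print(hex(addr), target)  # 0xb8000 io
```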

Crusoe: x86 ISA support, entirely by software decode.
Pardo argues that x86 hardware decode is big and power hungry.

LongRun, voltage scaling. E.g. leave CPU at 90%, mem at 100%
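Why running the CPU a little slower can pay off: a first-order estimate, assuming dynamic power dominates (activity factor \alpha, switched capacitance C) and that supply voltage can be lowered roughly in proportion to frequency. Neither assumption nor the resulting number comes from the talk.

```latex
P_{\text{dyn}} = \alpha\, C\, V^{2} f, \qquad V \approx k f
\;\Rightarrow\; P_{\text{dyn}} \propto f^{3}

\frac{P_{\text{dyn}}(0.9 f)}{P_{\text{dyn}}(f)} \approx 0.9^{3} = 0.729
```

So, under these assumptions, a 10% clock reduction buys roughly a 27% cut in CPU dynamic power, and a memory-bound workload may lose little performance because memory stays at full speed.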

Shade: 100 inst/sim inst. Perf 3:1 int, 1:1 FP.

Crusoe translation: 10,000 inst/inst.

Scheduling for the VLIW target is harder than for Shade's RISC (SPARC) target.

x86 reuse rates low.
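A back-of-the-envelope amortization model for why low reuse hurts: if one translation costing T host instructions is reused N times, and translated code then costs c host instructions per x86 instruction, the average cost is T/N + c. T = 10,000 is the figure from the talk; c = 1 is a hypothetical placeholder, and the function names are made up for illustration.

```python
def amortized_cost(translate_cost, exec_cost, reuse):
    """Average host instructions per x86 instruction execution when one
    translation is reused `reuse` times: T/N + c."""
    return translate_cost / reuse + exec_cost


def breakeven_reuse(translate_cost, exec_cost, max_overhead_pct):
    """Smallest reuse count N with T/N <= (max_overhead_pct/100) * c,
    using integer ceiling division to avoid float rounding."""
    numerator = translate_cost * 100
    denominator = max_overhead_pct * exec_cost
    return (numerator + denominator - 1) // denominator


TRANSLATE_COST = 10_000  # host inst per x86 inst, from the talk notes
EXEC_COST = 1            # hypothetical: host inst per x86 inst once translated

print(amortized_cost(TRANSLATE_COST, EXEC_COST, 10))   # 1001.0
print(breakeven_reuse(TRANSLATE_COST, EXEC_COST, 10))  # 100000
```

With these numbers, code executed only 10 times costs about 1001 host instructions per x86 instruction, and keeping translation overhead under 10% needs each translated instruction to run about 100,000 times.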

Crusoe summary:

Reliability, x86ness - good

Cost: good, 1/2 of Intel/AMD

Power: good, 1/3 of Intel/AMD

Perf: umm...

Crusoe faster than low power parts, but slower than 15W laptop parts.

Compute bound: often faster at lower watts. But there aren't that many compute-bound workloads.

Memory/cache traffic: slower.

Low reuse: translation overhead -> slower

Crusoe has system gotchas:

PCI graphics, not AGP.

Software DMA, not overlapped.


How to do a small project:

Automate, automate, automate.

Reference simulator. Must be fast enough to boot an OS.

Fast VLIW simulator (for host): 30 inst/sim inst. (30 I/I)

Never published?

Narrowing: reverse execution. Cosimulation, compare. "Nexus": binary search for the first divergence => "Bit 17 in register 5 is wrong in this context" ... I.e. Nexus = automatic bug narrowing.
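A sketch of Nexus-style narrowing, assuming two deterministic simulators that can each report architectural state after n steps (for example by replaying from a checkpoint) and that a divergence, once it appears, persists. first_divergence and diff_bits are hypothetical names; the toy states are made up to mirror the "bit 17 in register 5" example.

```python
def first_divergence(ref_state_at, dut_state_at, max_steps):
    """Binary-search the first step at which two simulators disagree.

    ref_state_at(n) / dut_state_at(n) return architectural state after n
    steps (hypothetical callbacks). Assumes divergence is sticky: once the
    states differ, they keep differing. Returns None if they still match
    at max_steps.
    """
    if ref_state_at(max_steps) == dut_state_at(max_steps):
        return None
    if ref_state_at(0) != dut_state_at(0):
        return 0
    lo, hi = 0, max_steps  # invariant: equal at lo, different at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ref_state_at(mid) == dut_state_at(mid):
            lo = mid
        else:
            hi = mid
    return hi


def diff_bits(ref_regs, dut_regs):
    """List every (register, bit) pair that differs between two states."""
    diffs = []
    for reg, (a, b) in enumerate(zip(ref_regs, dut_regs)):
        x, bit = a ^ b, 0
        while x:
            if x & 1:
                diffs.append((reg, bit))
            x >>= 1
            bit += 1
    return diffs


# Toy states: six registers; the DUT flips bit 17 of register 5 at step 37.
def ref(n):
    return (n, 2 * n, 0, 0, 0, 0)


def dut(n):
    return (n, 2 * n, 0, 0, 0, (1 << 17) if n >= 37 else 0)


step = first_divergence(ref, dut, 100)
print(step, diff_bits(ref(step), dut(step)))  # 37 [(5, 17)]
```

The search costs O(log max_steps) state comparisons, which is what makes automatic narrowing practical when each comparison means re-running a simulator.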

Testing:

conventional, hand written

Random. Biased to interesting cases.

"Test" means "checkable". Crash => failure. Consistency check => suspicious.

AFG Q: MP non-determinism.

AFG: one nice aspect of SW DMA would be reproducibility.
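A minimal sketch of random testing biased toward interesting cases, where every test is checkable against a reference model and a crash counts as a failure. The 32-bit add-with-carry target, the edge-value list and the 50% bias are assumptions for illustration. Seeding the generator makes a failing run reproducible, at least for single-threaded tests like this one.

```python
import random

# Hypothetical edge values for a 32-bit adder: zero, one, sign boundary, all ones.
EDGE_VALUES = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFE, 0xFFFFFFFF]


def biased_operand(rng, edge_prob=0.5):
    """Pick an edge value with probability edge_prob, else a uniform 32-bit value."""
    if rng.random() < edge_prob:
        return rng.choice(EDGE_VALUES)
    return rng.getrandbits(32)


def ref_add32(a, b):
    """Reference model: 32-bit add returning (result, carry_flag)."""
    total = a + b
    return total & 0xFFFFFFFF, int(total > 0xFFFFFFFF)


def run_random_tests(dut_add32, n, seed=0):
    """Return failing (a, b) pairs; a crash in the DUT counts as a failure."""
    rng = random.Random(seed)  # fixed seed: failures are reproducible
    failures = []
    for _ in range(n):
        a, b = biased_operand(rng), biased_operand(rng)
        try:
            ok = dut_add32(a, b) == ref_add32(a, b)
        except Exception:
            ok = False
        if not ok:
            failures.append((a, b))
    return failures


def buggy_add32(a, b):
    # Deliberate bug for the demo: the carry flag is never set.
    return (a + b) & 0xFFFFFFFF, 0


print(len(run_random_tests(ref_add32, 1000)))        # 0
print(len(run_random_tests(buggy_add32, 1000)) > 0)  # True
```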

Reverse HW execution in the HW VLIW debugger: started, not finished.

Single step through nested fault handling. A complete debugger is totally transparent.

Fast builds. Check in early and often.

On failure: binary search of checkins.

War stories...

CMS SW allowed working around many hardware bugs.

Not all register resources were shadowed. Some bugs were due to rolling back after non-shadowed state had changed. Added a rules checker to catch future bugs.
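A sketch of the kind of rules checker described here, under a simplified model: each op in a translated region lists the resources it writes and whether it can fault (and so trigger a rollback). The rule flags any write to a non-shadowed resource that a later may-abort op could roll back across before the next commit. Op, check_region and all resource names are hypothetical.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    writes: frozenset        # resources this op writes
    may_abort: bool = False  # can fault and force a rollback


def check_region(ops, shadowed):
    """Flag non-shadowed writes that a rollback could strand.

    Returns (write_index, resource, abort_index) for each write to a
    resource outside `shadowed` that is followed by a may-abort op before
    the next commit: rolling back would restore shadowed state but leave
    that resource modified.
    """
    violations = []
    pending = []  # (index, resource) of uncommitted non-shadowed writes
    for i, op in enumerate(ops):
        if op.name == "commit":
            pending.clear()
            continue
        if op.may_abort:
            violations.extend((w, res, i) for w, res in pending)
            pending.clear()  # report each offending write once
        pending.extend((i, res) for res in sorted(op.writes) if res not in shadowed)
    return violations


# Hypothetical resource names.
SHADOWED = {"eax", "ebx", "flags"}
bad = [
    Op("add", frozenset({"eax", "flags"})),
    Op("out_port", frozenset({"io_latch"})),         # not shadowed
    Op("load", frozenset({"ebx"}), may_abort=True),  # may fault -> rollback
    Op("commit", frozenset()),
]
good = [bad[0], bad[2], bad[1], bad[3]]  # non-shadowed write after last abort point

print(check_region(bad, SHADOWED))   # [(1, 'io_latch', 2)]
print(check_region(good, SHADOWED))  # []
```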

What Pardo would do differently:

Hardware was bottleneck. Changing ISA was *different* for software teams.

Big-project rules of thumb do not apply to small teams.

Better perf studies from the get-go. (Threw out the first 2 versions of CMS.)

More software inspection.

--

CMS written in C, with gcc extensions to provide HW access. More control.

Modest amount of assembly. Modest amount. Including modest amount of self modifying VLIW assembly code.

Not MP.
