Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, January 27, 2011

Processor redundancy: FRC/TMR/QMR RAS

http://semipublic.comp-arch.net/wiki/Processor_redundancy

---

[[FRC]] - [[failure redundant computation]] (Intel's own documentation expands the acronym as "functional redundancy checking"). A fairly generic term.
However, at Intel prior to 1990 or so it often referred to [[master-checker]] pairs
- microprocessors wired together at the pins,
one chip driving the pins,
the other comparing what it would drive, were it not configured in FRC checker mode,
to what is actually being driven.
If a difference is detected, an error is asserted.
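
As a rough illustration of the master-checker comparison described above, here is a minimal Python sketch. It is a software model only, not how the hardware is built: the `Cpu` type, `run_frc_pair`, `good_cpu`, and `faulty_cpu` are hypothetical names invented for this example, and the "pins" are reduced to a single integer per step.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical model of a processor: bus input -> value it drives on the pins.
Cpu = Callable[[int], int]


def run_frc_pair(master: Cpu, checker: Cpu,
                 inputs: Iterable[int]) -> List[Tuple[int, bool]]:
    """Return one (value_on_pins, frcerr) tuple per input."""
    trace = []
    for x in inputs:
        driven = master(x)        # only the master actually drives the pins
        would_drive = checker(x)  # the checker computes what it would have driven
        trace.append((driven, driven != would_drive))
    return trace


def good_cpu(x: int) -> int:
    return (3 * x + 1) & 0xFF


def faulty_cpu(x: int) -> int:
    # Assumed fault for illustration: the low result bit flips for input 2.
    return (good_cpu(x) ^ 1) if x == 2 else good_cpu(x)


if __name__ == "__main__":
    for value, frcerr in run_frc_pair(good_cpu, faulty_cpu, range(4)):
        print(value, "FRCERR" if frcerr else "ok")
```

Note that the comparison only reports that the two chips disagree. It cannot say which of the two is wrong, so the error looks the same whether the fault is in the master or in the checker.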

[[TMR]] - [[three module redundancy]] (more commonly called triple modular redundancy) - three processors, voting to choose the outcome.
The loser, i.e. the processor that is outvoted, may be deactivated, failing down to [[FRC]].
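
The voting step can be sketched in a few lines of Python. This is only a model under assumptions: `tmr_vote` is a hypothetical helper, outputs are plain integers, and what to do when all three disagree is a policy choice that the text above does not specify (here it simply raises).

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def tmr_vote(outputs: Sequence[int]) -> Tuple[int, Optional[int]]:
    """Majority-vote three outputs.

    Returns (winning_value, loser_index), where loser_index is None
    when all three modules agree.
    """
    if len(outputs) != 3:
        raise ValueError("TMR needs exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        # Exactly one module was outvoted; it is the candidate for deactivation.
        loser = next(i for i, out in enumerate(outputs) if out != value)
        return value, loser
    # Assumed policy: with three different answers there is no majority.
    raise RuntimeError("no majority among the three modules")
```

Once a loser has been deactivated, the two remaining modules can no longer outvote anything. They can only be compared, which is the FRC situation.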

[[QMR]] - [[quad module redundancy]] - usually 2 [[FRC]] pairs. NOT a voting scheme.
One [[master-checker]] pair is designated active, and its outputs are actually used.
The other [[master-checker]] pair is designated inactive.
If the active pair exhibits a difference between its master and its checker,
it is failed, and the other pair continues the computation.

The inactive pair follows the computation so that its state will be "hot".
It probably makes sense to compare the inactive pair's results to the active pair's results,
although if the two pairs differ from each other
while neither pair detects a difference internally, it is not clear which pair can be trusted.
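
The selection between the two pairs can be modelled as below. This is a hedged sketch, not a description of any real part: `PairOutput`, `qmr_select`, and the status strings are hypothetical, and the handling of a cross-pair mismatch (keep using the active pair but flag it) is an assumption made for the example, since the text above leaves that case open.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PairOutput:
    value: int     # what the pair's master drives
    frcerr: bool   # True if the pair's checker disagreed with its master


def qmr_select(active: PairOutput,
               inactive: PairOutput) -> Tuple[Optional[int], str]:
    """Pick the output to use from two master-checker pairs (no voting)."""
    if not active.frcerr:
        if not inactive.frcerr and inactive.value != active.value:
            # Both pairs are internally consistent yet disagree with each other.
            # Assumed policy: keep the active value, but report the mismatch.
            return active.value, "cross-pair mismatch"
        return active.value, "ok"
    if not inactive.frcerr:
        # The active pair is failed; the hot inactive pair takes over.
        return inactive.value, "failover"
    return None, "both pairs failed"
```

For example, `qmr_select(PairOutput(7, True), PairOutput(7, False))` returns `(7, "failover")`: the active pair reported an internal error, so the inactive pair's output is used instead.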

---

[[TMR]] and other voting schemes require somewhat challenging external voting logic.

[[FRC]] [[master-checker]] pairs require much less external logic: most logic is within the CPU chip.

[[QMR]] is built out of [[FRC]] [[master-checker]] pairs.
It requires no voting logic.
The comparison logic is within the chip, as in [[FRC]] pairs.
You might imagine needing external logic to select which pair's outputs should be used;
however, this may not be necessary
if you trust the internal logic of an FRC pair to disable its outputs.
I.e. if asserting FRCERR from a checker can reliably disable the master's outputs, then no external logic may be needed.

However, such multiple-drivers-per-signal configurations are now deprecated (circa 2010),
so external muxes may be necessary.

---

Above we have been talking about FRC/TMR/QMR between chips.
However, these techniques can be applied to any logic block, potentially within the same chip
(although then a chip failure might corrupt all of the redundant copies).

Similarly, we have talked about doing FRC/TMR/QMR RAS for processors,
but these techniques can be applied to non-processor logic.
