The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, April 01, 2011

Reset: Hard, Soft, Cold, Warm

= The need for RESET after REST at power-on =

In the beginning there was [[RESET]]: a signal asserted while powering on, deasserted when power and logic levels were stable, so that the circuit could be initialized.

Soon thereafter, or even before, there was POWEROK:
* !POWEROK and RESET => do not attempt operation
* POWEROK and RESET => power good, now do initialization
* POWEROK and !RESET => initialization done, running.
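
As a minimal sketch of the three cases above, the two signals can be decoded into a state. The names `PowerState` and `decode` are made up for illustration, and the handling of the unlisted fourth combination (!POWEROK and !RESET) is an assumption, not something the list specifies.

```python
from enum import Enum


class PowerState(Enum):
    DO_NOT_OPERATE = "power not good: do not attempt operation"
    INITIALIZING = "power good, reset asserted: initialize"
    RUNNING = "power good, reset released: running"


def decode(power_ok: bool, reset: bool) -> PowerState:
    """Map the (POWEROK, RESET) pair onto the three cases listed above."""
    if not power_ok:
        # The list only covers !POWEROK with RESET asserted. Treating
        # !POWEROK with RESET deasserted the same way is an assumption.
        return PowerState.DO_NOT_OPERATE
    return PowerState.INITIALIZING if reset else PowerState.RUNNING


# A typical power-on sequence: power ramps up, then reset is released.
sequence = [(False, True), (True, True), (True, False)]
states = [decode(p, r) for p, r in sequence]
```

Walking `sequence` through `decode` visits the three states in the order the list gives them.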

But let us not get obsessed by the details of signalling: I'm okay, you're okay, I've initialized and you are ready.
Moving forward: ...

Eventually people realized that they wanted to be able to restore the state of the system to that right after RESET, without going through the full power on sequence.
Hence the concept of [[soft reset]] was born.
On Intel x86 systems the INIT pin or message can be considered to be approximately a [[soft reset]],
with the old reset and power on constituting [[hard reset]].

But [[soft reset]]s like INIT cannot recover from all errors. Sometimes a computer is truly hung, and a [[power cycle]] is necessary.
Or, if not a power cycle, an assertion of the RESET signal
so that all of the rest of the system is initialized?

Did I mention that, perhaps before [[soft reset]], people would build circuits that could interrupt power with a relay, and thus provoke a power on [[hard reset]] under software control?

Trouble is, after a [[hard reset]] system state might be unreliable, or might be initialized to a true reset state, such as all zeros.
How can you then distinguish a true [[power on reset]] from a [[hard reset]] invoked under software control,
perhaps to recover from a system hang?

What you need is a softer hard reset - a [[warm reset]] that acts like a [[hard reset]], asserting the RESET signal to the rest of the system, I/O devices et al,
so that the state of the machine is as close as possible to that after a true power on [[cold reset]].
But where power persists across the [[warm reset]],
so that at least certain status registers can be reliably read.

At first the state that persisted across such a [[warm reset]] might reside only in a battery backed up unit near to the reset state machine.
But when the amount of state that you want to persist grows large,
e.g. the [[MCA (Machine Check Architecture)]] error status log registers,
state inside the [[CPU]] may be allowed, indeed, required, to persist across such a [[warm reset]].
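
A toy model may make the distinction concrete. This is only a sketch under stated assumptions: the `Machine` class, its `reset_cause` field, and the logged error string are all hypothetical, and real hardware keeps such sticky state in dedicated registers rather than Python attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    # Ordinary state: lost on every hard-style reset, warm or cold.
    scratch: int = 0
    # Sticky state: survives a warm reset because power persists.
    reset_cause: str = "power_on"
    error_log: list = field(default_factory=list)

    def log_error(self, msg: str) -> None:
        self.error_log.append(msg)

    def cold_reset(self) -> None:
        """Power was lost: nothing survives, not even the sticky state."""
        self.scratch = 0
        self.reset_cause = "power_on"
        self.error_log = []

    def warm_reset(self) -> None:
        """RESET is asserted but power persists: sticky state is kept."""
        self.scratch = 0
        self.reset_cause = "warm"


m = Machine()
m.scratch = 42
m.log_error("hypothetical machine check on bank 3")
m.warm_reset()
after_warm = (m.scratch, m.reset_cause, list(m.error_log))
m.cold_reset()
after_cold = (m.scratch, m.reset_cause, list(m.error_log))
```

After the warm reset, boot code can read `reset_cause` and the error log to learn why it is starting again; after the cold reset both look exactly like a first power on.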

= A possible sequence of progressively harder RESETs for error recovery =

I am sure that you can see that [[cold reset]] versus [[warm reset]] and [[hard reset]] versus [[soft reset]] are not necessarily discrete points but,
as usual, may be points on a spectrum,
a not necessarily 1D ordered list of possible reset mechanisms.

If an error is detected,
e.g. if a processor stops responding
* first you might try sending it a normal interrupt, of progressively higher priority
* then you might try sending in an [[NMI (Non-Maskable Interrupt)]]
** then any of the flavors of [[even less maskable non-maskable interrupt]]s, such as Intel's [[SMI (System Management Interrupt)]], some sort of [[VMM]] or [[hypervisor]] interrupt
* then you might try sending a [[soft reset]] message like INIT to the apparently hung processor
* this failing, you may try to do a [[warm reset]] of the hung processor
** although by this time you probably want to reset all of the processors and I/O that are tightly bound to the hung processor
** exactly how you define such a "reset domain" is system specific, although it is often the same as a [[shared memory cache coherency domain]].
* failing this, you might try to use a [[hard reset]] under the control of an external circuit, e.g. at the power supply, that can trip a relay and then untrip it after a time has elapsed
** heck, this can be done at power supply points increasingly distant: inside the PC or blade, at the rack, in the datacenter...
* all of these failing, you can try to notify the user, although by this point that probably is impossible; or you may rely on some external mechanism, such as a user or a watchdog, to try to use ever more extreme forms of resetting the system.

The above tends to imply that there is a linear order of reset mechanisms. This is not necessarily true. You may reset subsystems in heuristic order.
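
The escalation list above can be sketched as a simple loop over recovery steps. This deliberately uses the linear order, which is only one possible policy, as just noted. The step names mirror the list; `escalate`, `make_ladder`, and the `hang_severity` knob are invented for the demo and stand in for real hardware actions.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (name, action). An action returns True if the processor
# responded afterwards. All names here are illustrative placeholders.
Step = Tuple[str, Callable[[], bool]]


def escalate(steps: List[Step]) -> Optional[str]:
    """Try each recovery step in order; return the first one that works."""
    for name, action in steps:
        if action():
            return name
    return None  # everything failed: fall back to a user or external watchdog


def make_ladder(hang_severity: int) -> List[Step]:
    """Build a ladder where step i succeeds only if it is at least as
    drastic as the hang. hang_severity is a made-up knob for this demo."""
    names = [
        "interrupt",
        "NMI",
        "SMI",
        "soft reset (INIT)",
        "warm reset",
        "hard reset (power relay)",
    ]
    return [(n, (lambda i=i: i >= hang_severity)) for i, n in enumerate(names)]


recovered_by = escalate(make_ladder(hang_severity=4))
unrecoverable = escalate(make_ladder(hang_severity=99))
```

A heuristic, non-linear policy would replace the fixed `names` list with whatever ordering or subsystem selection the system designer prefers; the loop itself stays the same.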

= RESET is a splitting concept =

I.e. overall [[RESET]] is one of those concepts that inevitably split,
whenever you look too closely at it.
Of course, create no more flavors of reset than you need, because each brings complexity.
But inevitably, if your system is successful and lasts for several years,
you will need one or two more flavors of RESET than were originally anticipated.

= Similar =

[[Watchdog timer]]s are a concept that splits similarly to RESET.
Indeed, each level of progressive reset for error recovery of a hung system is often driven
by a new level of [[watchdog timer]].
Or [[sanity timer]].

(Or [[neurosis timer]], or [[psychosis timer]] ... no, I am writing this on April Fool's.)
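
More seriously, here is a rough sketch of how levels of watchdog timer might drive levels of reset. Time is counted in abstract ticks, and the class name `StagedWatchdog`, the thresholds, and the action names are all assumptions chosen for illustration. The model only records which stage fired; it does not simulate the reset actually recovering anything.

```python
class StagedWatchdog:
    """Watchdog with several timeouts, each tied to a harder reset."""

    def __init__(self, stages):
        # stages: list of (ticks_without_kick, action_name) pairs.
        self.stages = sorted(stages)
        self.ticks = 0
        self.fired = []

    def kick(self) -> None:
        """Healthy software calls this periodically to show it is alive."""
        self.ticks = 0

    def tick(self) -> None:
        """Advance time by one tick and fire any stage whose timeout is hit."""
        self.ticks += 1
        for limit, action in self.stages:
            if self.ticks == limit:
                self.fired.append(action)


wd = StagedWatchdog([(3, "NMI"), (6, "soft reset"), (9, "warm reset")])
for _ in range(2):
    wd.tick()
wd.kick()            # software is alive: counter restarts, nothing fired
for _ in range(7):   # software hangs: no more kicks
    wd.tick()
```

In this run the first two stages fire during the hang, and the third would fire two ticks later if nothing kicked the watchdog in the meantime.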