Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, February 21, 2011

Pattern History Invariant Branch Prediction

    Taking a vacation day - actually, a so-called "floating holiday" on today, USA President's Day - so that I could be with my daughter, who is off school, and so that I could pick up my wife at the airport.


https://semipublic.comp-arch.net/wiki/Pattern_History_Invariant_Branch_Prediction
{{Terminology Term}}
[[Category:Branch Prediction]]

= Terminology =

Many branch predictors have a property that I call [[Pattern History Invariant Branch Prediction]].
By this I mean that the cannot change their prediction for a branch, e.g. from taken to not-taken, or vice versa, without having an intervening misprediction.

    Okay, I admit: "Pattern History Invariant" is a bad name. It is pompous, and not very intuitive. I welcome proposal for a different name. Something like "Branch predictor that can't change it's mind without a misprediction"?
      == Bad Joke == Oooh, how about "Can't change its mind without a fight"? I'm tempted to call it a Scottish, Irish, or a "My Wife" branch predictor - but I'll get in trouble for any and all of these, and only mention them because I am humor impaired.

= Why does this matter? =

== [[Timing Invariant Branch Prediction]] ==

Pattern history invariance can be a useful property, because it makes it easier to build
[[Timing Invariant Branch Prediction]].
Which has several good properties, such as making val

In particular, it means that delaying updating the pattern tables used to make predictions until retirement will NOT result in any more mispredictions than would otherwise occur
- any such mispredictions would already be on the wrong path, younger than a misprediction that would occur no matter what.

Delaying the pattern table update from, e.g. branch execution to retirement has the possible benefit that [[wrong path]] execution will not corrupt the branch predictor.

However, it may be pointed out that it is possible that there will be a performance loss because of other benefits of execution on the multiply speculative wrong path that speculatively updating a branch pattern history table may bring.
But to get this, you must have several nested effects:
* there must be a first branch misprediction pending
* a younger branch may update the predictor tables
* this may result in a misprediction underneath the first branch misprediction
* which may itself update the predictor tables
** possibly eliminating a misprediction when execution resumes after the first branch misprediction is repaired
** possibly producing instruction prefetch and other [[good wrong path effects]]

I.e. with [[Pattern History Invariant Branch Prediction]] there must be a misprediction to result in a change of prediction.
However, that misprediction may itself occur, and be repaired, speculatively.
And that misprediction may have other [[good wrong path effects]].

== [[BTB Unrolling]] ==

[[Pattern History Invariant Branch Prediction]]
also enables [[BTB Unrolling]]:
give a current (IP,history) tuple
and a separate pattern table used to make a prediction
one can "unroll" the predictor to produce a trace
of multiple, N, branch (from,to) addresses.
And such a trace unrolling is guaranteed to be as accurate as if the predictor were accessed one branch at a time.

= Examples of Pattern History Invariance =

= Examples of non-Pattern History Invariance =

I was tempted to say
    The most powerful example of a non-pattern history invariant branch predictor is a [[loop predictor]]. E.g. it may predict that the loop closing branch is taken 99 times in a row, and then not-taken the 100th time. Even more advanced loop predictors, such as predicting that if the last few loops were executed were 67,66,65, times, then the next will be executed 64 times, are even less pattern history invariant.
But this is not true: such a loop predictor will always make the same prediction, if the current loop count
is considered to be part of the history that used, along with the branch in question, to index
the pattern tables.

It is necessary to work ahrder to contrive branch predictors that are not pattern history invariant.

TBD: I believe that certain McFarling-style branch predictors may break the pattern history invariance property.
TBD: get the details right.

In particular, in a multi-predictor system such as McFarling,
if you update the predictors and choosers that are NOT being used as well as the predictors and choosers that are being used,
you can "mask" a misprediction in a second predictor,
if a first predictor was being used.
If the choice of which predictor should be used can then change in the pattern table so that
when the same (IP,history) pair is used at a later time the second predictor wins...



= See Also =

* [[Branch predictor state update]]
* [[Timing Invariant Branch Prediction]]

Timing Invariant Branch Prediction

http://semipublic.comp-arch.net/wiki/Timing_Invariant_Branch_Prediction
{{Terminology Term}}
[[Category:Branch Prediction]]

By [[Timing Invariant Branch Prediction]] I mean branch prediction that is independent of timing.

E.g. a branch predictor design that, if you slowed down the clock or changed the pipeline, e.g. by inserting idle pipestages,
would not vary.

Such invariance is very convenient
(1) for validation
(2) for separation of concerns - it allows you to change the pipeline without worrying about its effect on branch prediction accuracy, etc.

However, many microarchitectures do not adhere to this design principle.

Tweaks and adjustments to the branch prediction microarchitecture may be necessary to attain [[Timing Invariant Branch Prediction]].
For example
* Per-branch history often leads to timing dependent branch prediction, which can be remedied by [[path history]]
* Updating the branch history tables can lead to timing dependent branch prediction

= Per-branch history often leads to timing dependent branch prediction =

For example, if you are using a [[per-branch history]] such as a [[TNT branch history]],
in many pipelines there are several clocks between instruction fetch,
i.e. delivering an instruction pointer to the instruction cache and branch predictor,
and instruction decode.
This means that an instruction fetch/branch prediction time you do not know where the branch instructions are;
you only know where the branch instructions are at decode time.

Therefore, unless you stall,
branch prediction may be using a [[per-branch history]] that is several cycles out of date,
and which may miss several branches
from what it would be ideally using if branches could be identified instantaneously.
Moreover, how many branches are missing may depend on timing, such as [[I$]] misses, pipeline stalls, etc.

== Path history enables timing independent branch prediction ==

If it is a problem to have timing dependent branch prediction
caused by per-branch history, this can be assuaged by [[path history]] branch prediction.

Instruction fetch does not necessarily know where the branches are. However, it necessarily does know the sequence of instruction fetch addresses.
If it is possible to create a path history suitable for use in a branch predictor,
e.g. by XORing the instruction fetch pointers,
then this [[path history]] is accurate and timing invariant.
Since XOR hashing is fast, this probably can be acheived.

However, XORing all fetch IPs may not be the best [[path history]] to use in branch prediction.
Creating a hash, probably XOR based, of [[branch from/to addresses]],
suffices to describe the path - although it losses information about branches between instruction fetch blocks.
Hashing such a [[branch from/to addresses]] [[path history]] with the current instruction fetch pointer
is about as good as you can do at instruction fetch,
without identifying individual instructions.

== Combining ... ==

[[Path history]] based branch predictors are usually reported as being more accurate than [[per-branch history TNT]],
so the pipeline adjustments above may help performance as well as providing [[Timing Invariant Branch Prediction]].

However, if they do not, you can obtain a hybrid that provides many of the benefits of [[timing invariant branch prediction]]
along with the possible improved accuracy of [[per-branch history]]:
* use [[path history]] at instruction fetch
* use [[per-branch history]] at the decoder, in a form of [[late-pipestage branch prediction]]

This gives you timing invariance,
but it also gives the [[decoder branch predictor]] the chance to make corrections to the earlier branch prediction.

= Branch History Update Time =

Q: when should the prediction tables, the [[pattern history table (PHT)]], also sometimes called the [[branch histogram]] or [[branch history table (BHT)]],
be updated?

At execution time, or at retirement time.

Updating at retirement time enables [[Timing Invariant Branch Prediction]]

Updating at execution time may cause only minor issues on an in-order machine.
On an out-of-order machine, however, updates may be done out of order.
In either case they may be done speculatively,
for branches that will not actually be retired because of earlier misspeculations.

Furthermore there arises the question of what history or [[stew]] is used to update the pattern history table.
If every branch at execution carries its history or stew with it, no problem.
But if a big complicated history is maintained only at retirement,
some processor designs have updated the pattern table for branches at execution
with a history corresponding to a position in the instruction stream several cycles before the branch.
Not necessarily a consistent number of cycles, either.

Updating the prediction tables a tretirement time seems to avoid these issues.

TBD: performance cost

TBD: latencies of table update - immaterial if [[pattern history invariant branch prediction]].

= Conclusion =

Is [[Timing Invariant Branch Prediction]] an absolutely vital design goal?

Not necessarily - if performance is increased by timing variant branch prediction, so be it.

However, [[Timing Invariant Branch Prediction]] is definitely a nice thing to have:
* it makes validation much easier
* it makes the design more robust, less fragile, less likely to break if you have to add a pipestage or stall late in the design cycle.
* it is usually associated with higher performance rather than lower performance branch prediction algorithms
* it can usually be achieved by a hybrid predictor design.

It is my experience that most timing dependent branch predictors
happened by accident, rather than by design:
* naive designers building a [[per-branch history]] out of a textbook
* naive extension of in-order designs to out-of-order exacerbating unnecessary timing dependence
etc.

[[Timing Invariant Branch Prediction]] is not necessarily a must-have,
but it is always a good thing to keep in mind when designing
your branch predictor and your microarchitecture/pipeline.
It is a pity to lose its benefits due to ignorance rather than deliberation.

Friday, February 18, 2011

CSE (Common Subexpressions) in HW andSW

http://semipublic.comp-arch.net/wiki/CSE_(common_sub-expression)

CSE is an optimization that recognizes common sub-expressions, and saves them and reuses them rather than recalculating them.

E.g.

var1 := (a*b) + c
var2 := (a*b) + d;

can be rewritten as

tmp := a*b
var1 := tmp + c
var2 := tmp + d

This is obvious, right?

It is less obvious when you have, e.g. complex addressing modes such as (base+index*scale+offset).
Should you CSE part of address computation?

Sometimes it is cheaper to recalculate than it is to store, allocating a register that may spill other registers.

[[Memoization]] is a form of hardware CSE, where an expensive execution recognizes recently performed calculations and avoids reperforming them.

[[1/x inverse instructions]] provide greater opportunity for CSE than do ordinary [[divide instructions]].

[[Micro-optimization_primitive_instructions]] provide more opportunity for [[CSE]]ing.

[[Hardware common-sub-expression elimination]] has been proposed - by me (Andy Glew), in my abandoned Ph.D. on [[instruction refinement]],
if by no one elsse. In fact [[nearly all Dragon-book compiler optimizations can be doone in hardware]].


[[Instruction reuse]] is a hardware technique that extenddsCSE dynamically over moderate to large distances.

Micro-optimization primitive instructions

http://semipublic.comp-arch.net/wiki/Micro-optimization_primitive_instructions


By [[micro-optimization primitive instructions]] I mean instructions that expose the internal steps and partial results of what are otherwise considered to be primitive instructions.

One of the best academic and published examples is in the Dally paper referenced at the bottom:
[[Micro-optimization of floating-point operations]].
However, other examples have been found in actual instruction sets, particularly before transistor counts increased to the point where fully pipelined floating point [[FADD]] and [[FMUL]] instructions were common.

= Examples of Floating-Point Micro-Optimization Instructions (from Dally paper) =

Examples from Dally paper: (transcribed not quite exactly)

;Automatic Block Exponent
:Identifying the largest exponent is cascaded additions, aligning all mantissae and adding without normalization.
* Saves: power of intermediate normalizations and roundings, unnecesary shifts
* Cost: Precision, unless extra mantissae width.See also [[block floating-point]], [[superaccumulator]].

;Shift Combining
:An alternative to automatic block exponent used in casxes where order of operations must be preserved: avoids repeatedly shifting left for normalization, and then shifting right for alignment.
;I.e. combines normalization and alignment shifts.
* Saves: power in redundant shifts.
* Cost: precision, sometimes

;Post Multiply Normalization
:Maintain extra guad bits to the right of mantissa, to avoid normalizzation shift, since multiplication of normalized numbers can denormalize by at most one bit position. (Denorm*norm can produce greater shifts.)

;Conventional Optimizations
:E.g.
(A + B) * (A - B)
can [[CSE (common sub-expression)]] the alignment of A and B.
* Saves: power
* Cost: the exposed micro-optimization, possibly requiring a few extra bits of mantissa and exponent.

;Scheduling
:Exposing primitive operations permits more effective optimization by compiler (or, for that matter, by a dynamic or [[OOO]] scheduler).

Dally suggests a fairly horizontal instruction format for these microoptimizations, consisting of
* Exponent Ops
** E+, E1 - exponent add/subtract (EC <- EA op EB) ** FF1 - returns shift required to normalize mantisssa MA (EC <- FF1 MA) * LDE, STE - load or store exponent as integer * Mantissa Ops ** M+, M-, M* - mantisssa add, subtract, multiply (MC <- MA op MB) ** SHR, SHL - mantissa shift left or right; supports [[shifting by a negative count]] in either dirrection ** ABS, NEG - zeroes and complements the mantissa sign bit ** LDM, STM - loads and stores mantissa as integer ** LDF, STF - load and store mantissa and exponent together as standard floating point number * Branch Ops ** BNEG - [[branch on exponent negative]] ** Bcond - [[branch on exponent and mantissa compare]]] (EA,MA) relop (EB,MB) = Other Possible Micro-optimizations = == Booth Encoding == Many processors use Booth encoding and multiplierarrays, for both integer and floating point. Booth encoding involves (a) preparing multiples of one operand. Unfortunatyely, the actual multiples depeend on how advanced the multiplier is - {-1,0,+1}, {-2,-1,0,+1,+2}, 3X, 4X, etc. (b) using the second operand selecting which multiples of the first operand are to be fed into the array for each digit group. We can certainly imagine [[CSE]]ing the work in preparing such multiples, for eother operand. Unfortunately, the multiples needed change with the exact multiplier used - one microarchitecture might require produced -1X and 3X multiples (1X and 2X are "free", shifting - while another might require 5X, etc. Therefore, the number of bits needed might change from chip to chip. We can imagine storing these large intermediate results in a SIMD packed vector, via an instruction that looks like: vreg256b <- booth_encode(64b) Heck - that might be a useful operation to provide even if not doing microoptimization. It has uses. The details of the Booth encoding used might be hidden. However, perhaps an easier approach might be to do this microoptimization in microarchitecture: e.g. have a [[dynamic scheduler]] recognize several multiplications that share a common first operand, and then decide to skip the [[Booth encoding]] pipestage. The skip might shorten the pipeline - or it might just be used to save power. This amounts to [[memoizing]] the Booth encoding. == Virtual Memory Translations == There are many, many, repeated translations of nearby virtual addresses that result in the same physical page address. ; CacheTranslation with Base Registers E.g. Austin and Sohi suggested caching the translations next to the base register, and thereby avoiding TLB lookup, and supporting more translatuons on a low ported TLB.

    Todd M. Austin and Gurindar S. Sohi. 1996. High-bandwidth address translation for multiple-issue processors. In Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96). ACM, New York, NY, USA, 158-167. DOI=10.1145/232973.232990 http://doi.acm.org/10.1145/232973.232990
Equivalently, power can be saved.


; Instructions to Save Translations
We can imagine that this might be exposed to software as an instruction that looks like
treg := [[translate_virtual_address]]( virtual address )
dreg := load_using_physical_adddress( treg + offset )
store_using_physical_address( treg + offset)

The first instruction saves the translation, and any permissions bits required, in treg.
treg might be an ordinary address sized integer register, or a special register to hold such a translation.

Note: this instruction, [[translate_virtual_address]], is often requested even outside the context of micro-optimization.
See [[uses of an instruction to save virtual to physical address translation]].

The load and store operations use a saved translation. I would imagine that they would trap if the memory location specified by offset and sizze fell outside the saved translation.

NOTE: in a [[virtual machine]] environment, a Guest OS executing this instruction would get at most a partial benefit.

;User Mode Instructions to Save Translations
Instructions such as [[load_using_physical_address]] can only be used by privileged code - if software can construct arbitrary physical registers.

But if the saved translation lives in a special treg, translation register, that can only be written by certain instructions,
then unprivileged code could employ this instruction.
Equivalently, this could be a [[privilege tagged capability register]] - possibly an ordinary register, with an extra bit that
gets set by the translate instruction.
The physical memory accesses would require that bit to be set.
Other operations on such a register might clear the tag bit.

This is how [[capability registers]] work for other capabilities;
physical addresses are just an extra capability.

Except... physical addresses can change, e.g. during [[virtual memory]] operations such as [[page fault]]s and [[COW]].

(1) we could expose the [[translation registers]] to the OS, which can treat them as an extended TLB, changing them as necessary.
: This is not unlike what old OSes needed to do when [[multitasking using base registers]].

(2) However, we might then need to re-translate. This could be accomplished by storing the original virtual address inside the treg [[[translation register]].

I.e. we have circled back to Austin and Sohi - except that instead of the physical address being a cache, it is more exposed.

Not completely exposed - it is a [[covert channel security hole]] to expose physical addresses.
But it is an explicitly acknowledged, albeit opaque, data quantity.


== Division Step, etc. ==

Any "step" operation, such as divide-step, multiply-step, sqrt-step,
can be considered as a micro-optimization primitive.

With the concomitant problem, that the algorithm, may vary between versions of a processor.

== TBD: other micro-optimizations ==

TBD

= References =
W. J. Dally. 1989. [[Micro-optimization of floating-point operations]]. SIGARCH Comput. Archit. News 17, 2 (April 1989), 283-289. DOI=10.1145/68182.68208 http://doi.acm.org/10.1145/68182.68208

W. J. Dally. 1989. Micro-optimization of floating-point operations. In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS-III). ACM, New York, NY, USA, 283-289. DOI=10.1145/70082.68208 http://doi.acm.org/10.1145/70082.68208

Saturday, February 12, 2011

RISC versus CISC

http://semipublic.comp-arch.net/wiki/RISC_versus_CISC

* [[RISC (Reduced Instruction Set Computer)]]
* [[CISC (Complicated Instruction Set Computer)]]

[[RISC]] and [[CISC]] have been discussed elsewhere, outside this wiki, in great detail.
I will not try to add much to what has already been written.

However, although I am a RISC sympathizer, I may take a contrarian stance, talking about some of the issues and problems with RISC.
The RISC philosophy was certainly influential. However, in many ways it failed.
I may try to talk about those failures.
And how the so-called [[RISC Revolution]] warped a generation of computer architects.

= Conclusion =

I did not really want to write this article. There is not much to add to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised, although they did bring some improvements.

However, I could not really write a wiki on computer architecture without mentioning [[RISC versus CISC]].'

I would prefer, however, to discuss interesting aspects of RISC computer architecture, that may not be discussed in many other places,
than to rehash this old debate:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is considered obligatory to explain
how a feature supports or opposes the [[[RISC philosophy]].


= Background =

The late 1970s and early 1980s were an era of increasing complexity of increasing capability and complexity in computers.
The mainframe IBM 360 family was reaching the peak of its influence.
The relatively simple PDP-11 minicomputer
was supplanted by the more complex DEC VAX.
Microprocessors were marching forward: the Intel 8086 started along the road to world domination in 1978,
the Motorola 68000 started along the road to elegant failure in 1979.
Intel had been talking about the
[http://en.wikipedia.org/wiki/Intel_432 Intel iAPX 432]
microprocessor for a while, with features such as bit granular instructions,
garbage collection in hardware and microcode,
a software object model supported by hardware, etc.
People were trying to solve the so-called software gap with hardware, by building computers that more closely mapped
whatever language implementation they were targeting.


Complexity seemed increasing.
In actuality, much of the software complexity was moved into microcode.
Woe betide languages and operating systems that did not match the language and operating system use cases the computer was designed to support.

= The RISC Revolution =

Into this came a bolt of sanity:
    David A. Patterson and David R. Ditzel. 1980. The case for the reduced instruction set computer. SIGARCH Comput. Archit. News 8, 6 (October 1980), 25-33. DOI=10.1145/641914.641917 http://doi.acm.org/10.1145/641914.641917
Heck - it was not even published in a refereed journal!

One must also mention the [[IBM 801 minicomputer]], arguably the first RISC.

The late 1980s and early 1990s were full of RISC computers, especially microprocessors: [[Power]], [[PowerPC]], [[MIPS]], [[SPARC]], [[Motorola 88K]].
Not to mention many more failed companies.
All seemed destined to replace the IBM mainframe, the VAX minicomputer and,
then, as the importance of the PC market grew, the PC microprocessor.
But only the 68000 Apple PC fell to the PowerPC RISC onslaught.
The VAX and other minicomputers died out.
But the IBM mainframe, and the Intyel x86, continued, the latter spectacularly.

Now, by 2010,
* IBM is slowly transitioning to [[Power]], but the IBM Z-series mainframe stays strong
* the PC world is still ruled by Intel [[x86]]
** Intel's RISC and VLIW projects fell by the wayside
* [[PowerPC]] and [[MIPS]] have been relegated to [[embedded computing]]
* [[ARM]] is the biggest RISC success story, in embedded and, particularly, in low power, cell phones, etc.
** [[ARM]] is likely to be Intel's big competition
* Sun [[SPARC]] survives, barely - but Sun itself got swallowed by Oracle
** Sun/Oracle x86 boxes predominate even here
* DEC is long dead, as is [[Alpha]]

= What Happened? =

Many RISC companies went after the high priced, high profit margin, workstation or server markets.
But those markets got killed by [[The March of the Killer Micros]], specifically, the x86 PC microprocessor.
It is telling that ARM is the most successful "RISC", and that ARM targetted embedded and low power, the low end,
lower than the PC market, rather than the high end.

[[PowerPC]] and [[MIPS]] made a concerted attack on the Intel x86 PC franchise.
Microsoft even provided Windows NT support.
But when Intel proved that they could keep the x86 architecture competitive,
and stay the best semiconductor manufacturer, the "RISC PC" withered.

Intel proved they could keep the x86 PC microprocessor competitive in several stages:
* the i486 proved that the x86 could be pipelined.
** Up until then one of the pro-RISC arguments was that CISCs were too complicated to pipeline. But, see the next section
** I was thinking about Motorola 88Ks about this time, when the i486 started being talked about, and I realized - RISC had no fundamental advantage
* the Intel P5/Pentium did in-order superscalar
* the Intel P6, first released as the Pentium Pro, then Pentium II, did out-of-order
** briefly, the fastest microprocessor in the world, beating even DEC Alpha

Some say that the P6 killed RISC.

A more nuanced view is that RISC was a response to a short term issue:
the transition from board level to chip level integration.
When only a small fraction of a board level computer could fit on a chip,
RISC made more sense. When more could fit on a chip, RISC made less sense.

Not no sense at all. The RISC principles always have value.
But less sense, less of a competitive advantage.
Unnecessary complexity is always wasteful.

Moving on...

= The CISCs that survived were not that CISCy =

The CISCs that failed - DC VAX and the Motorola 68000 - were the most CISCy.
Most instructions were variable length.
Some frequently used instructions could be very long.
Many instructions had microcode.
Many operations had side effects.
They had complicated addressing modes - elegant in their generality, but coompliocated, sometimes neceessitating microcode just to calculate an address.


The CISCs that survived - the IBM 360 mainframe family, and the Intel x86 - were not that CISCy.

Sure, they had some complicated, almost impossible to implement without microcode, instructions.
Just consider x86 FAR CAL through CALL GATE.

However, most of their instructions were simple:
ADD src_reg += dest_reg_or_mem.
True, Note [[load-store]].
But [[load-op]] pipelines were not that hard for the IBM mainframes, and the Intel i486 and P5, to implement.
And the Intel P6 showed that the microarchitecture could be [[load-stiore]], even though the ISA was not.

The IBM 360 family has relatively simple instruction decoding.

The Intel x86 has painful instruction encodings, ranging from a single byte up to 15 bytes.
But the most frequently used instructions are small.
The really complicated instruction encodings, with prefixes, proved possible to (a) implement in hardware, but (b) at a significant performance cost, on the order of 1 cycle per prefix originally.

Most important, these instruction sets had few side effects.
Yes, the x86 has condition codes.
But most x86 instructions overwrite all of the arithmetic condition codes (INC and DEC not affecting the carry flag being the notable exception).
This avoided the sort of RAW hazard on the condition codes that would have been required for an OOO implementation of, say, the Motorola 68000.

= Didn't RISC win anyway? =

Did not RISC win anyway? After all, doesn't a modern CISC processor translate to RISC uops internally?

Well, yes and no. Let's look at some of the RISC principles proposed in early papers

* fixed length instructions - 32 bit
** at the [[macroinstruction set level]], not so much
** at the [[microinstruction set]] or [[UISA]] level, maybe
*** maybe- but definitely not "small". Is it a RISC if the (micro)instructions are 160 bits wide
*** even here, the recent trend is to compressing microcode. Some are small, some take more bits.
** the most popular surviving RISC instruction sets have 16 bit subsets to increase code density

* simple instruction decoding
** ISA level, no
** UISA level - undocumented. Probably. But, again, very wide!!

* software floating point
** nope!!

* large uniform register set
** in the early days, not so much
** over time, the register set has grown. As has the complexity of the ISA encoding, [[REX bytes]] etc.

* small number of instructions
** definitely not!!!
** microcode instruction sets have long been full of many instructions, particularly widget instructions
** even macroinstruction sets have increased dramatically in size since 1990. More than quadrupled.

Some have said that the point of RISC was not reduced instruction count,
but reduced instruction complexity.
This may be true - certainly, this was always the argument that I used *against* rabid RISC enthusiasts who were trying to reduce the number of instructions in the instruction set.
But nevertheless, there were many, many, RISC zealots and managers who evaluated proposals by counting instructions.


= What's left? =

The most important intellectual survivor of the RISC philosophy has been to aversion to complicated microcode sequences.
Expanding instructions to 1 to 4 [[uop]]s may be okay,
but taking a cycle or more to branch into a [[microcode sequencer]], perform the operation, and branch back is a performance penalty nobody wants to take.

The number of instructions in instruction sets has increased dramatically over the past decade.
But the vast majority of these are instructions that can be implemented directly by hardware,
by combinatoric logic,
or by simple state machines in the case of operations such as divide.

One may conjecture that the original RISC aversion to floating point
was actually to microcode floating point:
when the most important floating point operations like [[FADD]] and [[FMUL]] became pipelined,
capable of being started one or more per clock
even though the operation took several cycles of latency,
the objections to FP on a RISC died away.

== Bad Effects ==

We are still paying the cost of certain RISC-era design decisions.

For example, Intel MMX is irregular.
It does not have all the reasonable combinations of
datasize={8,16,32},
unsaturated, signed and unsigned saturation.
This irregularity was NOT because of hardware complexity,
but because management was trying to follow RISC principles by counting instructions.
Even when providing all regular combinations would have made the hardware simpler rather than harder.
(Aside: validation complexity is often used as an argument here, against regularity. The complexity of validating all regular combinations grows combinatorically.)

AMD x86-64 is not a bad instruction set extension.
But life might be easier in the future, if x86 does not die away, if it had been more regular.
More RISC-like.
But in this way RISC was its own enemy:
RISC did not achieve the often hoped for great performance improvements over CISC.
RISC reduces complexity, which does not directly improve performance.
So people went chasing after a Holy Grail of VLIW performance, which also did not pan out.

= Conclusion =

I did not really want to write this article. There is not much to add to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised, although they did bring some improvements.

However, I could not really write a wiki on computer architecture without mentioning [[RISC versus CISC]].'

I would prefer, however, to discuss interesting aspects of RISC computer architecture, that may not be discussed in many other places,
than to rehash this old debate:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is considered obligatory to explain
how a feature supports or opposes the [[[RISC philosophy]].

Wednesday, February 09, 2011

Non-speculative or less speculative versus more speculative

http://semipublic.comp-arch.net/wiki/Non-speculative_or_less_speculative_versus_more_speculative

When discussing [[speculative execution]] techniques such as [[speculative multithreading]] (or [[skipahead]], or [[slipahead]], or ... )
we often need to talk about less speculative and more speculative threads or instructions.

E.g. a later, younger, more speculative load instruction may be blocked by an earlier, older, less speculative store instruction to the same address.

E.g. in [[SpMT]] or [[SkMT]] a less speculative thread may fork a more speculative thread.

Often in early work we only talk about non-speculative threads forking,
and do not discuss the possibility of a speculative thread forking.
Late, it is realized that it is entiirely reasonable for speculative threads to fork.
Thus, often we need to read "non-speculative" in early work,
and implicitly substitute "less speculative".

E.g. a common [[speculative thread management policy]] is to always keep the least speculativethreads,
and cancel any more speculativethreads,
when there is an opportunity to spawn a less speculative thread.

---

Note that "less speculative" is not necessarily the same as "older".
E.g. in a [[fork-on-call]] thread, certain instructions may beindependent of the skipped code,
and be guaranteed to be executed;
whereas other instructions, inside a skipped function,
or older in the [[Von Neumann instruction sequence]],
but are actually more speculative, as in less likely to be executed.
However, it is often too hard to track this, so often discussions
will implicitly assume "less speculative" is "older".

---

Note that it is not always possible to determine the order of speculative threads or instructions.
E.g. one may fork loop bodies out of order, with no known order,
and later string them together as memory items iterated on by the loop are encountered.
However, arbitrarily imposing an order simplifies many speculative algorithms,
even at the cost of potential performance.

---

* See[[spawning versus forking a thread]].

Skipahead Multithreading

http://semipublic.comp-arch.net/wiki/SkMT

[[Skipahead multithreading (SkMT)]] is a form of [[speculative multithreading (SpMT)]]
characterized by "skipping ahead" at certain points in
[[non-speculative or less speculative versus more speculative|non-speculative or less speculative execution to more speculative execution]].

Typically these "certain points" in thecode are places
where there is a well characterized [[control independence or convergence]] point:
* the instruction after a CALL instruction
* end of loop
* later iterations of loop
* IF convergence

I, Andy Glew, coined the term [[SkMT]]
when it became evident that the term [[SpMT]],
which was itself coined by Antonio Gonzales and promoted by me,
was more generic.
I.e. you can imagine creating speculative threads
that do not really skip that far ahead,
but which, e.g. execute past a place where execution would bee blocked,
either an in-order blockage, or where an OOO window would be full.
See [[non-skipahead speculative multithreading]].

    Oooo.... I just created a new term: [[slip-ahead multithreading]]. It rather nicely encapsulates what I just described, and is consistent with published project names such as [[slipstream]].

In much the same way, I had earlier used the term [[implicit multithreading (IMT)]],
and replaced it by [[speculative multithreading (SpMT)]],
which I am now (after, 10 years ago, 2000) specializing to [[skipahead multithreading (SkMT)]].

The term [[skipahead]] is intended to be contrasted with [[lookahead]],
a term which was once used to characterize all [[out-of-order (OOO)]] execution.

    Look-ahead processors. Robert M. Keller, Princeton. ACM Computing Surveys, Vol 7, Issue 4, Dec 1975. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.5118&rep=rep1&type=pdf