Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, February 21, 2011

Timing Invariant Branch Prediction

http://semipublic.comp-arch.net/wiki/Timing_Invariant_Branch_Prediction
{{Terminology Term}}
[[Category:Branch Prediction]]

By [[Timing Invariant Branch Prediction]] I mean branch prediction that is independent of timing.

E.g. a branch predictor design whose predictions would not vary
if you slowed down the clock or changed the pipeline, for instance by inserting idle pipestages.

Such invariance is very convenient
(1) for validation
(2) for separation of concerns - it allows you to change the pipeline without worrying about its effect on branch prediction accuracy, etc.

However, many microarchitectures do not adhere to this design principle.

Tweaks and adjustments to the branch prediction microarchitecture may be necessary to attain [[Timing Invariant Branch Prediction]].
For example
* Per-branch history often leads to timing dependent branch prediction, which can be remedied by [[path history]]
* Updating the branch history tables can lead to timing dependent branch prediction

= Per-branch history often leads to timing dependent branch prediction =

For example, if you are using a [[per-branch history]] such as a [[TNT branch history]],
in many pipelines there are several clocks between instruction fetch,
i.e. delivering an instruction pointer to the instruction cache and branch predictor,
and instruction decode.
This means that at instruction fetch/branch prediction time you do not know where the branch instructions are;
you only know where the branch instructions are at decode time.

Therefore, unless you stall,
branch prediction may be using a [[per-branch history]] that is several cycles out of date,
and which may be missing several branches
compared to the history it would ideally be using if branches could be identified instantaneously.
Moreover, how many branches are missing may depend on timing, such as [[I$]] misses, pipeline stalls, etc.
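
The effect described above can be illustrated with a toy model. This is only a sketch under my own assumptions: the function name, the trace format, and the idea that a branch's direction bit enters the history exactly <code>decode_delay</code> cycles after it is fetched are all hypothetical, not taken from any real design.

```python
def history_seen_at_fetch(directions, fetch_cycles, decode_delay, bits=8):
    """Return, for each branch, the TNT history visible when it is fetched.

    directions[i]   -- True if branch i is (predicted) taken
    fetch_cycles[i] -- cycle in which branch i is fetched (hypothetical trace)
    decode_delay    -- cycles from fetch until decode identifies the branch
    """
    mask = (1 << bits) - 1
    seen = []
    for i, cycle in enumerate(fetch_cycles):
        hist = 0
        # Only older branches that have already reached decode are in the history.
        for j in range(i):
            if fetch_cycles[j] + decode_delay <= cycle:
                hist = ((hist << 1) | int(directions[j])) & mask
        seen.append(hist)
    return seen


directions = [True, False, True, True]
back_to_back = [0, 1, 2, 3]   # no stalls
with_stall = [0, 1, 5, 6]     # e.g. an I$ miss before the third branch

print(history_seen_at_fetch(directions, back_to_back, decode_delay=2))
print(history_seen_at_fetch(directions, with_stall, decode_delay=2))
```

With a nonzero decode delay the two traces see different histories for the same branches, although the program path is identical. With <code>decode_delay=0</code>, i.e. instantaneous branch identification, both traces see the same history.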

== Path history enables timing independent branch prediction ==

If it is a problem to have timing dependent branch prediction
caused by per-branch history, this can be assuaged by [[path history]] branch prediction.

Instruction fetch does not necessarily know where the branches are. However, it necessarily does know the sequence of instruction fetch addresses.
If it is possible to create a path history suitable for use in a branch predictor,
e.g. by XORing the instruction fetch pointers,
then this [[path history]] is accurate and timing invariant.
Since XOR hashing is fast, this probably can be achieved.
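
A minimal sketch of such a path history, assuming a shift-and-XOR fold of the fetch pointers into a fixed-width register. The width, the shift amount and the function names are my own illustrative choices.

```python
def path_history(fetch_addrs, bits=16, shift=1):
    """Fold a sequence of instruction fetch addresses into a path history."""
    mask = (1 << bits) - 1
    hist = 0
    for addr in fetch_addrs:
        # Shift first so that the order of addresses matters, then XOR in the new one.
        hist = ((hist << shift) ^ addr) & mask
    return hist


def history_from_trace(trace, bits=16, shift=1):
    """trace is a list of (cycle, fetch_addr) pairs; the cycle is deliberately ignored."""
    return path_history([addr for _cycle, addr in trace], bits, shift)


fast = [(0, 0x1000), (1, 0x1010), (2, 0x2000)]
slow = [(0, 0x1000), (7, 0x1010), (12, 0x2000)]  # same path, with stalls inserted
print(history_from_trace(fast) == history_from_trace(slow))
```

Because the hash consumes only the sequence of fetch addresses and never the cycle numbers, inserting stalls or pipestages cannot change it.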

However, XORing all fetch IPs may not be the best [[path history]] to use in branch prediction.
Creating a hash, probably XOR based, of [[branch from/to addresses]],
suffices to describe the path - although it loses information about branches between instruction fetch blocks.
Hashing such a [[branch from/to addresses]] [[path history]] with the current instruction fetch pointer
is about as good as you can do at instruction fetch,
without identifying individual instructions.
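
As a sketch of what this might look like: the particular shifts, widths and function names below are assumptions for illustration only. A real design would tune the hash.

```python
def update_path(hist, from_addr, to_addr, bits=16):
    """Fold one taken branch (from -> to) into the path history (hypothetical hash)."""
    mask = (1 << bits) - 1
    return ((hist << 2) ^ from_addr ^ (to_addr >> 2)) & mask


def table_index(hist, fetch_ptr, index_bits=10):
    """Hash the path history with the current fetch pointer to index a prediction table."""
    return (hist ^ (fetch_ptr >> 4)) & ((1 << index_bits) - 1)


# Two paths that arrive at the same fetch pointer via different branch targets.
path_a = update_path(0, 0x1234, 0x2040)
path_b = update_path(0, 0x1234, 0x2080)
print(hex(table_index(path_a, 0x4000)), hex(table_index(path_b, 0x4000)))
```

The same fetch pointer maps to different table entries depending on how it was reached, and nothing in the index depends on when the fetches occurred.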

== Combining ... ==

[[Path history]] based branch predictors are usually reported as being more accurate than [[per-branch history TNT]],
so the pipeline adjustments above may help performance as well as providing [[Timing Invariant Branch Prediction]].

However, if they do not, you can obtain a hybrid that provides many of the benefits of [[timing invariant branch prediction]]
along with the possible improved accuracy of [[per-branch history]]:
* use [[path history]] at instruction fetch
* use [[per-branch history]] at the decoder, in a form of [[late-pipestage branch prediction]]

This gives you timing invariance,
but it also gives the [[decoder branch predictor]] the chance to make corrections to the earlier branch prediction.
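
A rough sketch of the hybrid arrangement, assuming 2-bit saturating counters and simple XOR indexing. The class, method and table names are hypothetical, and table update is omitted.

```python
class HybridPredictor:
    """Hypothetical two-stage predictor: path history at fetch, TNT history at decode."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # 2-bit saturating counters, initialised to weakly taken (assumed encoding).
        self.fetch_table = [2] * (1 << index_bits)
        self.decode_table = [2] * (1 << index_bits)

    def predict_at_fetch(self, path_hist, fetch_ptr):
        # Uses only fetch addresses, so the result does not depend on pipeline timing.
        return self.fetch_table[(path_hist ^ fetch_ptr) & self.mask] >= 2

    def predict_at_decode(self, tnt_hist, branch_addr, fetch_prediction):
        # The branch is now identified, so an in-order per-branch history is available.
        decode_prediction = self.decode_table[(tnt_hist ^ branch_addr) & self.mask] >= 2
        # If the decoder disagrees with the fetch-time prediction, the front end is resteered.
        resteer = decode_prediction != fetch_prediction
        return decode_prediction, resteer


predictor = HybridPredictor()
early = predictor.predict_at_fetch(path_hist=0x1A24, fetch_ptr=0x4000)
print(predictor.predict_at_decode(tnt_hist=0b1011, branch_addr=0x4008, fetch_prediction=early))
```

The decoder only overrides the fetch-time prediction when the two disagree, so the cost of the late predictor is paid only on a resteer.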

= Branch History Update Time =

Q: when should the prediction tables, the [[pattern history table (PHT)]], also sometimes called the [[branch histogram]] or [[branch history table (BHT)]],
be updated?

At execution time, or at retirement time.

Updating at retirement time enables [[Timing Invariant Branch Prediction]].

Updating at execution time may cause only minor issues on an in-order machine.
On an out-of-order machine, however, updates may be done out of order.
In either case they may be done speculatively,
for branches that will not actually be retired because of earlier misspeculations.
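
To see why update order can matter, consider a pattern table entry implemented as a 2-bit saturating counter. That is a common choice, but an assumption here, and the helper names are hypothetical. Saturation makes the updates non-commutative:

```python
def update_counter(counter, taken, max_value=3):
    """Update one saturating counter with a single branch outcome."""
    if taken:
        return min(counter + 1, max_value)
    return max(counter - 1, 0)


def apply_updates(counter, outcomes):
    """Apply a sequence of outcomes to one counter, in the order given."""
    for taken in outcomes:
        counter = update_counter(counter, taken)
    return counter


program_order = [True, False]   # order in which the branches retire
executed_order = [False, True]  # the same two branches, executed out of order

print(apply_updates(3, program_order), apply_updates(3, executed_order))
```

Starting from a saturated counter, the same two outcomes leave the entry in different states depending on the order in which they are applied, so the table contents depend on execution timing.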

Furthermore there arises the question of what history or [[stew]] is used to update the pattern history table.
If every branch at execution carries its history or stew with it, no problem.
But if a big complicated history is maintained only at retirement,
some processor designs have updated the pattern table for branches at execution
with a history corresponding to a position in the instruction stream several cycles before the branch.
Not necessarily a consistent number of cycles, either.

Updating the prediction tables at retirement time seems to avoid these issues.

TBD: performance cost

TBD: latencies of table update - immaterial if [[pattern history invariant branch prediction]].

= Conclusion =

Is [[Timing Invariant Branch Prediction]] an absolutely vital design goal?

Not necessarily - if performance is increased by timing variant branch prediction, so be it.

However, [[Timing Invariant Branch Prediction]] is definitely a nice thing to have:
* it makes validation much easier
* it makes the design more robust, less fragile, less likely to break if you have to add a pipestage or stall late in the design cycle.
* it is usually associated with higher performance rather than lower performance branch prediction algorithms
* it can usually be achieved by a hybrid predictor design.

It is my experience that most timing dependent branch predictors
happened by accident, rather than by design:
* naive designers building a [[per-branch history]] out of a textbook
* naive extension of in-order designs to out-of-order exacerbating unnecessary timing dependence
etc.

[[Timing Invariant Branch Prediction]] is not necessarily a must-have,
but it is always a good thing to keep in mind when designing
your branch predictor and your microarchitecture/pipeline.
It is a pity to lose its benefits due to ignorance rather than deliberation.
