Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 06, 2011

Managing branch prediction history: copying short history versus pointing to large history

http://semipublic.comp-arch.net/wiki/Managing_branch_prediction_history:_copying_short_history_versus_pointing_to_large_history

When [[branch prediction history]] was short, 8-12 bits, and global,
then it was not unreasonable to manage the history by copying it.

E.g. in one simulator I actually arranged so that a branch [[uop]]
wrote back, as its result:
* an indication of whether it was mispredicted or not
* the taken [[target IP]]
* the branch predictor history to be restored on a branch misprediction.

I.e. the branch prediction history was propagated, in this simulator, from the [[instruction fetch front end]]
across the scheduler to execution, and back again.

While simple, this involves a lot of unnecessary data movement, both for the history and for the [[target IP]].
Most machines of my acquaintance create a [[branch information table (BIT)]], holding information for branches in flight.
This avoids copying the history from the front end to execution and back again,
but nevertheless may involve making copies of the history.
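
As a rough illustration of the copying approach, here is a minimal Python sketch of a BIT whose entries each hold a copy of the global history. The names (`CopyingPredictorState`, `BitEntry`, `history_before`) and the 12-bit TNT history are assumptions made for illustration only, not a description of any real machine.

```python
from dataclasses import dataclass

HISTORY_BITS = 12                      # assumed short global history
HISTORY_MASK = (1 << HISTORY_BITS) - 1

@dataclass
class BitEntry:
    history_before: int    # copy of the global history seen at prediction time
    predicted_taken: bool

class CopyingPredictorState:
    """Hypothetical front end that checkpoints history by copying it."""

    def __init__(self):
        self.history = 0
        self.bit = {}          # branch id -> BitEntry, for branches in flight
        self.next_id = 0

    def predict(self, predicted_taken):
        """Allocate a BIT entry and speculatively update the history."""
        branch_id = self.next_id
        self.next_id += 1
        self.bit[branch_id] = BitEntry(self.history, predicted_taken)
        self.history = ((self.history << 1) | int(predicted_taken)) & HISTORY_MASK
        return branch_id

    def resolve(self, branch_id, actual_taken):
        """Free the entry; on a misprediction restore the saved copy."""
        entry = self.bit.pop(branch_id)
        if actual_taken == entry.predicted_taken:
            return False
        # Younger branches are on the wrong path: discard their entries.
        for younger in [b for b in self.bit if b > branch_id]:
            del self.bit[younger]
        # Overwrite the history with the saved copy plus the real outcome.
        self.history = ((entry.history_before << 1) | int(actual_taken)) & HISTORY_MASK
        return True
```

Every entry carries `HISTORY_BITS` of state, which is the cost the rest of this page is concerned with.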


    I confess an ulterior motive for copying the history and other branch information around: this naturally leads to the number of branches in flight scaling with the window size. Many early OOO designs were crippled by supporting too few branches in flight.


Making copies of the history seems silly if, as in a TNT history, they differ only by 2 bits:
 new_history := (old_history << 1) | new_branch_taken_or_not


    (Note: when shifting a branch history, we will often say h<


But making copies seems to be required if we want to be able to restore to any mispredicted branch point, i.e. if we want to do [[instantaneous versus incremental branch misprediction repair]].

Making copies works well enough if the branch prediction history is small enough - 8 bits, etc.
But by the time we are talking about 16 bit histories and 32 or 64 bit IPs, we are talking about quite a few bits.
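
To put rough numbers on this, the following Python sketch multiplies out the per-branch state. The function name and the figure of 32 branches in flight are assumptions chosen for illustration; real designs will differ.

```python
def copied_state_bits(branches_in_flight, history_bits, ip_bits):
    """Total checkpoint storage when every in-flight branch keeps its own
    copy of the history and of an IP (hypothetical cost model)."""
    return branches_in_flight * (history_bits + ip_bits)

# Assumed: 32 branches in flight.
small = copied_state_bits(32, history_bits=8, ip_bits=32)     # 32 * 40
larger = copied_state_bits(32, history_bits=16, ip_bits=64)   # 32 * 80
print(small, larger)
```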

Furthermore, in the late 1990s and early 2000s branch predictors arose with hitherto unseen long histories,
such as Seznec's OGEHL predictor (with multiple history lengths 0, 2, 4, 8, 16, 32, 64, 128, ...).
Copying around a 128 bit history, even to a [[BIT]], is wasteful;
copying around the even larger 1000+ bit histories that have been proposed is even worse.

Hence the interest in pointing to a position in a branch prediction history,
rather than copying the entire history.
On a branch misprediction one would restore the pointer, rather than overwriting the history with a saved copy.

This would be straightforward for a long [[TNT]] history, since only 1 bit depends on any branch.
You would simply keep a history of total length TL=PHL+BIF, the sum of predictor history length plus branches in flight.
On a misprediction you would restore the pointer in this circular buffer.
(Or, equivalently, shift the buffer, although I suspect that shifting is too power hungry.)

This could scale to almost any length of history, potentially thousands of bits.
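
The pointer scheme for a TNT history can be sketched in Python as follows. This is a minimal model under stated assumptions: the class and method names are hypothetical, the buffer is a plain list of bits of length TL=PHL+BIF, and the caller is trusted never to have more than BIF unresolved branches in flight.

```python
class PointerTntHistory:
    """TNT history kept in a circular buffer; checkpoints are just pointers."""

    def __init__(self, predictor_history_length, branches_in_flight):
        self.phl = predictor_history_length
        self.tl = predictor_history_length + branches_in_flight  # TL = PHL + BIF
        self.bits = [0] * self.tl
        self.head = 0   # number of bits pushed so far; next slot is head % tl

    def push(self, taken):
        """Record one branch outcome; return the pointer to restore later."""
        checkpoint = self.head
        self.bits[self.head % self.tl] = int(taken)
        self.head += 1
        return checkpoint

    def history(self):
        """The PHL most recent bits, youngest first."""
        return [self.bits[(self.head - 1 - i) % self.tl] for i in range(self.phl)]

    def repair(self, checkpoint, actual_taken):
        """Misprediction: restore the pointer, then record the real outcome."""
        self.head = checkpoint
        self.push(actual_taken)
```

A checkpoint here costs only about log2(TL) bits per branch, regardless of how long the history is. Wrong-path bits are never erased; they are simply overwritten as the pointer advances again.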

Unfortunately, modern [[stew]] histories are more complicated than [[TNT]] histories.
The [[branch IP]], or even both [[from IP]] and [[to IP]], may be [[hashed, e.g. XORed]] into the stew.
This means that several of the youngest bits in the history may change on every branch.
Simply restoring a pointer will not suffice.

TBD: explain stew management in more detail.

Simple strategy: constrain the stew to have only the N youngest bits affected by the most recent branch. Bits TL..N are unaffected by the most recent branch, except for shifting.
One can then keep a copy of the parts of the history affected by recent branches, the N youngest bits, and a pointer that locates the older bits.
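
A minimal Python sketch of this simple strategy follows. Everything in it is an assumption for illustration: the class name, the parameter names, and especially the mixing function, which here just XORs the low N bits of `(ip << 1) | taken` into the young part. The only property that matters is that the mix touches no more than the N youngest bits, so older bits change only by shifting.

```python
class PointerStew:
    """Stew split into N young bits (copied) and older bits (pointed to)."""

    def __init__(self, total_bits, young_bits, branches_in_flight):
        self.n = young_bits
        self.young_mask = (1 << young_bits) - 1
        self.old_len = total_bits - young_bits
        self.size = self.old_len + branches_in_flight   # circular buffer length
        self.old = [0] * self.size
        self.head = 0      # count of bits retired from the young part so far
        self.young = 0

    def update(self, ip, taken):
        """Apply one branch; return the (small copy, pointer) checkpoint."""
        checkpoint = (self.young, self.head)
        # The bit shifted out of the young part can no longer change: store it.
        self.old[self.head % self.size] = (self.young >> (self.n - 1)) & 1
        self.head += 1
        mix = ((ip << 1) | int(taken)) & self.young_mask   # hypothetical hash
        self.young = ((self.young << 1) & self.young_mask) ^ mix
        return checkpoint

    def repair(self, checkpoint, ip, actual_taken):
        """Misprediction: restore the copy and the pointer, then redo the branch."""
        self.young, self.head = checkpoint
        self.update(ip, actual_taken)

    def value(self):
        """Reassemble the full stew as an integer (older bits above young bits)."""
        v = 0
        for i in range(self.old_len - 1, -1, -1):   # oldest bit first
            v = (v << 1) | self.old[(self.head - 1 - i) % self.size]
        return (v << self.n) | self.young
```

Each checkpoint holds N bits plus a pointer, rather than the whole stew. As with the TNT buffer, this sketch assumes the caller never exceeds the stated number of branches in flight.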

= See Also =

* [[How to use a really long predictor history]]

Branch Prediction Stew

{{Terminology Term}}
http://semipublic.comp-arch.net/wiki/Stew

The [[stew]] is a form of history used by certain branch predictors.

See, for example, US patent 7143273,
Method and apparatus for dynamic branch prediction utilizing multiple stew algorithms for indexing a global history,
Mile, Slade, and Jourdan,
filed March 31, 2003,
assignee Intel.

A simple [[branch predictor history]] might be a simple [[TNT]] history,
with 0s corresponding to non-taken and 1s corresponding to taken.
Such a simple TNT history cannot distinguish some converging paths,
and indirect branches.

US7143273
describes one embodiment of a stew as
 stew = ((stew << 1)|new_bit ^  ip)

A stew formed in this way can distinguish converging paths such as ("if true" means a condition that evaluates as true, not an unconditional branch):
   L1: if true goto L2
   L2: if true goto L
   L10: if true goto L11
   L11: if true  goto L
   L:  if ?? goto L99
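
To make the difference concrete, here is a small Python sketch comparing a TNT history with a stew on two paths of two taken branches each, standing in for the L1, L2 path and the L10, L11 path. The branch addresses are made up, the 16-bit width is an assumption, and the stew update uses one possible reading of the expression above, `((stew << 1) | new_bit) ^ ip`.

```python
BITS = 16
MASK = (1 << BITS) - 1

def tnt_update(history, taken):
    """Plain taken/not-taken history: shift in one outcome bit."""
    return ((history << 1) | int(taken)) & MASK

def stew_update(stew, ip, taken):
    """Assumed reading of the stew update: shift in the bit, then XOR the IP."""
    return (((stew << 1) | int(taken)) ^ ip) & MASK

def run(path):
    """Return (tnt, stew) after following a list of (branch_ip, taken) pairs."""
    tnt = stew = 0
    for ip, taken in path:
        tnt = tnt_update(tnt, taken)
        stew = stew_update(stew, ip, taken)
    return tnt, stew

# Hypothetical addresses for the branches at L1, L2 and at L10, L11.
path_a = [(0x1000, True), (0x1008, True)]
path_b = [(0x2000, True), (0x2008, True)]

tnt_a, stew_a = run(path_a)
tnt_b, stew_b = run(path_b)
print(tnt_a == tnt_b)     # both paths are taken, taken
print(stew_a == stew_b)   # differs here because the branch IPs differ
```

With these particular addresses the two stews differ while the TNT histories are identical. Other addresses could collide, since the stew is only a hash.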

However, it does not distinguish multiple  branch targets and paths out of an indirect branch, such as
   L1: Reg:= IL1; if true goto L2
   L2: if true goto L
   L10: Reg:=IL2; if true goto L11
   L11: if true  goto L
   L:  if ?? goto [Reg]
   IL1: ...
   IL2: ...
It can be seen that mixing in arc information as well as node information remedies this situation,
and distinguishes different paths so long as the hashes do not collide:

 stew <<= number_of_bits_to_discard
 stew = hash( stew, from_IP, to_IP, taken/not_taken, ...)
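
The effect of mixing in arc information can be sketched in Python. This uses a simplified variant of the situation above: the same indirect branch, reached with the same prior stew, jumping to two different targets. The hash functions, the 16-bit width, and all addresses are hypothetical choices for illustration, not the patent's.

```python
STEW_BITS = 16
MASK = (1 << STEW_BITS) - 1

def node_stew(stew, from_ip, taken):
    """Node-only stew: hashes in the branch IP but not its target."""
    return (((stew << 1) | int(taken)) ^ from_ip) & MASK

def arc_stew(stew, from_ip, to_ip, taken, number_of_bits_to_discard=1):
    """Arc stew: discard old bits, then hash in both ends of the arc.
    The XOR-and-shift mix is an assumed stand-in for hash(...)."""
    stew = (stew << number_of_bits_to_discard) & MASK
    return (stew ^ from_ip ^ (to_ip >> 1) ^ int(taken)) & MASK

prior = 0x1234          # same history on arrival at the indirect branch
branch_ip = 0x3000      # hypothetical address of the indirect branch at L
target_1 = 0x4000       # hypothetical address of IL1
target_2 = 0x5000       # hypothetical address of IL2

# The node-only stew cannot tell the two outcomes apart.
same = node_stew(prior, branch_ip, True) == node_stew(prior, branch_ip, True)

# The arc stew differs because the target address is hashed in.
after_1 = arc_stew(prior, branch_ip, target_1, True)
after_2 = arc_stew(prior, branch_ip, target_2, True)
print(same, after_1 != after_2)
```

Subsequent branches therefore see different stews depending on which target was taken, as long as the hashes do not collide.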

Issue: how many bits to use? This may vary as a function of the type of branch: e.g. a direct conditional branch
may not need as many to_ip bits to be hashed in
as a completely random indirect branch.
Similarly, indirect calls and returns may be handled separately.

(TBD-IP)