Register Renamer Port Reduction
On comp.arch, circa August 2010, Mitch Alsup said something like "It isn't the ports on the register file that are a problem, it's the ports on the renamer" - with, IIRC, a comment that you have to squeeze the renamer into a single pipestage.
Not true. This reminds me of a comment by Tim Olson, then of AMD (and active on comp.arch), when I presented HaRRM, my Hardware Register Renaming Mechanism (which is basically the modern form of renaming) to him. He said that it would lose because it required an extra pipestage.
First, renamer ports: the renamer is a much smaller structure than the PRF, in terms of number of entries times number of bits per entry. This means that the wires involved are shorter, even though the logic depth, Mitch's preferred metric in most of his posts, is similar. Wires are a major contributor to delay.
Second, "port reduction" techniques can be applied to the renamer just as they can be applied to the register file. True, out-of-order execution may allow the execution units to pick up a higher fraction of their inputs on bypasses, whereas the renamer is essentially an in-order pipestage, and in-order means less port reduction. Nevertheless, there's a lot to latch on to.
= Components of Register Renaming =
Register renaming has three main parts:
a) bypassing: in a block of instructions, comparing the output registers of older instructions to the input registers of younger instructions
b) lookup: if a register is live-in to a block, i.e. if it is not bypassed, then looking it up in the renamer/map table.
c) allocation: if a register is written within the block, giving it a new physical register (and updating the array); also, arranging for this newly allocated register to be used as inputs by younger instructions in the block
= Lookup =
Lookup is the part that most people think of as accessing a highly ported array. E.g. if you are renaming 4 instructions per cycle, with 2 inputs, you might be looking up 8 inputs per cycle, and writing 4 new outputs.
But observe that whenever there is bypassing you don't need lookups.
For example, consider the code sequence
    ADD eax += M[ecx+edx]      ; cracked into two uops:
        tmp := load(ecx+edx)
        eax := eax + tmp
    CMP eax,ebx
    JNE target
The above group of 4 uops (load, add, compare, branch) only has 4 live-ins, not 8. Only 4 values need to be looked up in the RAT array. The remaining 3 input values (tmp for the add, the new eax for the CMP, and the flags for the JNE) are produced by earlier uops in the same block.
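A minimal Python sketch of this classification - the uop encoding and function names are hypothetical, chosen just to illustrate counting live-ins vs. bypassed inputs on the block above:

```python
# Sketch: classify each uop input as "bypassed" (produced by an older uop
# in the same block) or "live-in" (needs a RAT lookup).
uops = [
    ("load", ["ecx", "edx"], ["tmp"]),          # tmp := load(ecx+edx)
    ("add",  ["eax", "tmp"], ["eax", "flags"]), # eax := eax + tmp
    ("cmp",  ["eax", "ebx"], ["flags"]),        # CMP eax,ebx
    ("jne",  ["flags"],      []),               # JNE target
]

def classify(uops):
    produced = set()            # registers written by older uops in the block
    live_ins, bypassed = set(), 0
    for _, inputs, outputs in uops:
        for r in inputs:
            if r in produced:
                bypassed += 1   # satisfied by an in-block bypass
            else:
                live_ins.add(r) # must be looked up in the RAT
        produced.update(outputs)
    return live_ins, bypassed

live_ins, bypassed = classify(uops)
print(sorted(live_ins), bypassed)  # 4 live-ins, 3 bypassed inputs
```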
True, P6 brute forced this: we renamed three uops per cycle, and we always looked up 6 sources, one for each register operand - and we threw away any lookup that was bypassed within the block. This avoided the need to first do the bypassing and only then do the lookups, i.e. it avoided serializing the two steps, at the cost of lookup ports that were often wasted.
However, even before P6, when I was working on this form of register renaming at UIUC during my MSEE, I anticipated ways to avoid this. Of course, that was back when I was trying to create 16-wide superscalar machines, so reducing complexity was essential.
== Non-Blocking Limited Port Register Renaming ==
IMHO the most elegant way is similar to what is also IMHO the most elegant way to do RF read port reduction:
1) do the bypass comparisons of uop output to uop input earlier, possibly as a separate pipestage.
2) out of this previous work, generate a list of live-ins that need to be looked up. Have a limited number of lookup ports. Of course use port combining, so multiple reads of the same register live-in only get looked up once - although possibly only after some time.
The question is, what happens when you run out of lookup ports? You could do an in-order stall. But, being me, I like to let later instructions that are not dependent on one of the blocked lookups get renamed and proceed to the scheduler. I.e. I want to do out of order renaming. Actually, I think the easiest way to do this is to let the guys blocked on a renamer lookup port proceed to the scheduler, but to have a separate write port that fills in the physical register input later. This is not actually out-of-order: all uops are renamed in-order, but some are immediately renamed to physical registers, while others are renamed to placeholders that later get filled in with physical register numbers. I.e. the renaming is in-order, but the renaming all the way to physical registers may be out of order. (In my most BS-full moods I have imagined that the renaming is actually done out of an OOO scheduler, since even the bypass comparisons might be done with a non-full network, as will be discussed next.)
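A sketch of this non-blocking scheme in Python - the structure and names are hypothetical. Uops that lose lookup-port arbitration still enter the scheduler, holding a placeholder that a separate write port fills in from later cycles' spare lookup bandwidth. (A real design would also have to read the map state as of the blocked uop's position in program order; that complication is ignored here.)

```python
from collections import deque

LOOKUP_PORTS = 2   # hypothetical: RAT lookup ports per cycle

class Renamer:
    def __init__(self, rat):
        self.rat = rat              # logical -> physical map (RAT)
        self.pending = deque()      # (scheduler entry, logical reg) to fill
        self.scheduler = []         # entries: {logical: physical or None}

    def rename_cycle(self, live_ins_per_uop):
        ports = LOOKUP_PORTS
        looked_up = {}              # port combining within the cycle
        for live_ins in live_ins_per_uop:
            entry = {}
            for r in live_ins:
                if r in looked_up:
                    entry[r] = looked_up[r]    # combined: no new port
                elif ports > 0:
                    ports -= 1
                    looked_up[r] = self.rat[r]
                    entry[r] = self.rat[r]
                else:
                    entry[r] = None            # out of ports: placeholder
                    self.pending.append((entry, r))
            self.scheduler.append(entry)       # proceeds anyway

    def fill_cycle(self):
        # spare lookup bandwidth fills placeholders via a separate
        # write port into the scheduler entries
        ports = LOOKUP_PORTS
        while self.pending and ports > 0:
            entry, r = self.pending.popleft()
            entry[r] = self.rat[r]
            ports -= 1
```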
= Caching Renamed Instructions =
And, oh yes: it is trivial to imagine caching the renames, the results of the bypass comparisons, in a decoded instruction cache / trace cache. That is one of the reasons I worked on trace cache, way back then. You get blocks of instructions - possibly fixed size blocks, possibly variable length traces - with N live-ins that need to be looked up, and M internally bypassed values that are given relative numbers, i.e. "use the K-th register that is allocated to this block."
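A sketch of what such a cached block might hold - the layout is hypothetical, but it captures the idea: bypass comparisons are done once at build time, and each input is recorded as either "the i-th live-in" or "the k-th register allocated to this block":

```python
# Build-time renaming of a block: inputs become ("livein", i) or
# ("alloc", k), so dispatch only needs N lookups plus M allocations.
class CachedBlock:
    def __init__(self, uops):               # uops: (inputs, outputs) pairs
        self.live_ins = []                  # logical regs looked up at dispatch
        self.inputs = []                    # per uop: list of tagged inputs
        self.n_alloc = 0                    # physical regs to allocate
        writer = {}                         # logical reg -> allocation index k
        for inputs, outputs in uops:
            tagged = []
            for r in inputs:
                if r in writer:             # internally bypassed value
                    tagged.append(("alloc", writer[r]))
                else:                       # live-in (combined if repeated)
                    if r not in self.live_ins:
                        self.live_ins.append(r)
                    tagged.append(("livein", self.live_ins.index(r)))
            self.inputs.append(tagged)
            for r in outputs:
                writer[r] = self.n_alloc    # "the k-th register allocated"
                self.n_alloc += 1
```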
= Bypassing Renames Inside Instruction Blocks =
Now, how about the bypassing itself? (Apart from the caching thereof, mentioned above.)
Naively, that's O(N^2) comparators:

1st uop:
- lookup all of its inputs (some combining)
- compare its output to all N-1 younger uops' inputs

2nd uop:
- if its inputs are not bypassed from uop #1, lookup all of its inputs (some combining)
- compare its output to all N-2 younger uops' inputs

etc.
Dana Henry and Bradley Kuszmaul's Ultrascalar showed that you can do this in O(lg N) delay. But that's asymptotic, for really big machines.
== Per Logical Register Renaming Circuit ==
Here's another way to do bypassing, suitable for machines with small register sets:
* create a "carry chain" for each logical register
* 2 bits encode 3 states: "not used" (nu), "needs to be looked up" (lu), "written" (w)
* insert "not used" at the oldest instruction
* a read in state "not used" sets "needs to be looked up"; a write in state "not used" pulls the register down to "written"; otherwise the older state propagates, in the sense

    older  next      result
    nu     nu    =>  nu
    lu     *     =>  lu
    w      *     =>  w
This is O(N) delay, O(N*L) hardware, where N= number of uops, L=number of logical registers. It suffices to tell you what registers are live-in, and need to be looked up.
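A sketch of the chain in Python (the encoding is hypothetical): one tiny state machine per logical register, swept once from oldest to youngest uop. Per the table, "lu" and "w" are sticky; a register is live-in exactly when it is read before any in-block write.

```python
# Per-logical-register chain: O(N) sweep, O(N*L) hardware.
# A read in state "nu" marks the register "lu" (needs lookup);
# a write in state "nu" pulls it down to "w" (younger readers bypass).
def chain_live_ins(uops, logical_regs):     # uops: (inputs, outputs) pairs
    state = {r: "nu" for r in logical_regs} # "not used" at the oldest uop
    for inputs, outputs in uops:
        for r in inputs:
            if state[r] == "nu":
                state[r] = "lu"             # read before any in-block write
        for r in outputs:
            if state[r] == "nu":
                state[r] = "w"              # written first: never a live-in
    return [r for r in logical_regs if state[r] == "lu"]
```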
If you want, you can make the values inserted be "written by uop #k", and thereby establish the dataflow. This increases the hardware cost by a factor of log2(N) bits per entry. You take those sequence numbers and look them up in a queue of free registers, and/or add them to a base (the latter if doing a P6-style circular allocation in a ROB).
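A sketch of that extension - names hypothetical - where the chain carries "written by uop #k" instead of a plain "written" bit, and the writer indices are then mapped to physical registers popped from a free list:

```python
# Chain with writer tags: each input resolves to either a RAT lookup
# or a dataflow edge to the uop that produced it; each output gets a
# fresh physical register from the free list.
def rename_block(uops, logical_regs, free_list):
    state = {r: ("nu",) for r in logical_regs}
    srcs = []                     # per uop: tagged inputs
    allocs = []                   # (uop index, logical reg, phys reg)
    for k, (inputs, outputs) in enumerate(uops):
        tagged = []
        for r in inputs:
            if state[r][0] == "w":
                tagged.append(("uop", state[r][1]))  # in-block dataflow edge
            else:
                state[r] = ("lu",)
                tagged.append(("lookup", r))         # live-in: RAT lookup
        srcs.append(tagged)
        for r in outputs:
            state[r] = ("w", k)                      # written by uop #k
            allocs.append((k, r, free_list.pop(0)))  # new physical register
    return srcs, allocs
```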
(I've heard that some companies did freelists as CAMs. Enough so that they veered away from freelists in subsequent processors. Silly people.)
Of course, O(N*L) is not a win over O(N^2) unless L is small. Which it might have been, back when I imagined doing this for a 16-wide x86. But it no longer is, now that N seems to be stuck around 4 and L is 32 or more.
== Incomplete Comparisons ==
Note that you can also do an incomplete comparison: e.g. compare most uops to their immediate predecessor, which catches many dependencies. Queue up requests to do comparisons that are further apart on a smaller set of comparators.
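A sketch of the incomplete scheme (hypothetical names): compare each uop's inputs only against its immediate predecessor's outputs with cheap adjacent comparators, and queue any farther-apart dependency for a small shared comparator pool. On the example block earlier, adjacent comparators happen to catch all three in-block dependencies:

```python
# Incomplete bypass network: adjacent comparators catch producer/consumer
# pairs one uop apart; anything farther is queued for shared comparators.
def predecessor_bypass(uops):                # uops: (inputs, outputs) pairs
    caught, queued = 0, []
    produced = set()                         # outputs of all older uops
    prev_outputs = set()                     # outputs of the previous uop only
    for i, (inputs, outputs) in enumerate(uops):
        for r in inputs:
            if r in prev_outputs:
                caught += 1                  # cheap adjacent comparator
            elif r in produced:
                queued.append((i, r))        # needs a shared comparator
        produced |= set(outputs)
        prev_outputs = set(outputs)
    return caught, queued
```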
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
Tuesday, September 07, 2010