Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, August 13, 2011

Register_File_Port_Reduction_using_Time-based_SIMD

http://semipublic.comp-arch.net/wiki/Register_File_Port_Reduction_using_Time-based_SIMD

= Background: 4-cycle GPU SIMD instruction groups =

The most popular GPUs circa 2010 - Nvidia, AMD/ATI, Intel Gen - share the following characteristics:
* they have [[SIMD or SIMT]] [[coherent threading]]
* the SIMD is 16-way spatially:
** i.e. each SIMD engine is circa 16 [[lanes]] wide
** although the definition of the lane varies, e.g. from a 32-bit-wide scalar for Nvidia to 5-way VLIW for AMD/ATI
* the SIMD is 4-way temporally:
** i.e. every [[SIMD instruction group]], [[wavefront]] or [[warp]], occupies the 16 lanes for 4 cycles.

This wiki page discusses the temporal aspect:
specifically, one advantage of taking at least 4 cycles per [[SIMD instruction group]].
This is a particular case of
[[register file port reduction]].

= [[Register File Port Reduction using Time-based SIMD]] =

For simplicity, let us ignore the possibility of [[spatial SIMD]].
Let us assume that we have only one ALU,
taking 2 inputs per cycle and producing a single output per cycle.
(Or possibly 3 inputs, if we want to consider [[multiply-add]].)
(Or more different combinations of inputs and outputs, if we are particularly aggressive.)

I.e.
dest := opcode( src1, src2, src3 )

On a conventional scalar microarchitecture, this would require 3 read ports and 1 write port on the register file per cycle.
(Let's forget the possibility of architecturally requiring the registers to belong to banks with non-interfering ports.)

Now, instead, let us imagine that we are dealing with a SIMD instruction group or wavefront with 4 elements, i = 0..3.
And let us say that it is distributed over time, across 4 successive clock cycles t0..t3.

I.e.
t0: dest[0] := opcode( src1[0], src2[0], src3[0] )
t1: dest[1] := opcode( src1[1], src2[1], src3[1] )
t2: dest[2] := opcode( src1[2], src2[2], src3[2] )
t3: dest[3] := opcode( src1[3], src2[3], src3[3] )

Now, instead of reading 3 separate (e.g. 32-bit) values per cycle, and writing a single such value every cycle,
we could make each separate access 4X larger, but only do one such 4X larger access every cycle:

I.e.
t-3: src3_4x := RF.read( src3 )[0:127]
t-2: src2_4x := RF.read( src2 )[0:127]
t-1: src1_4x := RF.read( src1 )[0:127]
t0: dest_4x[0:31] := opcode( src1_4x[0:31], src2_4x[0:31], src3_4x[0:31] )
t1: dest_4x[32:63] := opcode( src1_4x[32:63], src2_4x[32:63], src3_4x[32:63] )
t2: dest_4x[64:95] := opcode( src1_4x[64:95], src2_4x[64:95], src3_4x[64:95] )
t3: dest_4x[96:127] := opcode( src1_4x[96:127], src2_4x[96:127], src3_4x[96:127] )
t3+1: RF.write( dest )[0:127] := dest_4x[0:127]

(As is typical in these discussions, we are constrained by the lack of universally understood [[slicing notation]])
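
The sequence above can be acted out in a few lines of Python. This is only a sketch under my own assumptions, not a description of any real GPU: the register file is modeled as a dictionary of 4-element rows, the names run_group and multiply_add are hypothetical, and multiply-add stands in for the opcode. The point it illustrates is that the group touches the register file on only 4 of its 8 cycles, and never more than once per cycle.

```python
LANES = 4            # elements per instruction group (assumed, as in the text)
MASK32 = 0xFFFFFFFF  # each element is 32 bits wide


def multiply_add(a, b, c):
    """Stand-in opcode: a*b + c, wrapped to 32 bits."""
    return (a * b + c) & MASK32


def run_group(rf, dest, src1, src2, src3, opcode=multiply_add):
    """Run one instruction group through a single 4X-wide RF port.

    rf maps a register name to a list of LANES 32-bit values (one wide row).
    Returns a log of (cycle, port_action); port_action is None on cycles
    where this group leaves the port free for some other group.
    """
    log = []
    # One wide read per cycle; each read fetches all 4 elements at once.
    src3_4x = list(rf[src3])
    log.append(("t-3", f"read {src3}"))
    src2_4x = list(rf[src2])
    log.append(("t-2", f"read {src2}"))
    src1_4x = list(rf[src1])
    log.append(("t-1", f"read {src1}"))
    # One element per cycle through the single ALU; no RF access needed.
    dest_4x = [0] * LANES
    for i in range(LANES):
        dest_4x[i] = opcode(src1_4x[i], src2_4x[i], src3_4x[i])
        log.append((f"t{i}", None))
    # One wide write returns all 4 results.
    rf[dest] = dest_4x
    log.append(("t3+1", f"write {dest}"))
    return log


rf = {"r0": [0] * LANES, "r1": [1, 2, 3, 4], "r2": [5, 6, 7, 8], "r3": [9, 10, 11, 12]}
for cycle, action in run_group(rf, "r0", "r1", "r2", "r3"):
    print(f"{cycle:>5}: {action or 'port free'}")
```

The four "port free" cycles are exactly where another group's reads and write can be slotted in.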

This shows that you can get away with a single 4X wider read/write port,
at the cost of muxing/delay elements.

The pipeline might look like:

W0.write W1.exec_0
W2.read1* W1.exec_1
W2.read2* W1.exec_2
W2.read3* W1.exec_3
W1.write W2.exec_0*
W3.read1 W2.exec_1*
W3.read2 W2.exec_2*
W3.read3 W2.exec_3*
W2.write* W3.exec_0
... ...


although of course it can be extended to be deeper, tolerating more ALU latency, etc.
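
The interleaving in that pipeline follows a simple rule, which the sketch below generates. It assumes exactly the 4-cycle pattern shown (one write slot followed by three read slots, with execution one group ahead of the reads); the function name schedule is mine, and the asterisks marking W2 are omitted.

```python
def schedule(num_cycles):
    """Return one (port_op, alu_op) pair per cycle for the 4-cycle pattern."""
    rows = []
    for c in range(num_cycles):
        k, phase = divmod(c, 4)
        if phase == 0:
            # First slot of every 4-cycle window: group k writes its results.
            port = f"W{k}.write"
        else:
            # Remaining three slots: group k+2 reads its three sources.
            port = f"W{k + 2}.read{phase}"
        # Meanwhile the ALU works on group k+1, one element per cycle.
        alu = f"W{k + 1}.exec_{phase}"
        rows.append((port, alu))
    return rows


for port, alu in schedule(9):
    print(f"{port:<10} {alu}")
```

Each cycle produces exactly one port operation and one ALU operation, so neither resource is ever idle or oversubscribed in steady state.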

= 4X Wider versus 4X Time-skewed =

This register file port reduction can be obtained in at least two ways:

* 4x wider
** by performing 4X wider reads and writes
** in a single "cycle" of RF access
** to and from 4X wider temporary registers
** and then muxing 1/4 of those wider registers in any given cycle of execution

* 4x time skewing
** by having 4 register files
** each skewed from the others by 1 cycle
** e.g. providing the same register number to each, but delayed 1 cycle for each skewing

These are very similar.

The 4X wider approach has the advantage of being very easy to express in conventional synthesized logic,
even though it might be marginally more expensive in full custom logic.
As of 2011, time-skewed register files are hard to express in RTL languages.
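
To make the time-skewed variant a little more concrete, here is a small behavioral sketch. It is a software model under my own assumptions, not RTL: there are 4 narrow banks, bank b holds element b of every register, and each request on the shared address stream reaches bank b exactly b cycles after it is issued. The function name skewed_accesses and the request labels are hypothetical.

```python
LANES = 4  # number of skewed register file banks (assumed, as in the text)


def skewed_accesses(requests):
    """Map each cycle to the bank accesses that happen in it.

    requests is a list of (issue_cycle, op, reg) on the single shared
    address stream. Bank b sees each request b cycles after it was issued.
    """
    timeline = {}
    for issue_cycle, op, reg in requests:
        for bank in range(LANES):
            timeline.setdefault(issue_cycle + bank, []).append((bank, op, reg))
    return timeline


# One request per cycle, mirroring the write/read/read/read pattern above.
requests = [(0, "write", "r0"), (1, "read", "r1"), (2, "read", "r2"), (3, "read", "r3")]
for cycle, accesses in sorted(skewed_accesses(requests).items()):
    print(cycle, sorted(accesses))
```

As long as at most one request is issued per cycle, no bank ever sees two accesses in the same cycle, so each bank needs only a single narrow port. Element b arrives b cycles late, which is just when the single ALU is ready for it.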

= Why 4X? =

4X is a convenient power of two,
and [[powers of 2 are convenient in computer architecture]].

4X matches 3 reads and 1 write
for a register file with a single read/write port performing one access per cycle.
This conveniently matches multiply-add, A := B*C+D, one of the most common operations in graphics
- and GPUs were one of the first places this occurred.

3X could be a possibility, if restricted to [[2-input operations]].
But I suspect that 4X is just plain nicer.

Larger than 4X is also a possibility. But, again, 4X is the smallest convenient size
that reaps most of the benefits.
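
The arithmetic behind this section can be written down as a tiny port budget. This is my own simplification, assuming a single wide port that performs one access per cycle and an instruction group that occupies the ALU for one cycle per element; the function names are hypothetical.

```python
def min_group_cycles(reads, writes):
    """Fewest cycles per instruction group such that a single wide port,
    doing one access per cycle, covers every source read and result write."""
    return reads + writes


def round_up_to_power_of_2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


# Multiply-add: 3 reads + 1 write fits a 4-cycle group exactly.
print(min_group_cycles(3, 1))
# 2-input operations: 2 reads + 1 write would fit in 3 cycles,
# but the nearest convenient power of two is still 4.
print(round_up_to_power_of_2(min_group_cycles(2, 1)))
```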
