Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, July 14, 2011

Things you can do with a double precision multiplier

Consider a processor that has hardware sufficient to do a double precision multiplication.
Or perhaps a multiply-add.

([[WLOG]] we will talk about floating point; a similar discussion applies to 2X wide integer multiplication.)

A double precision multiplier contains enough multiplier hardware to form 4 single precision products.
Compare the partial product array for 64x64 to that for 32x32.
Picture something like the following, except using byte wide multipliers as subcomponents rather than single bits:
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX
----+-----
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX

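or, briefly, if the operands are (Ahi+Blo)*(Xhi+Ylo), where A and B are the high and low halves of one operand and X and Y those of the other, the four partial products are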
AY BY
AX BX

The summation network is similarly larger, although the final [[CPA (Carry Propagate Adder)]] is "only" 2X wider
(more than 2X the gates, but only slightly deeper in logic depth).
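
The decomposition is easy to check with integers, which behave like the mantissa part of the floating point case. Here is a minimal sketch, assuming 64-bit unsigned operands split into 32-bit halves; the names `HALF_BITS` and `mul_from_halves` are mine, chosen for illustration, and do not come from any hardware description:

```python
HALF_BITS = 32                      # assumed width of one "single precision" half
MASK = (1 << HALF_BITS) - 1

def mul_from_halves(m, n):
    """Multiply two 2*HALF_BITS-wide unsigned integers using four half-width products."""
    a, b = m >> HALF_BITS, m & MASK   # m = a*2^HALF_BITS + b  (Ahi, Blo)
    x, y = n >> HALF_BITS, n & MASK   # n = x*2^HALF_BITS + y  (Xhi, Ylo)
    # The four half-width partial products: AX, AY, BX, BY.
    ax, ay, bx, by = a * x, a * y, b * x, b * y
    # Summation network: shift each product to its weight, then add.
    return (ax << (2 * HALF_BITS)) + ((ay + bx) << HALF_BITS) + by
```

Note that the two cross products AY and BX land at the same weight, while AX and BY sit a full operand width apart. That is the fixed alignment the summation network is built around.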

Given such a double precision multiplier, we can synthesize several different types of single precision operations:

* [[LINE]]: v=p*u+q
:: this is just an [[FMA]], with possibly a different arrangement of inputs
* [[PLANE]]: w=p*u+q*v+r
:: this has 2 multiplications, although the summation network must be adjusted to align the products differently. This can be achieved by shifting the input to the upper half of the multiplier array.
* [[LRP]] or [[BLEND]]: w=u*x+v*(1-x)
:: This is like [[PLANE]], except that the second multiplier operand, 1-x, is calculated rather than supplied. Perhaps it could be formed the way the 2X, etc. multiples are formed for advanced [[Booth encoding]]?
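
One way to see that these three operations share a datapath is to write each in terms of a single two-product primitive. The sketch below is a behavioral model only: it works on exact values, so the floating point alignment and rounding issues mentioned above are ignored. `two_products` and the other function names are hypothetical, made up for this illustration rather than taken from any real instruction set:

```python
from fractions import Fraction

def two_products(m0, n0, m1, n1, addend):
    """Hypothetical primitive: two of the four sub-multipliers plus one addend."""
    return m0 * n0 + m1 * n1 + addend

def line(p, u, q):
    # v = p*u + q: an FMA; the second product is unused (forced to zero).
    return two_products(p, u, 0, 0, q)

def plane(p, u, q, v, r):
    # w = p*u + q*v + r
    return two_products(p, u, q, v, r)

def lrp(u, v, x):
    # w = u*x + v*(1-x): PLANE with a computed second multiplier and no addend.
    return two_products(u, x, v, 1 - x, 0)

# Usage: blend one quarter of the way from 20 towards 10.
blended = lrp(Fraction(10), Fraction(20), Fraction(1, 4))
```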

The operations above have the 4 multiplications of the double precision multiplier available,
but use only 2 of them.
We can be more aggressive and try to use all 4, but then the summation network needs considerable adjustment.

An arbitrary 2D outer product:
[[OUTER2]] = (a b) X (x y) =
ax ay
bx by
although this causes some difficulties because it needs to write back an output twice as wide as its inputs.

[[CMUL (Complex multiply)]] can also be achieved using this multiplier: (a+bi) X (x+yi) = (ax-by) + (ay+bx)i
although once again there are difficulties with alignment in the summation network.
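
To make the relationship concrete: the complex multiply uses exactly the four products of the 2D outer product, combined in pairs. This is a minimal behavioral sketch that ignores the alignment and rounding difficulties noted above; the function names `outer2` and `cmul` follow the labels used here but are otherwise my own illustration:

```python
def outer2(a, b, x, y):
    """The four products of (a b) X (x y), in row-major order: ax, ay, bx, by."""
    return a * x, a * y, b * x, b * y

def cmul(a, b, x, y):
    """(a + b*i) * (x + y*i), returned as the pair (real, imag)."""
    ax, ay, bx, by = outer2(a, b, x, y)
    # Real part subtracts the by product; imaginary part adds the cross products.
    return ax - by, ay + bx
```

The outer product returns four values from four scalar inputs, twice as many as either input vector holds, while the complex multiply folds them back down to two.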