Consider a processor that has enough hardware to perform a double precision multiplication,
or perhaps a multiply-add.
([[WLOG]] we will talk about floating point; a similar discussion applies to 2X wide integers.)
A double precision multiplier contains a multiplier array large enough to form 4 single precision products.
Let us draw the partial products array for 64x64 compared to 32x32,
using byte wide multipliers (each X below) as subcomponents rather than single bits:
        XXXX/XXXX
       X  X/X  X
      X  X/X  X
     XXXX/XXXX
    ----+-----
    XXXX/XXXX
   X  X/X  X
  X  X/X  X
 XXXX/XXXX
or, briefly, if the operands are (Ahi+Blo)*(Xhi+Ylo):
   AY BY
  AX BX
The summation network is correspondingly larger, although the final [[CPA (Carry Propagate Adder)]] is "only" 2X wider
(more than 2X the gates, but only a bit deeper in logic depth).
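The decomposition into four half-width products can be sketched in software. This is only an illustration using unsigned integers (the 2X wide integer case mentioned above); the function names `split` and `wide_multiply` are made up for this example, and real hardware sums the partial products in a carry-save network rather than with ordinary additions.

```python
def split(value, half_bits):
    """Split an unsigned value into (high, low) halves of half_bits each."""
    mask = (1 << half_bits) - 1
    return value >> half_bits, value & mask


def wide_multiply(m, n, half_bits=32):
    """Multiply two unsigned (2*half_bits)-bit values using four half-width products."""
    a, b = split(m, half_bits)  # m = a * 2**half_bits + b  (Ahi, Blo)
    x, y = split(n, half_bits)  # n = x * 2**half_bits + y  (Xhi, Ylo)

    # The four half-width partial products from the drawing above.
    ax, ay, bx, by = a * x, a * y, b * x, b * y

    # Stand-in for the summation network: align each product by its weight.
    return (ax << (2 * half_bits)) + ((ay + bx) << half_bits) + by


print(hex(wide_multiply(0x123456789ABCDEF0, 0x0FEDCBA987654321)))
```

The two middle products AY and BX share the same weight, which is why they appear stacked in the same columns of the drawing.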
Given such a double precision multiplier, we can synthesize several different types of single precision operations:
* [[LINE]]: v=p*u+q
:: this is just an [[FMA]], with possibly a different arrangement of inputs
* [[PLANE]]: w=p*u+q*v+r
:: this has 2 multiplications, although the sum network must be adjusted to align the products differently. This can be achieved by shifting the input to the upper half of the multiplier array
* [[LRP]] or [[BLEND]]: w=u*x+v*(1-x)
:: This is like [[PLANE]], except that the second multiplier, (1-x), is calculated from x rather than supplied as an input. Perhaps it can be formed like the 2X, etc. multiples used in advanced [[Booth encoding]]?
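The three operations above can be written out as a small behavioral sketch. The function names are chosen here for illustration, and Python floats are double precision with a rounding after every operation, so this models only the arithmetic relationships, not single precision or the single rounding of a fused unit.

```python
def line(p, u, q):
    """LINE: one product plus an addend, i.e. an FMA-shaped operation."""
    return p * u + q


def plane(p, u, q, v, r):
    """PLANE: two products plus an addend, using 2 of the 4 sub-multipliers."""
    return p * u + q * v + r


def lrp(u, v, x):
    """LRP/BLEND: PLANE with the second multiplier derived from x."""
    one_minus_x = 1 - x  # calculated inside the unit, not supplied as an input
    return plane(u, x, v, one_minus_x, 0)


print(line(2.0, 3.0, 1.0))
print(plane(2.0, 3.0, 4.0, 5.0, 1.0))
print(lrp(2.0, 4.0, 0.25))
```

Expressing LRP through PLANE makes the point of the text explicit: the only extra work is producing (1-x) before it reaches the multiplier array.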
The operations above have all 4 multiplications of the double precision multiplier available,
but use only 2 of them.
We can be more aggressive, trying to use all 4 - but then the summation network needs considerable adjustment.
An arbitrary 2D outer product:
 [[OUTER2]] = (a b) X (x y) =
     ax ay
     bx by
although this causes some difficulties because it needs to write back an output twice as wide as its inputs.
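As a rough sketch of the write-back problem, the outer product can be modeled with plain Python values. The name `outer2` and the tuple representation of a 2-element input are assumptions made for this example only.

```python
def outer2(ab, xy):
    """2D outer product: all four pairwise products of (a, b) and (x, y)."""
    a, b = ab
    x, y = xy
    return [[a * x, a * y],
            [b * x, b * y]]


result = outer2((2.0, 3.0), (5.0, 7.0))
elements_out = sum(len(row) for row in result)
print(result)
print(elements_out)  # four result elements versus two per input operand
```

Each input operand carries two elements but the result carries four, which is the doubling of output width noted above.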
[[CMUL (Complex multiply)]] can also be achieved using this multiplier: (a+bi) X (x+yi) = (ax-by) + (ay+bx)i
although once again there are difficulties with alignment in the summation network.
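A minimal sketch of the complex multiply, assuming the four products map one-to-one onto the four sub-multipliers. The name `cmul` is made up for this example, and the subtraction and addition stand in for whatever the adjusted summation network would do.

```python
def cmul(a, b, x, y):
    """Complex multiply (a+bi)*(x+yi) from four real products."""
    ax, ay, bx, by = a * x, a * y, b * x, b * y  # the four sub-multiplier outputs
    real = ax - by  # pairs the two products on one diagonal of the array
    imag = ay + bx  # pairs the two products on the other diagonal
    return real, imag


print(cmul(1.0, 2.0, 3.0, 4.0))
```

Note that the real part combines AX and BY, which sit at opposite corners of the partial product array, so they have to be brought to a common alignment before they can be summed.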