## Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

## Monday, October 18, 2010

### Inner Product Instructions

http://semipublic.comp-arch.net/wiki/index.php?title=Inner_product_instructions&action=edit

The mathematical inner product or dot product is often desired in vector instruction sets. However, it poses some challenges for the instruction set designer.

= Integer Dot Product =

For example, consider an integer inner product. E.g. for 4 inputs
A0*B0 + A1*B1 + A2*B2 + A3*B3

If the arithmetic is 2's compliment, signed or unsigned, without overflow, this can be evaluated in any order. Indeed, one need not perform carry propagating adds for the products; we can combine the sum of the products with the sums involved in multiplication, in a redundant, carry-free, form.

However, if we are to detect overflow, or equivalently perform saturation, issues are raised. How are the calculations associated:
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))

Should we detect intermediate overflow, or should we ignore overflows that are "corrected" by subsequently adding in numbers of opposite sign. I.e. should we perform calculations as if using extended precision intermediate accumulators or temporary values.

If saturating, e.g. if using leftmost association, when we saturate for overflow should we stay saturated, or should we allow the saturated value to be brought back into range? Etc.

All of these boundary cases are annoying.

Glew recommendation: for small vectors, either don't do overflow detection, or calculate as if in extended precision and detect at the overflow at the end.
* Pro: easy to describe and understand
* Con: gives different results compared to doing scalar calculations in straightforward loops (with overflow detection)

Consider calculations as if in a specific order, typically leftmost or tree.

A hybrid approach is possible: calculate in the fastest possible way, e.g. tree, but then have a slower calculation detecting overflow. CON: extra power.

= Floating Point Dot Product =

Floating point dot product instructions have the same issues of order dependence as described for integer dot product with overflow detection - with the additional issue of rounding. Even in the absence of overflow,
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))

As before, some instruction sets define particular orders of evaluation, while some leave it undefined. Some also define an "infinite intermediate precision" form of inner product, where the sums are formed without rounding, and only rounded (and overflow detected) at the end.

[[Superaccumulators]] are a more general form of such arithmetic.

In the absence of such special support, floating point dot product instructions
with linear, e.g. leftmost, associativity are little faster than a succession of [[FMA]] or [[FMAC]] instructions.

= Alternative Inner Product Support =

Alternative instructions that can be used to synthesize dot products and other
possibly order dependent [[vector reductions]]
may be motivated by the issues discussed above.

For example, horizontal or pairwise add instructions, such as Intel SSE3's PHADDW:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 packed vector output
A0+A1, A2+A3 ..., B0+B1, B2+B3, ...

Pairwise add could, of course, be done on a single vector operand, but then the
output would only be half-length.
Combining two vector operands in this manner produces a full-length vector output,
which is convenient.

Wiring-wise, if the A and B vector operands are interleaved
to allow [[vertical add]], this only requires wires sufficient to
route A1 to A0, and vece versa.

Note that horizontal add is "only" a combination of a vector interleave or shuffle operation,
and a vertical add. The interleave operation, however, wants to produce two vector outputs,
or else occupy two instructions (or produce a double width vector operand, or pack two singles into
double precision vector elements (but then what do you do for double precision inputs), or ...)
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
2 packed vector outputs
A0, B0, A2, B2 ...
A1, B1, A3, B3 ...
A0+A1, B0+B1, ...

Vertical, element by element, multiplication,
combined with sufficient horizontal adds to perform the sum,
builds a dot product.

Or, a single instruction may combine vertical multiplication and horizontal pairwise addition:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 result
A0*B0+A1*B1, A2*B2+A3*B3, ...

The putative problem of having only half as many vector elements in the PMADDUBSW output is
handled by doubling the precision,
taking 8 bits as input and producing a packed 16 bit vector as output.
Unfortunately, that technique does not work so well for floating point.

= Vector Parallel versus Vector Sequential =

Defining dot product with leftmost associativity
for purposes of overflow and rounding,
is well suited to a [[vector sequential]] implementation.

[[Vector parallel]] implementations are better suited to horizontal adds,
and other parallel orders of summation.

Modern instruction sets such as SSE tend to have [[vector parallel]] implementations.

It is possible to mix vector parallel and vector sequential,
although a latency penalty is paid when transitioning from vector sequential to vector parallel.

Thus we begin sliding down [[The Slippery Slope of Vector Instruction Sets]]:
there are so many different possible definitions of the vector instructions,
with different evaluation orders,
suited to different microarchitectures.
It is easy to define the most basic vector instructions.
Completing the coverage of vector instructions is more challenging.