Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, October 18, 2010

Inner Product Instructions

http://semipublic.comp-arch.net/wiki/index.php?title=Inner_product_instructions&action=edit


The mathematical inner product or dot product is often desired in vector instruction sets. However, it poses some challenges for the instruction set designer.

= Integer Dot Product =

For example, consider an integer inner product with 4 inputs:
A0*B0 + A1*B1 + A2*B2 + A3*B3

If the arithmetic is 2's complement, signed or unsigned, without overflow detection, this can be evaluated in any order. Indeed, one need not even perform carry-propagating adds for the individual products; the sum of the products can be merged with the partial-product sums inside the multipliers, in a redundant, carry-free form.

However, if we are to detect overflow, or equivalently perform saturation, issues arise. How should the calculations be associated:
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))

Should we detect intermediate overflow, or should we ignore overflows that are "corrected" by subsequently adding in numbers of the opposite sign? I.e., should we perform the calculations as if using extended-precision intermediate accumulators or temporary values?

If saturating, e.g. if using leftmost association, when we saturate for overflow should we stay saturated, or should we allow the saturated value to be brought back into range? Etc.

All of these boundary cases are annoying.
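To make the boundary cases concrete, here is a minimal C sketch (made-up 16-bit values, not modeled on any particular ISA) in which stepwise saturation and as-if-extended-precision accumulation give different answers for the same partial products:

 #include <stdint.h>
 #include <stdio.h>

 /* Saturating add into the int16_t range. */
 static int16_t sat_add(int32_t x, int32_t y)
 {
     int32_t s = x + y;
     if (s >  32767) return  32767;
     if (s < -32768) return -32768;
     return (int16_t)s;
 }

 int main(void)
 {
     /* Products chosen so the running sum overflows, then comes back into range. */
     int32_t p[4] = { 30000, 10000, -15000, 1000 };

     /* Leftmost association, saturating at every intermediate step. */
     int16_t stepwise = sat_add(sat_add(sat_add(p[0], p[1]), p[2]), p[3]);

     /* As-if-extended-precision: accumulate in 32 bits, saturate only at the end. */
     int16_t at_end = sat_add(p[0] + p[1] + p[2] + p[3], 0);

     printf("stepwise = %d, at end = %d\n", stepwise, at_end);   /* 18767 vs 26000 */
     return 0;
 }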

Glew recommendation: for small vectors, either don't do overflow detection, or calculate as if in extended precision and detect overflow at the end.
* Pro: easy to describe and understand
* Con: gives different results compared to doing scalar calculations in straightforward loops (with overflow detection)

Alternatively, define the calculations as if performed in a specific order, typically leftmost or tree.

A hybrid approach is possible: calculate in the fastest possible way, e.g. tree, but also have a slower calculation that detects overflow. Con: extra power.

= Floating Point Dot Product =

Floating point dot product instructions have the same issues of order dependence as described for integer dot product with overflow detection - with the additional issue of rounding. Even in the absence of overflow,
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
may give different answers.
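A small C sketch (hypothetical single-precision values) showing the two associations producing different rounded results:

 #include <stdio.h>

 int main(void)
 {
     /* All products are exact; only the order of the additions differs. */
     float p[4] = { 1.0e8f, 4.0f, 4.0f, 4.0f };

     float left = ((p[0] + p[1]) + p[2]) + p[3];   /* each +4 is rounded away: 1.0e8 */
     float tree = (p[0] + p[1]) + (p[2] + p[3]);   /* 4+4=8 survives: 1.00000008e8 */

     printf("left = %.0f, tree = %.0f\n", left, tree);
     return 0;
 }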

As before, some instruction sets define particular orders of evaluation, while some leave it undefined. Some also define an "infinite intermediate precision" form of inner product, where the sums are formed without rounding, and only rounded (and overflow detected) at the end.

[[Superaccumulators]] are a more general form of such arithmetic.

In the absence of such special support, floating point dot product instructions
with linear, e.g. leftmost, associativity are little faster than a succession of [[FMA]] or [[FMAC]] instructions.
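For comparison, here is a sketch of that leftmost-associative chain in C, using the C99 fmaf() fused multiply-add (link with -lm); every step depends on the previous accumulator, so a dot product instruction defined this way cannot be much faster:

 #include <math.h>
 #include <stdio.h>

 /* Leftmost-associative dot product as a chain of FMAs:
    each fmaf computes a[i]*b[i] + acc with a single rounding. */
 static float dot4_fma(const float a[4], const float b[4])
 {
     float acc = a[0] * b[0];
     acc = fmaf(a[1], b[1], acc);
     acc = fmaf(a[2], b[2], acc);
     acc = fmaf(a[3], b[3], acc);   /* serially dependent chain */
     return acc;
 }

 int main(void)
 {
     float a[4] = { 1, 2, 3, 4 }, b[4] = { 5, 6, 7, 8 };
     printf("%g\n", dot4_fma(a, b));   /* 5 + 12 + 21 + 32 = 70 */
     return 0;
 }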

= Alternative Inner Product Support =

Alternative instructions that can be used to synthesize dot products and other
possibly order dependent [[vector reductions]]
may be motivated by the issues discussed above.

For example, horizontal or pairwise add instructions, such as Intel SSSE3's PHADDW:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 packed vector output
A0+A1, A2+A3 ..., B0+B1, B2+B3, ...

Pairwise add could, of course, be done on a single vector operand, but then the
output would only be half-length.
Combining two vector operands in this manner produces a full-length vector output,
which is convenient.
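For illustration, the same two-in, one-out pairwise add is exposed in C by the SSSE3 intrinsic _mm_hadd_epi16 (compile with -mssse3); a small sketch:

 #include <tmmintrin.h>
 #include <stdio.h>

 int main(void)
 {
     __m128i a = _mm_setr_epi16( 1,  2,  3,  4,  5,  6,  7,  8);   /* A0..A7 */
     __m128i b = _mm_setr_epi16(10, 20, 30, 40, 50, 60, 70, 80);   /* B0..B7 */

     /* One full-length result: A0+A1, A2+A3, A4+A5, A6+A7, B0+B1, ..., B6+B7 */
     __m128i r = _mm_hadd_epi16(a, b);

     short out[8];
     _mm_storeu_si128((__m128i *)out, r);
     for (int i = 0; i < 8; i++)
         printf("%d ", out[i]);   /* 3 7 11 15 30 70 110 150 */
     printf("\n");
     return 0;
 }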

Wiring-wise, if the A and B vector operands are interleaved
to allow [[vertical add]], this only requires wires sufficient to
route A1 to A0, and vice versa.

Note that horizontal add is "only" a combination of a vector interleave or shuffle operation
and a vertical add. The interleave operation, however, wants to produce two vector outputs,
or else occupy two instructions (or produce a double-width vector operand, or pack two single-precision
values into double-precision vector elements (but then what do you do for double-precision inputs?), or ...):
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
2 packed vector outputs
A0, B0, A2, B2 ...
A1, B1, A3, B3 ...
Vertical add
A0+A1, B0+B1, ...

Vertical, element by element, multiplication,
combined with sufficient horizontal adds to perform the sum,
builds a dot product.
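A sketch of that decomposition with SSE2/SSSE3 intrinsics: a vertical multiply followed by log2(N) rounds of pairwise horizontal adds. Note that _mm_mullo_epi16 keeps only the low 16 bits of each product, i.e. it wraps rather than widening, which is exactly the overflow question discussed above:

 #include <tmmintrin.h>
 #include <stdio.h>

 int main(void)
 {
     __m128i a = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
     __m128i b = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);

     __m128i prod = _mm_mullo_epi16(a, b);     /* vertical multiply (wraps) */

     __m128i s = _mm_hadd_epi16(prod, prod);   /* three rounds of pairwise adds */
     s = _mm_hadd_epi16(s, s);
     s = _mm_hadd_epi16(s, s);                 /* element 0 holds the full sum */

     printf("%d\n", _mm_extract_epi16(s, 0));  /* 1+4+9+16+25+36+49+64 = 204 */
     return 0;
 }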

Or, a single instruction may combine vertical multiplication and horizontal pairwise addition:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 result
A0*B0+A1*B1, A2*B2+A3*B3, ...

E.g. Intel SSSE3's PMADDUBSW.
The putative problem of having only half as many vector elements in the PMADDUBSW output is
handled by doubling the precision,
taking 8-bit inputs and producing a packed 16-bit vector as output.
Unfortunately, that technique does not work so well for floating point.
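For illustration, the fused form as exposed by the SSSE3 intrinsic _mm_maddubs_epi16 (PMADDUBSW): unsigned 8-bit elements of one operand are multiplied by signed 8-bit elements of the other, and adjacent products are summed, with signed saturation, into 16-bit lanes:

 #include <tmmintrin.h>
 #include <stdio.h>

 int main(void)
 {
     __m128i a = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                               9, 10, 11, 12, 13, 14, 15, 16);   /* treated as unsigned */
     __m128i b = _mm_set1_epi8(2);                               /* treated as signed */

     /* Each 16-bit lane is a0*b0 + a1*b1 for one adjacent pair;
        the pairwise sum saturates (not exercised by these small values). */
     __m128i r = _mm_maddubs_epi16(a, b);

     short out[8];
     _mm_storeu_si128((__m128i *)out, r);
     for (int i = 0; i < 8; i++)
         printf("%d ", out[i]);   /* 6 14 22 30 38 46 54 62 */
     printf("\n");
     return 0;
 }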

= Vector Parallel versus Vector Sequential =

Defining dot product with leftmost associativity
for purposes of overflow and rounding
is well suited to a [[vector sequential]] implementation.

[[Vector parallel]] implementations are better suited to horizontal adds,
and other parallel orders of summation.

Modern instruction sets such as SSE tend to have [[vector parallel]] implementations.

It is possible to mix vector parallel and vector sequential,
although a latency penalty is paid when transitioning from vector sequential to vector parallel.

Thus we begin sliding down [[The Slippery Slope of Vector Instruction Sets]]:
there are so many different possible definitions of the vector instructions,
with different evaluation orders,
suited to different microarchitectures.
It is easy to define the most basic vector instructions.
Completing the coverage of vector instructions is more challenging.

WebDAV for my wiki?

I want to have drawings, etc., for my http://comp-arch.net wiki.

But I have thrashed on how to get them. I've tried too many different drawing tools for wiki - all have weaknesses.

I am now leaning towards WebDAV. Indeed, I am leaning towards rebuilding my wiki around WebDAV.

WebDAV basically is a filesystem implemented across web protocols. If you are lucky enough, your favorite drawing program knows how to access a file stored on WebDAV. Or, you may have a tool that allows you to mount WebDAV as a filesystem on Windows or Linux or whatever, allowing almost any tool to access it.

The problem with WebDAV is that it IS a filesystem. Just exporting a filesystem across the web is not very interesting: listings of files, opened in whatever tools you want - PowerPoint, Visio, Inkscape, etc.

But, I think if I write my own WebDAV server, combined with a wiki, it can be better than this. When a file is stored to WebDAV, I can arrange to convert it (if I understand its format) into a format that can be directly viewed by most, if not all, web browsers. E.g. GIF, although I am more hopeful about a vector format like SVG. Plus, I can provide a link to the original format, .PPT or .VIS or whatever.

Not just for drawings.

Wiki text might be edited by a wiki editor, or potentially accessed by WebDAV and edited in your favorite text editor. My favorite would be EMACS. In org mode?

Heck, Word and PowerPoint and PDF and PS could be uploaded, although I would prefer not - viewed in the browser, if the server knows how.

---

I was planning on a WebDAV server with git or hg or bzr as the versioning backend.

All I really want from WebDAV is the ability to click on an edit link and, on your favorite OS, open whatever program is associated with that file's type.

E.g. for a .PPT open PowerPoint or OpenOffice; for a .VIS (or whatever Visio uses) open Visio (if there's any Open Source tool that understands Visio files, I want to hear about it). Ditto Inkscape. .TXT or .WIKI in Emacs or your favorite text editor.

And, similarly, on saving the file, have my own hooks to translate it into some other format.

I'll probably want to provide some canned, in a web page, wiki text editor. In fact, I probably want to discourage text being written except in a standard wiki markup. (Although having more than one wiki markup has some attractions. E.g. because I crosspost between wiki and this blog. At least I want reStructuredText, since I am in the habit of cross-posting between wiki pages and comp.arch on Usenet. RST is much more readable than standard mediawiki or twiki markup.)

And maybe even a canned, in a web page, drawing editor.

But, my motivation for WebDAV is that I haven't found a drawing editor that makes me happy. Why reinvent the wheel?

The main problem with existing drawing editors is that their drawings cannot be displayed inline in web pages. But once I realized I can set my own put-time hooks in my own WebDAV server, that seems more doable.