The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, September 21, 2010


= Packed =

For want of a better term, I will use the phrase [[Packed Operations]] to refer to instruction set extensions such as Intel MMX, SSE, PowerPC AltiVec, etc.:
typically operating on 64 or 128 bit data (although people did propose [[32 bit MMX]], and 256 bit AVX is coming soon in 2010).

Typically operating on
* 8x8b or 16x8b
* 4x16b or 8x16b
* 2x32b or 4x32b

Typically operating SIMD-style inside ALU operations,
but typically lacking scatter/gather memory operations.

= Packed vs Vector Lanes =

I want to contrast this with microarchitectures that have parallel vector lanes.
(NOT [[vector sequential]] microarchitectures.)

Packed Operations are fundamentally scalar pipelines. They just happen to have wide scalar datapaths, e.g. 64 bits.
Since there are typically no 128 bit scalar datapaths, 128 bit packed operations have evolved beyond scalar;
but typically the datapath is still designed as if it were a single wide ALU datapath.

Parallel vector lane microarchitectures, on the other hand, are typically formed out of multiple independent ALUs.

Packed ALUs are typically a single ALU that is subdivided to perform independent packed operations.
E.g. one might take a 64-bit wide adder, and add special buffer bits at every byte boundary.
By controlling the values of these buffer bits, we can control whether carry will propagate across 8b, 16b, or 32b boundaries.
I.e. a 64 bit wide ALU with such byte boundary buffer bits
is really a 72 bit wide ALU,
capable of performing 8x8b, 4x16b, and 2x32b, as well as 64 bit addition.
(For that matter, odd sizes such as 24bit arithmetic can also be achieved.)
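
The buffer-bit trick above can be modeled in software. The following is a minimal sketch, not a description of any particular hardware: it assumes one buffer bit sits just above each byte (8 bytes * 9 bits = 72 bits), and that setting the buffer bit to 1 in one operand and 0 in the other lets a carry pass through, while 0 in both operands absorbs it. The names `spread`, `squeeze`, and `packed_add` are made up for this illustration.

```python
BYTES = 8   # a 64-bit datapath holds 8 bytes
SLOT = 9    # each byte gets 8 data bits plus 1 buffer bit above it

def spread(x, buffer_bits):
    """Place each byte of x in a 9-bit slot; buffer_bits[i] fills the bit above byte i."""
    wide = 0
    for i in range(BYTES):
        byte = (x >> (8 * i)) & 0xFF
        wide |= (byte | (buffer_bits[i] << 8)) << (SLOT * i)
    return wide

def squeeze(wide):
    """Drop the buffer bits and repack the 8 data bytes into 64 bits."""
    x = 0
    for i in range(BYTES):
        x |= ((wide >> (SLOT * i)) & 0xFF) << (8 * i)
    return x

def packed_add(a, b, elem_bits):
    """Add a and b as packed elements of elem_bits each, using one 72-bit add."""
    elem_bytes = elem_bits // 8
    # Buffer bit = 1 (in operand a only) propagates a carry to the next byte;
    # buffer bit = 0 in both operands absorbs it at an element boundary.
    propagate = [0 if (i + 1) % elem_bytes == 0 else 1 for i in range(BYTES)]
    absorb = [0] * BYTES
    total = spread(a, propagate) + spread(b, absorb)
    total &= (1 << (SLOT * BYTES)) - 1   # keep only the 72-bit datapath
    return squeeze(total)
```

For example, `packed_add(0xFF, 1, 8)` wraps the low byte to zero without disturbing its neighbor, whereas `packed_add(0xFF, 1, 16)` lets the carry ripple into the second byte. Under this sketch's boundary rule, `elem_bits=24` yields two 24-bit elements plus a 16-bit remainder.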

Similarly for 128 bit packed datapaths, although probably nobody wants to build a 128 bit wide carry-propagate adder.

Whereas in a vector lane microarchitecture, the ALUs are independent.
E.g., in a 512 bit wide vector machine with 16 32b lanes,
there are 16 32b wide adders.
In such a machine crossing vector lanes is difficult.
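
For contrast with a subdivided packed ALU, here is a minimal sketch of the lane model, assuming the 16 lanes of 32 bits from the example. The function names are made up for illustration. Each lane has its own adder, and nothing inside a lane can see a neighboring lane's operands or carry.

```python
LANES = 16
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1

def lane_add(x, y):
    """One lane's private 32-bit adder; the carry out is simply dropped."""
    return (x + y) & LANE_MASK

def vector_add(va, vb):
    """512-bit vector add as 16 independent 32-bit adds, one per lane."""
    assert len(va) == len(vb) == LANES
    return [lane_add(x, y) for x, y in zip(va, vb)]
```

In this model there is no carry chain to reconfigure: moving data from one lane to another would need a separate mechanism, which is why lane crossing is the hard part.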

Of course there is a spectrum of possibilities.
For example, pairs of even and odd vector lanes may be combined to perform double precision arithmetic.
Or, one may have vector lanes, where the operations within the lanes are 128b packed operations.

(128b packed is particularly pleasant as a lane width, since it supports the common 4x32b vectors, as well as 2x64b complex.)

Even on a packed datapath, certain operations that move data between bits are difficult.
E.g. shifts require special handling of buffer bits.
It is more difficult to partition a 64b*64b integer multiplier into 32b*32b multipliers,
than it is to simply add buffer bits for add - nevertheless, it can be done.
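
To illustrate why the shifts mentioned above need special handling, here is a minimal software analogue. It uses masking rather than hardware buffer bits, so it is a sketch of the problem, not of the circuit: a plain 64-bit shift drags bits across element boundaries, and those stray bits have to be cleared. The names `lane_mask`, `packed_shl`, and `packed_shr` are hypothetical.

```python
MASK64 = (1 << 64) - 1

def lane_mask(elem_bits, per_elem_mask):
    """Replicate a per-element bit mask across every element of a 64-bit word."""
    m = 0
    for i in range(64 // elem_bits):
        m |= per_elem_mask << (i * elem_bits)
    return m

def packed_shl(x, n, elem_bits):
    """Logical left shift of each element by n; bits leaving an element are discarded."""
    elem_all = (1 << elem_bits) - 1
    keep = (elem_all << n) & elem_all   # positions a shifted element may legitimately occupy
    return ((x << n) & MASK64) & lane_mask(elem_bits, keep)

def packed_shr(x, n, elem_bits):
    """Logical right shift of each element by n; bits arriving from the neighbor are cleared."""
    elem_all = (1 << elem_bits) - 1
    keep = elem_all >> n
    return (x >> n) & lane_mask(elem_bits, keep)
```

With 8-bit elements, `packed_shl(0x8001, 1, 8)` gives `0x0002`: the top bit of the second byte is dropped instead of leaking into the third byte.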

(Note that a 64b*64b integer multiplier can be partitioned into 4 32b*32b multipliers - roughly what is required for complex multiplication or an FFT 2-butterfly.)
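
The arithmetic behind that note can be sketched as follows. This is a minimal illustration, assuming unsigned operands and made-up function names: the 64b*64b product is the sum of four 32b*32b partial products at different shifts, and the same budget of four multiplies is what one complex multiplication needs.

```python
MASK32 = (1 << 32) - 1

def mul32(x, y):
    """One 32b*32b -> 64b multiplier block."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return x * y

def mul64_from_four(a, b):
    """64b*64b -> 128b product assembled from four 32b*32b partial products."""
    al, ah = a & MASK32, a >> 32
    bl, bh = b & MASK32, b >> 32
    ll, lh = mul32(al, bl), mul32(al, bh)
    hl, hh = mul32(ah, bl), mul32(ah, bh)
    return ll + ((lh + hl) << 32) + (hh << 64)

def complex_mul_from_four(ar, ai, br, bi):
    """(ar + i*ai) * (br + i*bi) using the same four multiplier blocks."""
    rr, ii = mul32(ar, br), mul32(ai, bi)
    ri, ir = mul32(ar, bi), mul32(ai, br)
    return rr - ii, ri + ir
```

The difference between the two uses is only in how the four partial products are combined: shifted and summed for the wide integer product, or paired off with a subtract and an add for the complex product.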

Floating point is even more difficult to partition. The multiplier array for an FMA can be shared between 32b and 64b.
The adder logic is more challenging.

Nevertheless: considerable sharing is possible on a packed datapath.
But at some point one graduates to separate ALUs for vector lanes.

In "separate ALUs", the operative term is "separated". Physically separated ALUs are part of a vector lane microarchitecture.

= Instruction Set Design Consequences =

In a packed architecture, one tends to have N bits, subdivided into more or fewer smaller pieces.

In vector architectures, one tends to have vectors of VL elements of a given width, for a total of VL * EW bits, where VL is the vector length and EW is the element width.

One might then have narrower operations that apply to vector elements,
e.g. a vector of VL 32b floats or a vector of VL 64b floats.

(Although one might have packed operations on the lane width vector elements.)

There may be operations that load a vector of 8, 16, or 32 bit elements,
packed in memory, into unpacked 64 bit elements within a vector register.
Whereas on a packed instruction set, one probably just loads the values without unpacking them.
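
The difference between the two kinds of load can be sketched like this. The sketch assumes little-endian memory and zero extension (a real instruction set might equally offer sign extension), and the function names are hypothetical.

```python
def load_unpacked(mem, offset, count, elem_bytes):
    """Vector-style load: read `count` packed elements from memory and
    zero-extend each one into its own 64-bit register element."""
    return [
        int.from_bytes(mem[offset + i * elem_bytes: offset + (i + 1) * elem_bytes], "little")
        for i in range(count)
    ]

def load_packed(mem, offset, reg_bytes=8):
    """Packed-style load: copy reg_bytes of memory into the register unchanged."""
    return int.from_bytes(mem[offset: offset + reg_bytes], "little")
```

Given `mem = bytes([1, 2, 3, 4, 5, 6, 7, 8])`, the unpacking load of four 16-bit elements produces four separate 64-bit values, while the packed load produces one 64-bit register image that still contains all eight bytes side by side.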

Consider, e.g. 16b wide floating point.
Many machines expand it into 32b floating point in registers.
But one can imagine ALUs that operate on 16b FP in registers.
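
As a small illustration of the "expand into 32b in registers" approach, the sketch below widens a 16-bit pattern to the 32-bit pattern of the same value. It assumes the 16b format is IEEE 754 binary16, which the text does not specify, and relies on the half-precision `e` format code in Python's `struct` module. The function names are made up.

```python
import struct

def half_bits_to_float(h):
    """Interpret a 16-bit pattern as IEEE 754 binary16 and return its value."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def expand_half_to_single_bits(h):
    """Widen a binary16 bit pattern to the binary32 bit pattern of the same value."""
    return struct.unpack("<I", struct.pack("<f", half_bits_to_float(h)))[0]
```

An ALU that operates on 16b FP directly would skip this widening step and keep twice as many elements in the same register width.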

I.e. packed operations are about using all of the bits of the register datapath.
Whereas vector lane operations are about vector operations on a limited number of datatypes.
