The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, October 16, 2010

Why Vectors Increase Frequency of Misalignment

= RISC handling of misaligned memory operations =

TBD: make this into a topic page of its wown.

It was a truism during the so-called [[RISC revolution]] that support for misaligned memory operations was unnecessary.
In particular, the rather expensive [[memory alignment mux]] could be greatly reduced in cost,
and the complexity involved in handling memory operands that cross cacheline boundaries or page boundaries could be reduced.
Compilers could either arrange to use [[naturally aligned]] data,
or they could arrange to use instruction sequences such as that below for data that might not be aligned:

loading possible misaligned data, address in rA
rAlo := rA & ~(size-1)
rAHi := rAlo + size

rDlo := load( rAlo )
rDhi := load( rAhi )

rAlobits := rA & (size-1)
rAlobits := size - rAlobits
rAlobitsMask := (1< rAhibits := size - rAlobits
rAhbitsMask := (1<
rDlo := rDlo & rAlobitsMask
rDhi := rDhi & rAhibitsMask

rD := rDlo | rDhi

You should get the idea (even in the likely case that there are errors above).

Frequently the compiler can simplify this. It nearly always knows the size, which is nearly always a power of 2.
It often knows the alignment, e.g. of a field in a struct whose alignment is known.
But there will arise cases where these things are not known.

Since the general case of compiled code to perform such boundary croossing support is rather complex,
RISC machines often combine many of these operations into a pair of instructions

rDlo := load_lo.size( rA )
rDhi := load_hi.size( rA )
rD := rDlo | rDhi

The final OR instruction to merge the low and high parts may be eliminated, e.g. by [[writing into instead of onto]] the destination register,
or by having an implicit merge.

rD := load_lo.size( rA )
rD := load_hi.size( rA )

TBD: describe the several ways of doing this compaction.
Note that machines with support for misaligned menory operations often use similar approoaches,
except that the may first attempt an aligned load, for example, and then fall back to such a misaligned sequence.
(TBD: describe forms of misaligned support.)

= Why Vectors Increase Frequency of Misalignment =

Unfortunately, the advent of small vector or packed instruction sets,
such as Intel MMX, SSE, or PowerPC Altivec,
led to an increase in the frequency of misaligned memory operations.

Briefly: users of vectors often want to load subvectors or slices of other vectors.
E.g. given a big vector V[0...1023] in memory, the user may want to operate 4 elements at a time.
V[0-3] would be aligned,
but V[1-4] would be misaligned.

The problem of motion estimation in digital video is a very good example of this.

== Motion Estimation and Misaligned Memory References ==

Digital video operates on sequences of frames, each frame beiing a MxN matrix of pixels

P[0,0] ... P[0,M-1]
... ... ...
P[M-1,0] ... P[M-1,N-1]

where (M,N) may be values such as CIF (288,252). Much larger videos are now common.

Digital video is usually compressed.
One common form of compression is to find parts of the image that seem to have moved between frames,
and then transmit, rather than the entire region that has moved, a motion vector,
and some correction values.
Motion compensation is the process of reconstructing the pixels from this information;
motion estimation is the proceess of finding such motion vectors and correction values.

Motion estimation is much more computationally intensive than motion compensation.

Most motion estimation algorithms involve a search.
They typically start with a block of pixels in Frame1

P[R,C] ... P[0,C+3]
... ... ...
P[R+3,0] ... P[R+3,C+3]

and compare it, in succession, to offset frames in Frame2
P[R+dR,C+dC] ... P[0+dR,C+3+dC]
... ... ...
P[R+3+dR,0+dC] ... P[R+3+dR,C+3+dC]

Typically the comparisons are done a row at a time, and then summed over all rows in the blocks:

tmp1 := load( Frame1.P[R,C..C+3] )
tmp2 := load( Frame2.P[R+dR,C+dC..C+3+dC] )

The exact search patterns vary by algorithm, but the offset values (dR,dC) nearly always include distances such as 0, 1, 2... etc pixels.
In fact, they often include sub-pixel granularity, such as 1/2 and 1/4 pixels.

It can therefore be seen that motion estimation naturally involves vectors that are not [[naturally aligned]].
Although the pixels may be naturally aligned, a group of 4 pixels P[1..4] is not.

There have been various attempts to mitigate the alignment issues.
For example, for motion estimation some forms of the [[MPSAD]] instruction may load double width slices, and extract all possible subslices.
These are a special form of [[convolution instructions]] or [[moving window instructions]].

However, this is toothpaste: when one source of vector misalignment has been cured, another crops up somewhere else.