Somewhat bittersweet: am today compacting and throwing out my patent plaques. Circa 60-70 of them - eventually I started asking Intel *NOT* to give me any more plaques. They just take up too much space, are a pain to move. Need the bookshelf space.
At the moment I am unscrewing the metal engraving. I'll keep the metal with the actual patent engraved on it. Maybe one day I will actually mount them. But I am going to throw out, discard, recycle, the wooden boards. I wonder if my daughter's school woodshop would want them.
Maybe I'll use the metal for something. Melt it down.
Like beating swords into ploughshares?
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
Saturday, October 30, 2010
Tuesday, October 19, 2010
Textual notation for indexed summation, etc.
I grew up with mathematical notations such as the indexed sum \( \sum_{i=1}^{N} A_i \), written with a big capital sigma.
Now, I really believe that our programming systems should support such mathematical notations.
But seeing as most don't, I often wrestle with textual notation.
Here's a pleasant textual notation I came up with today:
first, the usual sum or sigma:
sum for i from 1 to N of A[i]
to create a vector
v[i], i=1,N
::== vector for i from 1 to N of v[i]
combining
vector for j from 1 to N of
sum for i from 1 to N of A[j+i]*B[j+N-i]
other
integral for i from 1 to N of f(i)
union for i from 1 to N of f(i)
intersection for i from 1 to N of f(i)
product for i from 1 to N of f(i)
continued fraction for i from 1 to N of f(i)
::== f(1) + 1/(f(2) + 1/(f(3) + ... + 1/f(N)))
2D
array
for i from 1 to M,
for j from 1 to N
of A[i,j]
3D
array
for i from 1 to N,
for j from 1 to M,
for k from 1 to L
of A[i,j,k]
shaped
array
for i from 1 to N,
for j from i to 2*i,
for k from 1 to j
of A[i,j,k]
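For comparison, here is a rough sketch of how these read as Python comprehensions. N, A, and B are placeholder data, with dicts standing in for 1-origin arrays:

from math import prod

N = 4
A = {i: float(i) for i in range(1, 2 * N + 1)}      # indices 1..2N so A[j+i] stays in range
B = {i: float(i + 1) for i in range(1, 2 * N + 1)}

# sum for i from 1 to N of A[i]
s = sum(A[i] for i in range(1, N + 1))

# vector for i from 1 to N of A[i]
v = [A[i] for i in range(1, N + 1)]

# product for i from 1 to N of A[i]
p = prod(A[i] for i in range(1, N + 1))

# combining: vector for j ... of sum for i ... of A[j+i]*B[j+N-i]
c = [sum(A[j + i] * B[j + N - i] for i in range(1, N + 1))
     for j in range(1, N + 1)]

# shaped array: for i from 1 to N, for j from i to 2*i, for k from 1 to j
shaped = [[[0.0 for k in range(1, j + 1)]
           for j in range(i, 2 * i + 1)]
          for i in range(1, N + 1)]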
Monday, October 18, 2010
Inner Product Instructions
http://semipublic.comp-arch.net/wiki/index.php?title=Inner_product_instructions&action=edit
The mathematical inner product or dot product is often desired in vector instruction sets. However, it poses some challenges for the instruction set designer.
= Integer Dot Product =
For example, consider an integer inner product. E.g. for 4 inputs
A0*B0 + A1*B1 + A2*B2 + A3*B3
If the arithmetic is two's complement, signed or unsigned, without overflow detection, this can be evaluated in any order. Indeed, one need not perform carry-propagating adds for the products; we can combine the sum of the products with the sums involved in multiplication, in a redundant, carry-free form.
However, if we are to detect overflow, or equivalently perform saturation, issues are raised. How are the calculations associated:
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
Should we detect intermediate overflow, or should we ignore overflows that are "corrected" by subsequently adding in numbers of opposite sign? I.e., should we perform calculations as if using extended-precision intermediate accumulators or temporary values?
If saturating, e.g. if using leftmost association, when we saturate for overflow should we stay saturated, or should we allow the saturated value to be brought back into range? Etc.
All of these boundary cases are annoying.
Glew recommendation: for small vectors, either don't do overflow detection, or calculate as if in extended precision and detect the overflow at the end.
* Pro: easy to describe and understand
* Con: gives different results compared to doing scalar calculations in straightforward loops (with overflow detection)
Alternatively, define the calculation as if performed in a specific order, typically leftmost or tree.
A hybrid approach is possible: calculate in the fastest possible way, e.g. tree, but also run a slower calculation that detects overflow. Con: extra power.
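As a concrete illustration of the recommendation above, a minimal sketch in Python: an unbounded int stands in for the wide hardware accumulator, and the 16-bit element width is just an assumption for the example:

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def dot_saturate_at_end(a, b):
    # exact arithmetic internally: Python ints cannot overflow, standing in
    # for an extended precision accumulator
    total = sum(x * y for x, y in zip(a, b))
    # detect overflow / saturate only once, on the final sum
    return max(INT16_MIN, min(INT16_MAX, total))

# intermediate sums exceed 16 bits, but the final result is in range
print(dot_saturate_at_end([32767, 32767, -32768, -32767], [1, 1, 1, 1]))   # -1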
= Floating Point Dot Product =
Floating point dot product instructions have the same issues of order dependence as described for integer dot product with overflow detection - with the additional issue of rounding. Even in the absence of overflow,
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
may give different answers.
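A quick illustration in Python (IEEE double precision), with values contrived so that the two orderings disagree:

p = [1e16, 1.0, -1e16, 1.0]                # stand-ins for the products Ai*Bi

leftmost = ((p[0] + p[1]) + p[2]) + p[3]   # 1.0: the first 1.0 is absorbed by 1e16
tree     = (p[0] + p[1]) + (p[2] + p[3])   # 0.0: both 1.0s are absorbed
exact    = 2.0                             # the mathematically exact sum

print(leftmost, tree, exact)               # three different answers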
As before, some instruction sets define particular orders of evaluation, while some leave it undefined. Some also define an "infinite intermediate precision" form of inner product, where the sums are formed without rounding, and only rounded (and overflow detected) at the end.
[[Superaccumulators]] are a more general form of such arithmetic.
In the absence of such special support, floating point dot product instructions
with linear, e.g. leftmost, associativity are little faster than a succession of [[FMA]] or [[FMAC]] instructions.
= Alternative Inner Product Support =
Alternative instructions that can be used to synthesize dot products and other
possibly order dependent [[vector reductions]]
may be motivated by the issues discussed above.
For example, horizontal or pairwise add instructions, such as Intel SSSE3's PHADDW:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 packed vector output
A0+A1, A2+A3 ..., B0+B1, B2+B3, ...
Pairwise add could, of course, be done on a single vector operand, but then the
output would only be half-length.
Combining two vector operands in this manner produces a full-length vector output,
which is convenient.
Wiring-wise, if the A and B vector operands are interleaved
to allow [[vertical add]], this only requires wires sufficient to
route A1 to A0, and vice versa.
Note that horizontal add is "only" a combination of a vector interleave or shuffle operation,
and a vertical add. The interleave operation, however, wants to produce two vector outputs,
or else occupy two instructions (or produce a double width vector operand, or pack two singles into
double precision vector elements (but then what do you do for double precision inputs), or ...)
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
2 packed vector outputs
A0, B0, A2, B2 ...
A1, B1, A3, B3 ...
Vertical add
A0+A1, B0+B1, ...
Vertical, element by element, multiplication,
combined with sufficient horizontal adds to perform the sum,
builds a dot product.
Or, a single instruction may combine vertical multiplication and horizontal pairwise addition:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 result
A0*B0+A1*B1, A2*B2+A3*B3, ...
E.g. Intel SSSE3's PMADDUBSW.
The putative problem of having only half as many vector elements in the PMADDUBSW output is
handled by doubling the precision,
taking 8-bit elements as input and producing a packed vector of 16-bit elements as output.
Unfortunately, that technique does not work so well for floating point.
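To make the shape of the computation concrete, a small NumPy sketch: a vertical multiply followed by log2(N) rounds of PHADD-style pairwise adds, yielding a tree-order dot product (element-width issues ignored):

import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
B = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])

p = A * B                    # vertical, element-by-element multiply
while p.size > 1:
    p = p[0::2] + p[1::2]    # one pairwise horizontal-add round, halves p

print(p[0], A @ B)           # tree-order result vs NumPy's dot product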
= Vector Parallel versus Vector Sequential =
Defining dot product with leftmost associativity
for purposes of overflow and rounding,
is well suited to a [[vector sequential]] implementation.
[[Vector parallel]] implementations are better suited to horizontal adds,
and other parallel orders of summation.
Modern instruction sets such as SSE tend to have [[vector parallel]] implementations.
It is possible to mix vector parallel and vector sequential,
although a latency penalty is paid when transitioning from vector sequential to vector parallel.
Thus we begin sliding down [[The Slippery Slope of Vector Instruction Sets]]:
there are so many different possible definitions of the vector instructions,
with different evaluation orders,
suited to different microarchitectures.
It is easy to define the most basic vector instructions.
Completing the coverage of vector instructions is more challenging.
WebDAV for my wiki?
I want to have drawings, etc., for my http://comp-arch.net wiki.
But I have thrashed on how to get them. I've tried too many different drawing tools for wikis - all have weaknesses.
I am now leaning towards WebDAV. Indeed, I am leaning towards rebuilding my wiki around WebDAV.
WebDAV basically is a filesystem implemented across web protocols. If you are lucky, your favorite drawing program knows how to access a file stored on WebDAV. Or, you may have a tool that allows you to mount WebDAV as a filesystem on Windows or Linux or whatever, allowing almost any tool to access it.
The problem with WebDAV is that it IS a filesystem. Just exporting a filesystem across the web is not very interesting: listings of files, edited in whatever tools you want - PowerPoint, Visio, Inkscape, etc.
But, I think if I write my own WebDAV server, combined with a wiki, it can be better than this. When a file is stored to WebDAV, I can arrange to convert it (if I understand its format) into a format that can be directly viewed by most, if not all, web browsers. E.g. GIF, although I am more hopeful about a vector format like SVG. Plus, I can provide a link to the original format, .PPT or .VIS or whatever.
Not just for drawings.
Wiki text might be edited by a wiki editor, or potentially accessed by WebDAV and edited in your favorite text editor. My favorite would be EMACS. In org mode?
Heck, Word and PowerPoint and PDF and PS could be uploaded - although I'd prefer not - and viewed in the browser, if the server knows how.
---
I was planning on a WebDAV server with git or hg or bzr as the versioning backend.
All I really want from WebDAV is the ability to click on an edit link and, on your favorite OS, open whatever program is associated with that file's type.
E.g. for a .PPT open PowerPoint or OpenOffice; for a .VIS (or whatever Visio uses) open Visio (if there's any Open Source tool that understands Visio files, I want to hear about it). Ditto Inkscape. .TXT or .WIKI in Emacs or your favorite text editor.
And, similarly, on saving the file, have my own hooks translate it into some other format.
I'll probably want to provide some canned, in-a-web-page wiki text editor. In fact, I probably want to discourage text being written except in a standard wiki markup. (Although having more than one wiki markup has some attractions, e.g. because I crosspost between the wiki and this blog. At least I want reStructuredText, since I am in the habit of cross-posting wiki pages to comp.arch on Usenet. RST is much more readable than standard MediaWiki or TWiki markup.)
And maybe even a canned, in a web page, drawing editor.
But, my motivation for WebDAV is that I haven't found a drawing editor that makes me happy. Why reinvent the wheel?
The main problem with existing drawing editors is that their drawings cannot be displayed inline in web pages. But once I realized I could set my own put-time hooks in my own WebDAV server, that seemed more doable.
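To make the put-time hook idea concrete, here is a minimal sketch - not a real WebDAV server, just an HTTP PUT handler with a conversion hook. The storage directory is arbitrary and "any2svg" is a hypothetical converter; substitute whatever tool actually understands the uploaded format:

import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

STORE = Path("wiki-files")            # assumed storage directory
STORE.mkdir(exist_ok=True)

class PutHookHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        name = Path(self.path).name
        body = self.rfile.read(int(self.headers["Content-Length"]))
        original = STORE / name
        original.write_bytes(body)    # keep the original format for re-editing
        if original.suffix.lower() not in (".svg", ".gif", ".png"):
            # put-time hook: produce a browser-viewable rendition.
            # "any2svg" is hypothetical - swap in a real converter.
            subprocess.run(["any2svg", str(original),
                            str(original.with_suffix(".svg"))], check=False)
        self.send_response(201)       # Created
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), PutHookHandler).serve_forever()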
Saturday, October 16, 2010
Why Vectors Increase Frequency of Misalignment
= RISC handling of misaligned memory operations =
TBD: make this into a topic page of its own.
It was a truism during the so-called [[RISC revolution]] that support for misaligned memory operations was unnecessary.
In particular, the rather expensive [[memory alignment mux]] could be greatly reduced in cost,
and the complexity involved in handling memory operands that cross cacheline boundaries or page boundaries could be reduced.
Compilers could either arrange to use [[naturally aligned]] data,
or they could arrange to use instruction sequences such as that below for data that might not be aligned:
loading possibly misaligned data, address in rA, element width size bytes (a power of 2; little-endian assumed)
rAlo := rA & ~(size-1)            ; aligned address at or below rA
rAhi := rAlo + size               ; next aligned address
rDlo := load( rAlo )
rDhi := load( rAhi )
rAlobits := (rA & (size-1)) * 8   ; bit offset of the data within the low word
rAhibits := size*8 - rAlobits     ; where the high word's bytes land
rDlo := rDlo >> rAlobits          ; shift the low part down to bit 0
rDhi := rDhi << rAhibits          ; shift the high part up above it
rD := rDlo | rDhi
You should get the idea (even in the likely case that there are errors above).
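A quick way to convince yourself the sequence works: a Python model of the same steps, assuming little-endian byte order and 8-byte words (the register names are mine, not any real ISA's):

import struct

mem = bytes(range(32))                       # toy memory image

def load8(addr):                             # aligned 8-byte load
    return struct.unpack_from("<Q", mem, addr)[0]

def misaligned_load8(rA, size=8):
    rAlo = rA & ~(size - 1)                  # aligned address at or below rA
    rAhi = rAlo + size                       # next aligned address
    rAlobits = (rA & (size - 1)) * 8         # bit offset within the low word
    lo = load8(rAlo) >> rAlobits
    hi = (load8(rAhi) << (size * 8 - rAlobits)) & (2**64 - 1)
    return (lo | hi) & (2**64 - 1)

# matches a direct (misaligned) load at address 3
assert misaligned_load8(3) == struct.unpack_from("<Q", mem, 3)[0]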
Frequently the compiler can simplify this. It nearly always knows the size, which is nearly always a power of 2.
It often knows the alignment, e.g. of a field in a struct whose alignment is known.
But there will arise cases where these things are not known.
Since the general case of compiled code to perform such boundary crossing support is rather complex,
RISC machines often combine many of these operations into a pair of instructions:
rDlo := load_lo.size( rA )
rDhi := load_hi.size( rA )
rD := rDlo | rDhi
The final OR instruction to merge the low and high parts may be eliminated, e.g. by [[writing into instead of onto]] the destination register,
or by having an implicit merge.
rD := load_lo.size( rA )
rD := load_hi.size( rA )
TBD: describe the several ways of doing this compaction.
Note that machines with support for misaligned memory operations often use similar approaches,
except that they may first attempt an aligned load, for example, and then fall back to such a misaligned sequence.
(TBD: describe forms of misaligned support.)
= Why Vectors Increase Frequency of Misalignment =
Unfortunately, the advent of small vector or packed instruction sets,
such as Intel MMX, SSE, or PowerPC Altivec,
led to an increase in the frequency of misaligned memory operations.
Briefly: users of vectors often want to load subvectors or slices of other vectors.
E.g. given a big vector V[0...1023] in memory, the user may want to operate on 4 elements at a time.
V[0-3] would be aligned,
but V[1-4] would be misaligned.
The problem of motion estimation in digital video is a very good example of this.
== Motion Estimation and Misaligned Memory References ==
Digital video operates on sequences of frames, each frame being an MxN matrix of pixels
P[0,0] ... P[0,N-1]
... ... ...
P[M-1,0] ... P[M-1,N-1]
where (M,N) may be values such as CIF (288,352). Much larger videos are now common.
Digital video is usually compressed.
One common form of compression is to find parts of the image that seem to have moved between frames,
and then transmit, rather than the entire region that has moved, a motion vector,
and some correction values.
Motion compensation is the process of reconstructing the pixels from this information;
motion estimation is the process of finding such motion vectors and correction values.
Motion estimation is much more computationally intensive than motion compensation.
Most motion estimation algorithms involve a search.
They typically start with a block of pixels in Frame1
Frame1
P[R,C] ... P[R,C+3]
... ... ...
P[R+3,C] ... P[R+3,C+3]
and compare it, in succession, to offset blocks in Frame2
Frame2
P[R+dR,C+dC] ... P[R+dR,C+3+dC]
... ... ...
P[R+3+dR,C+dC] ... P[R+3+dR,C+3+dC]
Typically the comparisons are done a row at a time, and then summed over all rows in the blocks:
for each row i in 0..3
tmp1 := load( Frame1.P[R+i,C..C+3] )
tmp2 := load( Frame2.P[R+i+dR,C+dC..C+3+dC] )
sad := sad + sum( |tmp1 - tmp2| )
The exact search patterns vary by algorithm, but the offset values (dR,dC) nearly always include small distances such as 0, 1, and 2 pixels.
In fact, they often include sub-pixel granularity, such as 1/2 and 1/4 pixels.
It can therefore be seen that motion estimation naturally involves vectors that are not [[naturally aligned]].
Although the pixels may be naturally aligned, a group of 4 pixels P[1..4] is not.
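Here is a small NumPy sketch of the pattern, using a 1-D row for brevity. Every slice frame2[c+dc : c+dc+4] is one of the (usually misaligned) 4-element loads described above; the data is synthetic:

import numpy as np

frame1 = np.arange(64, dtype=np.int16)
frame2 = np.roll(frame1, 3)        # pretend the content moved by 3 pixels

c = 16
block = frame1[c:c + 4]            # reference block in Frame1

def sad(dc):                       # sum of absolute differences at offset dc
    return int(np.abs(block - frame2[c + dc : c + dc + 4]).sum())

best = min(range(-8, 9), key=sad)
print(best)                        # 3: the recovered motion vector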
There have been various attempts to mitigate the alignment issues.
For example, for motion estimation some forms of the [[MPSAD]] instruction may load double width slices, and extract all possible subslices.
These are a special form of [[convolution instructions]] or [[moving window instructions]].
However, this is like squeezing a toothpaste tube: when one source of vector misalignment has been cured, another crops up somewhere else.
Sunday, October 10, 2010
One of those days...
It's been one of those days (weekends) at Mudshoe Manor.
Yesterday my daughter's tablet PC, an HP tx1000 (more precisely, a tx1420us) refused to power on. So I set her up with an account on my tablet PC to do her weekend's homework. In the meantime, I unplugged and removed the battery on her tablet.
By the end of the day, her machine was working. So we transferred her homework to her machine. This morning, she made some revisions, fixing the glaring speech recognition errors. (Yes, Sophie has been using Vista's speech recognition.)
And when we broke for lunch, of course, her tablet PC broke again. This time hard, not fixed by removing power. The blue lightning icon flashes briefly, and then nothing.
So she moved back to my tablet PC to do the rest of her homework. Yes, she had to redo her revisions of the morning.
While she is using my tablet PC, I google hers. Ouch. Apparently the HP tx1000 series has well-known problems with overheating caused by its AMD CPU and, more notoriously, bad solder or wire bonds or something similar on its Nvidia graphics chip. Unfortunately, the simple fix, pressing down the JKL keys while booting, doesn't work.
Youtube is full of videos on how to repair this, usually amounting to opening the case, removing the motherboard, heating up the Nvidia GPU chip without heating anything else, pressing down, and then attaching a penny or a quarter to provide better contact to the heatsink - the heatsink apparently not having a bar dedicated to the GPU.
I've been reading about the Nvidia class action suits. Apparently now I am involved - maybe. Although the problem is notorious on the Internet, apparently HP has not officially acknowledged it.
So, this is where I'm at: do I open up the case to do the DIY fix, or do I wait on the phone to see if HP will deliver a fix pursuant to the Nvidia class action? I'll probably do the latter, and if that fails, the former.
In the meantime, my daughter is PC-less. Or, rather, I am PC-less while she uses my PC.
I'm writing this on our old, and hitherto almost useless, 2go PC - one of the original classmate PCs, early netbooks, from Intel's employee purchase program. Plugged it in and booted it for the first time in almost a year. After installing umpteen upgrades and security patches, I installed the Diamond BVU195 USB display adapter drivers, and Chrome - and now this formerly useless and slow PC is reborn as a web browser appliance. Seems fast enough for blogging.
It's quite amazing to see the 2goPC, with its 1024x600 display, drive 3 1920x1200 monitors.
I love USB display adapters!!!! I hope the Linux drivers are available.
It's like dominoes: one computer has a problem, and I have to fix another, and then another. My daughter's machine doesn't have much on it - we mainly are cloud based - but just in case, I am going to pull the drive. Unfortunately, I don't have another drive with enough space to transfer to. My own tablet PC is full. So, this finally forced me to find the disk drive upgrade for my tablet PC that I had been sitting on, and start transferring to that.
Off to find open source drive imaging software.
I hate computers.
Friday, October 08, 2010
MS License Lock Box - yeah, right
Last year I gave in and purchased a copy of Microsoft Office. "Gave in" because I mainly use Open Office. Gave in, because (a) there was a cheap deal on Office 2007 Home and Student - $80 for a license to install on three machines, (b) I had just realized that my Vista tablet had functional speech recognition, no need to buy expensive Dragon, and my hands were in spasm (much as I prefer open source, I am not aware of any free software that does good speech recognition, or handwriting for that matter), and (c) I have grown to depend on Microsoft OneNote. (This last is really minor, just a nice-to-have.)
$80 to install on three machines. Sounds good, eh?
Also, at the same time I purchased the optional "License Lock Box", which supposedly gave me the right to re-download the software for 3 years.
I installed the software immediately on my tablet. Did not get around to installing it on my wife and daughter's computer(s) for a few months. Was annoyed to find, when I tried, that the original download rights lasted only 60 days. I thought the License Lock Box was supposed to give me more time than that, but could not figure it out.
Wrote it off as yet more money poured down the drain into Microsoft.
Today, decided to try to set my daughter up. Her mother and I are Luddites: my daughter can't type. But her new school requires assignments on computer. Typing lessons are icumin in, but in the meantime, speech recognition. Which works better on Vista, using Microsoft Office.
Since I had long given up hope of using my Office 2007 license, I started purchasing Office 2010. Filled out all the forms, pressed the button "after which your credit card will be charged" - and was taken to a page that said "No such page found". On Microsoft's store website, using Microsoft Internet Explorer.
Sigh. Called Microsoft Store help. It appears that my transaction did not go through. But while I was on the line, asked about my Office 2007 license, and the "Licensing Lock Box".
The very helpful guy on the phone explained that, at that time, software sold from Microsoft's website was sold through a reseller, Digital River. And that the License Lock Box was their product, not Microsoft's. Aha!
Found that I could log in - with one of the umpteen Microsoft LiveId accounts.
Curiously, with Chrome and Firefox I can see my purchase history. But I can't see the "Download Now" button that I am supposed to use to re-download.
But with Internet Explorer, I get told they have no history of my purchase. Again, it is curious that such a Microsoft-related site works worse with Internet Explorer than with non-MS browsers. But then again, it doesn't work well with either.
So, annoyingly, it looks as if I *MIGHT* be able to exercise my license rights to 3 installs of Office 2007. But because of website brokenness, not before Monday. And Sophie has homework to do this weekend.
God! This is one of the big advantages of Open Source: if you have a problem, you can just re-download at any time. On any computer. No licensing hoops to jump through using information that can only be finagled out of the software reseller's broken website on weekdays.
---
I thought that buying this "License Lock Box" was a Microsoft approximation to always being able to download the latest software from the cloud. Maybe so - but it's a bad approximation. Apparently not even from Microsoft.
Although the helpful guy at the Microsoft store said that they allow infinite re-downloads.
Thursday, October 07, 2010
Accumulator ISA vs Uarch
wiki at: http://semipublic.comp-arch.net/wiki/Accumulator_ISA_vs_Uarch
older article on accumulators:
http://semipublic.comp-arch.net/wiki/Accumulators
I'm at home sick today, so I've got to be careful: last time I posted on comp.arch when I was home sick, I set the agenda for 4+ years of work.
First: 22+ gates may well fit in a pipestage. People are headed past 28 FO4, in an effort to tolerate device variation (the more gates / pipestage, the higher your yield).
Of course, when you do this you usually end up trying to cram several pipestages' worth of work into one - as, you might say, redundant bypassing does.
---
As for accumulator architectures and microarchitectures, love/hate.
First, we must distinguish instruction set architecture from microarchitecture.
= Accumulator ISA =
Accumulator ISAs save bits, and may lead to very small code. When we have the ability to reorganize instructions over long distances, accumulator ISAs don't need to hurt performance.
By the way, some might say that "2 operand" ISAs are accumulator based:
rD += rS
Myself, although that is more accumulator-like than three-operand (I hate these terms, but must use them)
rD := rS1 + rS2
I tend to be a stickler, and say that either there is a single accumulator
acc += rS
or maybe a small number of accumulators, acc0-3 versus r0-31
acc0-3 += rS0-31
== Accumulator versus Stack ==
By the way, for similar instruction size issues, I sometimes look fondly on stack ISAs. Even fewer bits than accumulators.
= Accumulator Microarchitecture =
Now, let's transition into microarchitecture. I call it an accumulator microarchitecture when the accumulator (or accumulators) are "close to" one or more ALUs.
I.e. you may have a small number, say 4, of accumulators near to every individual ALU. E.g. you might have 3 ALUs, with 4 accumulators each.
Or you may have a small number of accumulators shared between a group of ALUs.
Physically, the term accumulator may be associated with a different circuit design: e.g. the accumulators might be flops, whereas the registers might be in an SRAM array.
I don't call it an accumulator microarchitecture when you have, say, 2 clusters of 4 ALUs, each tightly bound to 16 registers. I call that a partitioned register file. But I might be willing to call it an accumulator if there were only 4 such tightly bound registers next to the ALUs. Obviously, this is a matter of degree.
I might call it an accumulator architecture if there were 2 such ALU clusters, each with 4 accumulator registers, and a big shared pool of 64 registers equally accessible to all. But I would require that there
be instructions of the form
cluster.accum_i += global_reg_j
If the global registers could not be used as operands of at least the usual ADD instruction, but were restricted to moving to and from the accumulators, I would call it a two level register file, not an accumulator microarchitecture.
== OOO implementation of Accumulator ISA ==
Accumulator ISAs implemented on accumulator microarchitectures are natural for in-order machines.
On an out-of-order machine, it is trivial to rename the accumulators of the ISA to a single PRF. In which case, the hypothetical advantages of the accumulator ISA are lost.
== OOO implementation of Accumulator Microarchitecture ==
Conversely, it is straightforward to implement a "bypass cache" for a flat register file instruction set, taking advantage of the fact that a value need not be written into the main register file before it is reused. A bypass cache is a generalization of generic bypassing: a bypass cache adds state, and is not just restricted to bypassing from operands in flight.
The bypass cache state is very much like an accumulator microarchitecture. With at least one important difference:
If the bypass cache is CAM indexed by the physical register number in the main RF, then that is an important difference. A second important difference is whether the bypass cache can miss - whether a value you expect to receive from the bypasses or the bypass cache may not be there when you expect it to be, so that you have to fetch it from the main RF. I sometimes call this a priori vs a posteriori caching.
If, however, there is separate renaming for accumulator registers and main registers, then you more exactly have an OOO implementation of, say, a flat RF ISA on an accumulator uarch. But you have all sorts of questions about policy: when do you make a copy of a main RF register in an accumulator, and vice versa.
(Hey, here's an idea: in the scheduler, CAM on main RF regnums. But also broadcast when a main RF regnum has been moved to an accumulator. Issue the instruction with non-CAM-indexed accum regnums... How to handle replacement? Probably LRU, modified by overwriting - since many values are only read once.)
== OOO implementation of Accumulator ISA on Accumulator Uarch ==
What about combining the two? Well, obviously, you could rename from ISA accum into flat RF, and then into uarch accums. But that rather defeats the purpose.
We might rename accums and main regs separately. The ISA accums guide replacement in the physical accums.
Issue: what if ISA accums and physical accums are not the same in number?
Here's an observation: one of the biggest advantages of accumulator ISAs is that they tend to reuse the accumulators quickly. I.e. they tell us when values are dead, and hence can be more quickly recycled. You might get the same effect by a "kill register" bit.
= Summary =
Accumulators, whether ISA or uarch, are part of our tool kit.
I do not think that there is a uniform way of saying one is better than another. Certainly, at the moment flat RFs preponderate.
Much of the benefit of accumulators can be obtained by bypass caching.
I would hesitate to add true accumulators to a new ISA unless (a) codesize really matters, or (b) it is an embedded system, with the ability to change uarch and ISA every generation.
(But then, that's what they always say.)
On the other hand, I am scared of implementing accumulators, whether in ISA or uarch.
== Accumulators for Extended Precision ==
I forgot to mention one of the biggest advantages of accumulators: extended intermediate precision.
Many DSPs have, for example, 24 bit accumulators to add chains of 16 bit numbers.
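For instance, a sketch of that DSP pattern, with Python ints modeling a 24-bit saturating accumulator fed 16-bit samples (the widths follow the text; the saturation policy is my assumption):

ACC_MIN, ACC_MAX = -2**23, 2**23 - 1     # 24-bit accumulator range

def acc_sum(samples):                    # samples: 16-bit values
    acc = 0
    for s in samples:
        acc = max(ACC_MIN, min(ACC_MAX, acc + s))   # saturate at 24 bits
    return acc

# 256 maximal 16-bit samples overflow 16 bits immediately,
# but fit comfortably in 24 bits
print(acc_sum([32767] * 256))            # 8388352, no saturation needed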
One might consider FMA or FMAC, floating point multiply-add without intermediate rounding, to be a hidden internal accumulator. Except that it isn't - it flows. If you exposed that intermediate value...
I keep waiting for a nice formulation of exact, infinite precision, vector sum reductions, dot products, etc. Trivially, that can be done via a [[superaccumulator]] - but although I like superaccumulators, they are pretty expensive. What, 1K or 2K bits?
An ISA-level accumulator architecture of the form
superacc := 0
for all i superacc += a[i]
is nice in that you could imagine mapping it to the infinite internal precision versions of 4-way dot product, etc., that one sees in some instruction sets. (E.g. Microunity)
I'd like to see lower-cost but still infinite-precision ways of doing such FP vector reductions.
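In the meantime, software can get the exact answer. A small illustration using Python's math.fsum (Shewchuk's algorithm) as a stand-in for a superaccumulator:

import math

a = [1e100, 1.0, -1e100, 1.0]

naive = 0.0
for x in a:
    naive += x            # rounds at every step; the first 1.0 is lost

print(naive)              # 1.0
print(math.fsum(a))       # 2.0 - exact, as a superaccumulator would give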