The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, March 01, 2011

Stream Processors, specifically Stream Descriptor Processors

{{Terminology Term}}

= [[Stream processor]] is an overloaded term =

The term [[stream processor]] is fashionable and overloaded.
It often means just a processor that is "optimized" for a [[stream workload]].
Where "optimized" may mean just "has little cache".
(E.g. the use of the term "stream processor" in [[GPU]]s such as tyhose by AMD/ATI and Nvidia.)

= What is a [[stream]]? =

A [[stream workload]] operates on sets of data,
operating on one or a few data elements
before moving on,
not to return to those data elements for a long time.
Because of this access pattern,
[[LRU replacement]] is often ineffective, and may even be pessimal,
for [[stream workload]]s.
Keeping cache entries around that are not going to be reused is wasteful.
Sometimes a stream is restarted, in which case an MRU policy that keeps the head of the stream in cache is better than LRU.
Sometimes there is no reuse at all.

= [[Stream descriptor processor]] =

By [[stream processor]], I mean processors with [[stream descriptor]]s
- processors that have specific features, whether ISA or microarchitecture,
that support streaming.
(I would prefer to reserve this term for ISA features, but [[many good ISA features become microarchitecture]]].)
Possibly should call them [[stream descriptor processor]]s,
to distinguish them from the overloaded term [[stream processor]].

= [[Stream descriptor]] =

A [[stream descriptor]] is a description, in a register or memory data structure, that describes a memory access pattern.

For example, a [[1D stream descriptor]] might include
* data item description - size, type
* base
* stride
* length
* current position

A [[2D stream descriptor]] might include
* data item description - size, type
* base
* horizontal stride, length
* vertical stride, length
* current position, row and column

Obviously extendable to N dimensions.

In addition to arrays, there may be other types of stream descriptor

A scatter/gather descriptor
* a 1D descriptor of a vector of addresses
* with an indication that each addressshould be dereferenced

A linked list descriptor
* Base
* a skeleton [[structure descriptor]]
** payload to be accessed while traversing the list - offset, type
** position within a list element of pointer to next element

A tree or B-tree descriptor
* Base
* a skeleton [[structure descriptor]]
** payload
** key
** pointers

For the case of a tree or B-tree descriptor,
the access pattern may
(a) traverse all of a tree or subtree
or (b) traverse only a path seeking an individual element

= Why [[stream processor]]s? =

There are two main motivations for stream processors:
* memory access patterns
* data manipulation overhead
I think there is an additional property, probably nort considered important by most, but [[IMHO]] quite important:
* decoupling and ease of programming

; Memory Access Patterns

Support for streaming memory access patterns is probably the most important.
As mentioned above, many [[stream workload]]s do not fit in the cache.
But that could be handled by a simple [[cache bypass hint]] instruction.

Of more long term interest is the possibility that [[stream descriptor]]s, etc., may allow
streams that fit in the cache to be managed.
As cache sizes grow, [[memory data structures]] that did not fit in an old, smaller, cache, may fit in a new, larger, cacge.
E.g. a 1000x1000x32b array does not fit in a 32K [[L1$]], but may fit in an 8M [[L3$]].
Of course, this cannot be managed in isolation:
a workload that accessed only one 2MB 2D array might fit in an 8MB cache, but one that accessed 4 such would not.
Nevertheless, explicitly indicating the data structures provides more information to whoever is trying to manage the cache,
whether hardware or software.
    The [[stream descriptor]] may have a bit associated with it, either inside or [[non-adjacent]], that indicates which cache levels to use.
Furthermore, even in the absence of cache control in the stream descriptor, the stream descriptor provides guidance for [[cache prefetch]].
Guidance that may be more accurate that [[SW prefetch using cache control instructions]].
A stream descriptor may be associated with a prefetch control data structure, that seeks always to avoid delays accessing a stream,
without excessive prefetch.

; Data Manipulation Overhead

Users of [[SIMD packed vector]] instruction sets quickly come to realize
that the actual memory access and computation is only a small part of the task of programming code to use such instructions.
Often more than half of the instructions in such kernels are
data manipulation overhead
* unpacking 8 bit numbers into 16 bit numbers, or, worse, even fancier unpackings such as 5:6:5
* transposng rows and columns
* [[chunky versus planar]] data, i.e. [[SOA versus AOS]]

Explicit support for streams via [[stream descriptor]]s can eliminate much of this overhead, avoiding explicit instructions.

[[The CSI Multimedia Architecture]] goes further: not only does it hide the data packing and unpacking,
but it also seeks to hide the [[SIMD packed vector]] nature of the hardware datapath.
Instead of the programmer loading from a stream into a register of N bits, operating, and then storing,
this paper proposes complex streaming instructions that are implemented N bits at a time
in such a way that the code does not need to be changed when the microarchitecture changes the width of N.

Others have attempted to go even further, providing ways of expressing streaming kernels using simple instructions
that automatically get mapped onto a fixed length datapath.

; Decoupling and ease of programming

Streams are the equivalent of UNIX pipes.

If you had an architecture that allowed [[lightweight thread forking]],
some code could be written by connecting two library function kernels together by a stream
- instead of
* writing two separate loops that store the intermediate value to memory
* or merging the two operations into a single loop for a single pass

PROBLEM: allocating buffer space for such inter-thread streams. It may be necessary to spill to memory in some situations.

= If streams are such a good idea, why aren;t they popular yet? =


Proper [[stream descriptor processor]]s have not taken off by 2010.

Instead, [[SIMD packed vector]] has dominated.

= See Also =

;[[The CSI Multimedia Architecture]], IEEE T. VLSI, Jan 2005.
: [[Stream descriptor]]s, plus optimizations to allow implicit SIMDification.
* [[iWarp]] - stream descriptors bound to registers
* [[Wm]] - William Wulf's processor, firs instance I know of stream descriptors bound to register

No comments: