The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, March 31, 2011

Hardware data structures

Computer software education (at least, if not the whole field) was greatly influenced by Niklaus Wirth's 1976 book
"Algorithms + Data Structures = Programs".
This popularized the importance of a programmer or software practitioner having a toolkit of datastructures,
and algorithms that employ or operate on such datastructures.

The situation in computer hardware is somewhat inchoate, less well defined.
Nevertheless the same [[hardware data structures]]
crop up over and over again in different applications.

I am not really sure what "algorithms" would be in the hardware version of "Algorithms + Data Structures = Programs".
Nor, for that matter, what "programs" would be.
Modules? Devices?
Sure, hardware implements algorithms, but it is hard to tease apart the concept of algorithms and data structures.
Moderately hard for software; especially so for hardware.

However, we should always keep in mind that hardware "algorithms" are just like software algorithms,
and data structures, except that implementation in hardware has diofferent constraint:
* size is limited - software often treats program size and data memory suze as infinite
* communications is restricted: whereas in software any module can talk to any other module or object, in hardware only a limited number of nearby communications are cheap.

== List of [[hardware data structures]] ==
* [[Arrays]]
** [[RAMs]] versus [[CAMs]]
** [[Encoded CAMs versus Decoded CAMs]]
* [[CAM]]s
** [[Equality match CAM]]s
** [[Priority CAM]]s
** [[Greater than CAM]]s
** [[Range CAM]]s
** [[Bitmask CAM]]s, a more general case of [[Ternary CAM]]s
* [[Pipelines]] - I'm not sure that these should be considered a [[hardware data structure]]
** [[Stalling pipelines]]
** [[Stall-free pipelines]]
*** [[Draining pipelines]]
*** [[Replay pipelines]]
** [[Bubble collapsing pipelines]]
* [[Queues and Buffers]]
** [[Collapsing queues]]
* [[Circular arrays]]
* [[FIFOs]]
** [[True data movement FIFOs versus array-based circular buffers]]
** [[Synchronizers (asynchronous logic)]]
*** [[Synchronization and clock domain crossing FIFOs]]
**** [[AMD synchronization FIFO]]
**** [[Intel Bubble Generating FIFO (BGF)]]
** [[Alignment issues for queues, buffers, and pipelines]]
** [[Holes in queues, buffers, and pipelines]]
* [[Fully associative in-array CAM tag matching]]
* [[Separate tag and data arrays]]
* [[N-way associative tag matching outside array]]
** [[index sequential]] - tag match, and then index data array
** [[index parallel]] - read out tags and data in parallel, then do a late mux
** [[in-array tag matching for N-way associative arrays]]
* [[Bloom Filters]]
** [[M of N Bloom filter]]
** [[N x 1 of M Bloom filter]]
** [[Pipelined Bloom filter]]
** [[Counted Bloom filter]]
* [[Arrays of 2-bit saturating counters]] crop up again and again, in predictors and elsewhere
* [[Permutation bit matrix]]es
* Schedulers
** [[Pickers versus packers]]
** [[Prioritizers]]
*** [[Carry chain prioritizer]]
**** [[Pick most extreme on one side]] - [[pick oldest (or youngest)]]
**** [[Pick most extreme on both sides]] - [[pick oldest and youngest]]
*** [[CAM matching prioritizer]]
**** [[Pick N-th]] - [[pick N-th oldest]]
*** [[Prioritizers after permutation bit matrix]]

In the aftermath of [[Mead and Conway]]'s "VLSI revolution"
there was a flurry of work on arbitrary hardware datastructures,
often 1:1 corresponding to software datastructures.
These include
* [[hardware sorting networks]]
* [[hardware heap sorting structures]]
* [[true hardware stack]]s
* [[true hardware FIFO]]s
Unfortunately, many of these have not proved practical in most computer applications.

== Duality between Hardware and Software Data Structures ==

We have long observed a regular pattern of [[duality between hardware and software data structures]].

This can be observed
* When implementing a hardware microarchitecture inside a software simulator.
* When converting an algorithm implemented in software into hardware
Oftentimes the most direct implementatin of a hardware data structure in software is stupidly slow, but there are regular patterns of good equivalences.

Such duals include

Hardware Software Comments

Equality match CAM

Hash table

Range or Greater Than CAM

No good SW equivalent


[[Lookup table (LUT)]]

[[HW]] can handle irregular lookup tables:
different granularity of LUTs for different ranges.
[[SW]] must

The duality is useful,
not just for implementing HW in SW or vice versa,
but also because trends in VLSI scaling have meant, more and more, that the "traditional" hardware datastructures
must be foregone.
E.g. limiting the number of bits on a bit-line in a [[RAM array]] or a [[CAM match bitline]]
often means that that hash table like approaches should be used.

== Dual or equivalent implementations for [[hardware data structures]] ==

* [[Arrays are equivalent to multiple flopped registers outputting to a MUX]]

== Other [[hardware data structures]] ==

The ShuffleBox: a rectangular array of bits that can store, XOR and rotate all bits in all directions.
* Hala A. Farouk
* An efficient hardware data structure for cryptographic applications
* http://www.universitypress.org.uk/journals/cc/cc-10.pdf

=== Shifters ===

Various shift implementations - [[barrel shifter]]s,
[[cascaded multistage shifter]]s
- "feel" to me like the hardware instantiation of an algorithm more than a data structure,
since they are stateless.

Perhaps I should limit the concept of a [[hardware data structure]] to something that can store state for retrieval later.

= References =
* http://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_%3D_Programs
* Wirth, Niklaus (1976) (in English). Algorithms + Data Structures = Programs. Prentice Hall. ISBN 978-0-13-022418-7. 0130224189.

Intel BGF mentioned in:
* ftp://download.intel.com/design/processor/datashts/320835.pdf

* http://en.wikipedia.org/wiki/Mead_%26_Conway_revolution

Wednesday, March 02, 2011

Dynamic dead code elimination and hardware futures

= [[Dynamic dead code elimination]] =

It has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution.

There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of [[hardware futures]].

; [[Semi-static optimization]]

Dynamic dead code elimination may be done as a [[semi-static optimization]] in an [[optimized trace cache]].
Basically, one applies classic compiler algorithms to the code to be cached,
augmented by reachability predicates.

Or, another classic way of doing hardware optimization is to examine the post-retirement instruction stream, eliminate dead code, and install that in the [[optimized trace cache]].

; [[Rename time optimization]]

Dynamic dead code elimination may also be done in the [[renamer]] pipestage. The basic idea is, instead of emitting uops as the are decoded,one assembles blocks of uops in the renamer.
Essentially, large expressions, potentially with multiple inputs and outputs.
One does not emit such a block of uops from the renamer to execution until a usage of the block is detected.

Typically, a store that may be visible to another processor constitutes a usage that will cause uops to be emitted.
(Although certain stores can be elided, e.g. push/pop/push [[stack optimizations]] and [[temporary variable optimizations]].)

Similarly, the evaluation of a branch condition may cause uops to be emitted. But one only need emit the subset of the buffered uops necessary to evaluate the branch.

Exceptions and other situations whether the entire program state may cause uops to be emitted.

And overflow of the renamer buffer.

If the output of an expression buffered in the renamer is overwritten, that part of the expression can be eliminated as dead code.

; [[Dataflow optimization]]

[[Dynamic dead code elimination]] can also be done in the OOO dataflow scheduling part of the machine.
Essentially, one assembles the dataflow graph, but one does not necessarily start execution as soon as inputs are ready.

The renamer adds an additional [[WAW]] arc to the dataflow graph, for a new uop that overwrites an old uop's output.
If that old uop has no [[RAW]] dependencies, then it can "fire", cancelling itself,
and signalling that cancellation to its own input dependencies in turn.

This is essentially [[reverse dataflow execution]].
It is straightforward, although it does require hardware that is not normally present.
It is easiest to accomplish with a [[bit matrix]] scheduler, where the [[input bitvector]] becomes the [[output bitvector]] of a cancelled uop.

It is not at all clear how useful this form of dataflow dead code elimination is:
it is buildable, but how much does it help?
It is probably most useful if it can be remembered in an [[optimized trace cache]].

= [[Hardware futures]] =

[[Hardware futures]] are the dual of [[dynamic dead code elimination]].
Instead of eliminating code that does not need to be executed,
it delays execution and evaluation of code until it is known to be needed.
This delay naturally accomplishes [[dynamic dead code elimination]].
It also provides increased scope for [[hardware CSE]],
and other optimizations.
But it potentially increases latency.

As for [[dynamic dead code elimination]],
it is fairly easy to see how to do this at the renamer:
have the renamer accumulate expressions.
Only release an expression to be evaluated when a usage that demands evaluation is detected.

We can go further, and avoid the buffering at the renamer becoming an issue by emitting such buffered instructions for a hardware future
into a sort of side car storage, rather like that of Akkary's [[CFP]].
When evaluation is demanded, we issue a token to cause insertion of those instructions.

Issue: are the buffered instructions pre or post rename? It is easiest to buffer them post rename,
using dataflow single assignment registers and memory locations.

Also as above, this can be deferred into the OOO dataflow scheduler.
But, again, that is using a fairly critical resource.

To make hardware futures less dynamic, more [[semi-static]] at the [[optimized trace cache]]
involves dividing the future into two parts:
the first part captures any live inputs,
that must be captured lest they change;
the second part does the actual evaluation on demand.
Typically memory load values must be live inputs, as well as registers,
and the future is constrained from doing stores that may be visible to other threads.

I.e. the hardware future is constrained to be a chunk of code that has no internal memory references.
Or at least only sharply limited internal memory references.

It is unclear how useful such a future would be. It would have to be saving considerable register based computation
- complicated instructions such as transcendentals - or be a complicated calculation on a small amount of memory.

Such hardware futures might be much more useful if we were released from the constraints of the [[Von Neumann memory model]],
e.g. if we had classic dataflow style [[single assignment memory]].
But in 2011 that does not seem likely any time soon.

Using Redundancies to Find Errors

Cleaning out old journals from my bookshelf.

Came across

Yichen Xie, Dawson Engler, "Using Redundancies to Find Errors," IEEE Transactions on Software Engineering, pp. 915-928, October, 2003

(actually, a slightly different version in Software Engineering Notes / SIGSOFT 2002/FSE 10.

Very interesting paper: used analysis tools that detected dead code, redundant conditionals, idempotent operations such as x&x to detect bugs.

Causes me to wonder about applying similar techniques to look for hardware/RTL/HDL bugs, e.g. in Verilog or VHDL.

For that matter, it caused me to think about similar hardware techniques. For example, it has been shown that often more than 1/3 of all computations, data movement, etc., are dynamically dead - they will never be used by the program, on its current path of execution. There have been hardware proposals that delay computations, and elide some such dynamically dead code. Something like the flip side of hardware futures.

Now, such dynamic dead code detection *might* help locate bugs - but it suffers because it is dynamic, bound to the current path of execution. Whereas Xie's work looks at reachable code, including paths that may not be currently taken but which might use the values that are dead on the current path.


Done in xgcc, so code should be publicly available.

I wish I had had such tools in my last project.

Tuesday, March 01, 2011

Stream Processors, specifically Stream Descriptor Processors

{{Terminology Term}}

= [[Stream processor]] is an overloaded term =

The term [[stream processor]] is fashionable and overloaded.
It often means just a processor that is "optimized" for a [[stream workload]].
Where "optimized" may mean just "has little cache".
(E.g. the use of the term "stream processor" in [[GPU]]s such as tyhose by AMD/ATI and Nvidia.)

= What is a [[stream]]? =

A [[stream workload]] operates on sets of data,
operating on one or a few data elements
before moving on,
not to return to those data elements for a long time.
Because of this access pattern,
[[LRU replacement]] is often ineffective, and may even be pessimal,
for [[stream workload]]s.
Keeping cache entries around that are not going to be reused is wasteful.
Sometimes a stream is restarted, in which case an MRU policy that keeps the head of the stream in cache is better than LRU.
Sometimes there is no reuse at all.

= [[Stream descriptor processor]] =

By [[stream processor]], I mean processors with [[stream descriptor]]s
- processors that have specific features, whether ISA or microarchitecture,
that support streaming.
(I would prefer to reserve this term for ISA features, but [[many good ISA features become microarchitecture]]].)
Possibly should call them [[stream descriptor processor]]s,
to distinguish them from the overloaded term [[stream processor]].

= [[Stream descriptor]] =

A [[stream descriptor]] is a description, in a register or memory data structure, that describes a memory access pattern.

For example, a [[1D stream descriptor]] might include
* data item description - size, type
* base
* stride
* length
* current position

A [[2D stream descriptor]] might include
* data item description - size, type
* base
* horizontal stride, length
* vertical stride, length
* current position, row and column

Obviously extendable to N dimensions.

In addition to arrays, there may be other types of stream descriptor

A scatter/gather descriptor
* a 1D descriptor of a vector of addresses
* with an indication that each addressshould be dereferenced

A linked list descriptor
* Base
* a skeleton [[structure descriptor]]
** payload to be accessed while traversing the list - offset, type
** position within a list element of pointer to next element

A tree or B-tree descriptor
* Base
* a skeleton [[structure descriptor]]
** payload
** key
** pointers

For the case of a tree or B-tree descriptor,
the access pattern may
(a) traverse all of a tree or subtree
or (b) traverse only a path seeking an individual element

= Why [[stream processor]]s? =

There are two main motivations for stream processors:
* memory access patterns
* data manipulation overhead
I think there is an additional property, probably nort considered important by most, but [[IMHO]] quite important:
* decoupling and ease of programming

; Memory Access Patterns

Support for streaming memory access patterns is probably the most important.
As mentioned above, many [[stream workload]]s do not fit in the cache.
But that could be handled by a simple [[cache bypass hint]] instruction.

Of more long term interest is the possibility that [[stream descriptor]]s, etc., may allow
streams that fit in the cache to be managed.
As cache sizes grow, [[memory data structures]] that did not fit in an old, smaller, cache, may fit in a new, larger, cacge.
E.g. a 1000x1000x32b array does not fit in a 32K [[L1$]], but may fit in an 8M [[L3$]].
Of course, this cannot be managed in isolation:
a workload that accessed only one 2MB 2D array might fit in an 8MB cache, but one that accessed 4 such would not.
Nevertheless, explicitly indicating the data structures provides more information to whoever is trying to manage the cache,
whether hardware or software.
    The [[stream descriptor]] may have a bit associated with it, either inside or [[non-adjacent]], that indicates which cache levels to use.
Furthermore, even in the absence of cache control in the stream descriptor, the stream descriptor provides guidance for [[cache prefetch]].
Guidance that may be more accurate that [[SW prefetch using cache control instructions]].
A stream descriptor may be associated with a prefetch control data structure, that seeks always to avoid delays accessing a stream,
without excessive prefetch.

; Data Manipulation Overhead

Users of [[SIMD packed vector]] instruction sets quickly come to realize
that the actual memory access and computation is only a small part of the task of programming code to use such instructions.
Often more than half of the instructions in such kernels are
data manipulation overhead
* unpacking 8 bit numbers into 16 bit numbers, or, worse, even fancier unpackings such as 5:6:5
* transposng rows and columns
* [[chunky versus planar]] data, i.e. [[SOA versus AOS]]

Explicit support for streams via [[stream descriptor]]s can eliminate much of this overhead, avoiding explicit instructions.

[[The CSI Multimedia Architecture]] goes further: not only does it hide the data packing and unpacking,
but it also seeks to hide the [[SIMD packed vector]] nature of the hardware datapath.
Instead of the programmer loading from a stream into a register of N bits, operating, and then storing,
this paper proposes complex streaming instructions that are implemented N bits at a time
in such a way that the code does not need to be changed when the microarchitecture changes the width of N.

Others have attempted to go even further, providing ways of expressing streaming kernels using simple instructions
that automatically get mapped onto a fixed length datapath.

; Decoupling and ease of programming

Streams are the equivalent of UNIX pipes.

If you had an architecture that allowed [[lightweight thread forking]],
some code could be written by connecting two library function kernels together by a stream
- instead of
* writing two separate loops that store the intermediate value to memory
* or merging the two operations into a single loop for a single pass

PROBLEM: allocating buffer space for such inter-thread streams. It may be necessary to spill to memory in some situations.

= If streams are such a good idea, why aren;t they popular yet? =


Proper [[stream descriptor processor]]s have not taken off by 2010.

Instead, [[SIMD packed vector]] has dominated.

= See Also =

;[[The CSI Multimedia Architecture]], IEEE T. VLSI, Jan 2005.
: [[Stream descriptor]]s, plus optimizations to allow implicit SIMDification.
* [[iWarp]] - stream descriptors bound to registers
* [[Wm]] - William Wulf's processor, firs instance I know of stream descriptors bound to register