The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, September 21, 2010


= Packed =

For want of a better term, I will use the phrase [[Packed Operations]] to refer to instruction set extensions such as Intel MMX, SSE, PowerPC Altivec, etc.:
typically operating on 64 or 128 bit data (although people did propose [[32 bit MMX]], and 256 bit AVX is coming soon in 2010).

Typically operating on
* 8x8b or 16x8b
* 4x16b or 8x16b
* 2x32b or 4x32b

Typically operating SIMD-style inside ALU operations,
but typically lacking scatter/gather memory operations.

= Packed vs Vector Lanes =

I want to contrast this with microarchitectures that have parallel vector lanes.
(NOT [[vector sequential]] microarchitectures.)

Packed Operations are fundamentally scalar pipelines. They just happen to have wide scalar datapaths, e.g. 64bits.
Since there are typically no 128 bit scalar datapaths, they have evolved beyond scalar;
but typically the datapath is designed as if it is a wide ALU datapath.

Parallel vector lane microarchitectures, on the other hand, are typically formed ourt of multiple indeoendent ALUs.

Packed ALUs are typically a single ALU, that is subdivided to perform independent packed operations.
E.g. one might take a 64-bit wide adder, and add special buffer bits at every byte boundary.
But controlling the values of these buffer bits, we can control whether carry will propagate across 8b, 16b, or 32b boundaries.
I.e. a 64 bit wide ALU with such byte boundary buffer bits
is really a 72 bit wide ALU,
capable of performing 8x8b, 4x16b, and 2x32b, as well as 64 bit addition.
(For that matter, odd sizes such as 24bit arithmetic can also be achieved.)

Similarly for 128 bit packed datapaths. Although nobody probably wants to build a 128 bit wide carry propagate adder.

Whereas in a vector lane microarchitecture, the ALUs are independent.
E.g, in 512 bit wide vector machine with 16 32b lanes,
there are 16 32b wide adders.
In such a machine crossing vector lanes is difficult.

Of course there is a spectrum of possibilities.
For example, pairs of even and odd vector lanes may be combined to perform double precision arithmetic.
Or, one may have vector lanes, where the operations within the lanes are 128b packed operations.

(128b packed is particularly pleasant as a lane width, since it supports the common 4x32b vectors, as well as 2x64b complex.)

Even on a packed datapath, certain operations that move data between bits are difficult.
E.g. shifts require special handling of buffer bits.
It is more difficult to partition a 64x64bit integer multiplier into 32b*32b multipliers,
than it is to simply add buffer bits for add - nevertheless, it can be done.

(Note that a 64b*64b integer multiplier can be partitioned into 4 32b*32b multipliers - roughly what is required for complex multiplication or an FFT 2-butterfly.)

Floating point is even more difficult to partition. The multiplier array for an FMA can be shared between 32b and 64b.
The adder logic is more challenging.

Neverthelerss: considerable sharing is possiblwe on a packed datapath.
But at some point one graduates to separate ALUs for vector lanes.

In "separate ALUs", the operative term is "separated". Physically separated ALUs are part of a vector lane microarchitecture.

= Instruction Set Design Consequences =

In a packed architecture, one tends to have N-bits, subdivided into more or viewer smaller pieces.

In vector architectures, one tends to have vector elements of a given width, say VL * EW, where EW is the element width.

One might then have narrower operations that apply to vector elements,
e.g. a vector of VL 32b floats or a vector of VL 64b floats.

(Although one might have packed operations on the lane width vector elements.)

There may be operations that load a vector of 8, 16, or 32 bit elements,
packed in memory, into unpacked 64 bit elements within a vector register.
Whereas on a packed instruction set, one probably just loads the values without unpacking them.

Consider, e.g. 16b wide floating point.
Many machines expand it into 32b floating point in registers.
But one can imagine ALUs that operate on 16b FP in registers.

I.e. packed operations are about using all of the bits of the register datapath.
Whereas vector lane operations are about vector operations on a limited number of datatypes.

Wifi disppears from HP TX 1420 tablet PC

My wife and daughter's HP TX 1420 Tablet PC (originally my wife's, then my daughter's) suddenly behaves as if it no longer has wifi builtin.

"Suddenly" may not be true - it has probably been attached to wired ethernet for a few months. I don't know when wifi disappeared. But it has gone now.

As the link shows, this is apparently a well known problem with HP machines.

Tuesday, September 07, 2010

Varieties of Cache - CompArch

Varieties of Cache - CompArch: "Varieties of Cache"

The basic principle of caching is to use a small, fast structure (typically an [[Array]])
improve the performance of a large slow, structure,
hopefully to approximate a large, fast, structure, without the expense.

However, the definition of "fast" may vary. And there may be other reasons to use a cache.

The most familiar caches are [[latency caches]].
E.g. an L1 data cache that may take 4 cycles to access, an L2 20 cycles, and main memory 200 cyckes.

Another useful cache is a [[bandwidth cache]], where the cache provides more bandwidth than the structure is is a cache for.
This is quite common - often latency and bandwidth caches are combined.
However, the use of caches to improve bandwidth sometimes leaves customers dispirited,
such as those on the [[High bandwidth Computing]] mailing list.

Another useful cache has many array read and write ports on the small cache, and fewer on the large structure. E.g. [[array port reduction]].
This is really a particular form of [[bandwidth cache]].

A cache is a fairly good [[read combining]] or [[write combining]] mechanism.

A [[power cache]] attempts to reduce power by keeping frequently accessed data items in the small structure.
this may be done even without an improvement in latency or bandwidth.

Finally(?), a reliability cache may keep certain items expected to have relaibility issues in the cache.
Particularly if reliability is related to frequency of access.

TBD: are there any other varieties of cache?

Array Port Reduction - CompArch

Array Port Reduction - CompArch: "Array Port Reduction"

AFAIK "RF Port Reduction" was invented by Dave Papworth on the Intel P6 circa 1991. At that time called "ROB Read Port Reduction". Based on observation that most uops' inputs were captured by the RS, while the ROB read ports were actually

* [[Register File Port Reduction]]
* [[Register Renamer Port Reduction]]

After the above, I note some generic trends:

Say that you have N operations, each with L inputs and K outputs. How do you avoid having to access
the array containing such inputs and outputs with N*L read ports and N*K write ports?

I observe that these techniques apply, not just to the RF and RR arrays mentioned above, bit also to other arrays:
* Generic Cache Arrays
** e.g. banking as a form of [[pseudo dual porting]]
* RF arrays
* [[Store Buffers]]
* [[Branch Predictor Arrays]], in particular [[BTB Arrays]]
* Nearly any large [[Queue or Buffer]], such as those in store and forward networks such as the memory hierarchy.
** See [[The Memory Hierarchy is a Store and Forward Network]]

Similar principles of array port reduction apply to all of these arrays.
Simplest is when the arrays are indexed solely and uniquely, e.g. by a memory address or register number.
One might say that these are the primary key.
More challenging is when the array has such a *spatial* index key,
but also has a *temporal* aspect.
E.g. "find the youngest store to the same address older than the present store".

Spatial/temporal may well be a useful distinction, although two dimensional X/Y may also apply.

= Bypassing =

The first principle is as mentioned in the examples of RF port reduction and renamer port reduction:
that group of N operations may not really need to do N*L reads and N*K writes.
Since we are often talking about instructions, some instructions may produce values that later instructions consume. I call this generically "bypassing": it may take the form of conventional execution unit bypassing, or dependency checks inside a block of instructions at the renamer. The P6 [[Capture RS]] that led to Papworth's original ROB Read Port Reduction is a form of [[bypass cache]] in some sense.

= Banking =

Classic spatial banking: banks in a pseudo-multiported cache. Banks in a large register file, such as a GPU.

Banking often seeks to do the opposite of [[spatial locality]] - you try to arrange the banking function, usually a bitfield extraction, occasionally a hash - so that locations likely to occur together are likely to occur in different banks, not the same.

Temporal banking - or otherwise exploiting temporal proximity. E.g. as described in [[Register Renamer Port Reduction]], where uops' inputs are compared to neighbouring older uops first, and then to more distant uops only if the recent predecessor comparison fails.

If these comparisons are done in batches with hard boundaries, it is a form of temporal banking.
If these comparisons are done smoothly - all neighbous compared, etc. - then it is a different way of exploiting temporal proximity, not banking.
Intermediate between these is to divide the temporally sorted array into banks,
and to always compare an element to two or three banks:
* 3 banks: the bank the element is in, the preceding and succeeding bank.
* 2 banks:
** the bank the unit is in
** the preceding bank if the element is close to the beginning of the bank
** the succeeding bank if the element is close to the end of the bank
** possibly no other bank, if the element is in the middle of its own bank.

    E.g. based on the two most significant bits of the sequence number
    00bb...b => compare to this bank plus preceding
    11bb...b => compare to this bank plus succeeding
    01bb...b OR 10bb...b => don't compare
    or 3 bits ...

    [[AG Novelty?]] - I've never seen this proposed before. It seems so obvious...

= Read and Write Combining =

TBD: combining circuits:

;(1) CAMs: compare all operations to all operations

;(2) Choose N oldest (or highest priority) unique. Then CAM for combining.

;(3) Spatial hash to queues.
Within the queues
* or just compare nearest neighbours

;(4) Sorting networking

;Sort by index.
Choose N lowest.
ISSUE: loss of age or priority info

;Sort by index, spatially
;Also sort by priority, e.g. time.
(Maybe different sorts, time and priority)

Spatial sort is of tuples (index,seqnum).
When matching indexes (index,seqnum1), (index,seqnum2) link seqnum2 to seqnum1,
and only keep (index,seqnum1) in the sorting datastructure.

Equivalent, keep redundant values in the temporal sort.

Compute dominators in the temporal sort. Extract the N highest priority dominatos in the temporal sort.

Sounds expensive. Less expensive time-wise if thinking about special sorting hardware.


= Caches =

Small highly ported array,
combined with a larger, less ported array.

May exploit temporal and/or spatial locality.

E.g. may have 1r/1w port on an array of large cache line of 64 bytes,
but have more, narrower, ports within a smaller array.

May use such a cache not to improve latency, but solely to manage ports.
Which are a special form of bandwidth cache. (See [[Varieties of Cache]].)

= TBD =

Q: what other forms of array port reduction have I encountered?

Register Renamer Port Reduction - CompArch

Register Renamer Port Reduction - CompArch: "Register Renamer Port Reduction"

On comp.arch circa August 2010, Mitch Alsup a little while back said something like "It isn't the ports on the register file that are a problem, it's the ports on the renamer." With, IIRC, a comment that you had to squeeze the renamer into a single pipestage.

Not true. This reminds me of a comment by Tim Olson, then of AMD (and active on comp.arch), when I presented HaRRM, my Hardware Register Renaming Mechanism (which is basically the modern form of renaming) to him. He said that it would lose because it required an extra pipestage.

First, renamer ports: the renamer is a much smaller structure than the PRF, in terms of number of entries times number of bits per entry. This means that the wires involved are shorter, even though the logic depth, Mitch's preferred metric in most of his posts, is similar. Wires are a major contributor to delay.

Second, "port reduction" techniques can be applied to the renamer just as they can be applied to the register file. True, out-of-order execution may allow the execution units to pick up a higher fraction of their inputs on bypasses, whereas the renamer is essentially an in-order pipestage, and in-order means less port reduction. Nevertheless, there's a lot to latch on to.

= Components of Register Renaming =

Register renaming has three main parts:

a) bypassing: in a block of instructions, comparing the output registers of older instructions to the input registers of younger instructions

b) lookup: if a register is live-in to a block, i.e. if it is not bypassed, then looking it up in the renamer/map table.

c) allocation: if a register is written within the block, giving it a new physical register (and updating the array); also, arranging for this newly allocated register to be used as inputs by younger instructions in the block

= Lookup =

Lookup is the part that most people think of as accessing a highly ported array. E.g. if you are renaming 4 instructions per cycle, with 2 inputs, you might be looking up 8 inputs per cycle, and writing 4 new outputs.

But observe that whenever there is bypassing you don't need lookups.

For example, consider the code sequence

ADD eax += M[ecx+edx]
tmp := load(ecx+edx)
eax := eax + tmp
CMP eax,ebx
JNE target

The above group of 4 uops only has 4 live-ins, not 8. Only 4 values need to be looked up in the RAT array. The remaining 3 input values (tmp, eax in the CMP, JNE) are produced by earlier uops in the same block.

True, P6 brute forced this: we renamed three uops per cycle, we always looked up 6, one for each register operand - and we threw away any lookup that was bypassed within the block. This avoided the need to first do the bypassing, and then

However, even before P6, when I was working on this form of register renaming at UIUC during my MSEE, I anticipated ways to avoid this. Of course, that was back when I was trying to create 16-wide superscalar machines, so reducing complexity was highly required.

== Non-Blocking Limited Port Register Renaming ==

IMHO the most elegant way is similar to what is also IMHO the most elegant way to do RF read port reduction:

1) do the bypass comparisons of uop output to uop input earlier, possibly as a separate pipestage.

2) out of this previous work, generate a list of live-ins that need to be looked up. Have a limited number of lookup ports. Of course use port combining, so multiple reads of the same register live-in only get looked up once - although possibly only after some time.

The question is, what happens when you run out of lookup ports? You could do an in-order stall. But, being me, I like to let later instructions that are not dependent on one of the blocked lookups get renamed and proceed to the scheduler. I.e. I want to do out of order renaming. Actually, I think the easiest way to do this is to let the guys blocked on a renamer lookup port proceed to the scheduler, but to have a separate write port that fills in the physical register input later. This is not actually out-of-order: all uops are renamed in-order, but some are immediately renamed to physical registers, while others are renamed to placeholders that later get filled in with physical register numbers. I.e. the renaming is in-order, but the renaming all the way to physical registers may be out of order. (In my most BS-full moods I have imagined that the renaming is actually done out of an OOO scheduler, since even the bypass comparisons might be done with a non-full network, as will be discussed next.)

= Caching Renamed Instructions =

And, oh yes: it is trivial to imagine caching the renames, the results of the bypass comparisons, in a decoded instruction cache / trace cache. That is one of the reasons I worked on trace cache, way back then. You get blocks of instructions - possibly fixed size blocks, possibly variable length traces - with N live-ins that need to be looked up, and M internally bypassed values that are given relative numbers, i.e. "use the K-th register that is allocated to this block."

= Bypassing Renames Inside Instruction Blocks =

Now, how about the bypassing? Apart from the
caching thereof, mentioned above.
Naively, that's O(N^2) comparators:
1st uop
- lookup all of its inputs (some combining)
- compare its output to all N-1 younger uop's inputs
2nd up
- if its inputs are not bypassed from uop #1
lookup all of its inputs (some combining)
- compare its output to all N-2 younger uop's inputs

Dana Henry and Brad Kuzsmaul's Ultrascalar showed that you can do this in O(lg N) work. But that's asymptotic, for really big machines.

== Per Logical Register Renaming Circuit ==

Here's another way to do bypassing, suitable for machines with small register sets:
* create a "carry chain" for each logical register
* 2 bits, states "not used", "written", "needs to be looked up"
* insert not used at the oldest instruction
* pull down whenever written by a uop, in the sense
older next
nu nu => nu
lu * => lu
w * => w

This is O(N) delay, O(N*L) hardware, where N= number of uops, L=number of logical registers. It suffices to tell you what registers are live-in, and need to be looked up.

If you want, you can make the values inserted be "written by uop #k",
and thereby establish the dataflow. Increases the hardware cost by a factor, i.e. multiplying, by *log2(N) bits. You take those sequence numbers, and lookup a queue of free registers, and/or add to a base (the latter if doing a P6-style circular allocation in a ROB).

(I've heard that some companies did freelists as CAMs. Enough so that they veered away from freelists in subsequent processors. Silly people.)

Of course, O(N*L) is not a win over O(N^2) unless L is small. Which it might have been, back when I imagined doing this for a 16-wide x86. But which is no longer, now that N seems to be stuck around 4, and L is 32 or more.

== Incomplete Comparisons ==

Note that you can also do an incomplete comparison: e.g. compare most uops to their immediate predecessor, which catches many dependencies. Queue up requests to do comparisons that are further apart on a smaller set of comparators.

Monday, September 06, 2010

Ulfs - my user level filesystem - Andy Glew's Wiki

Ulfs - my user level filesystem - Andy Glew's Wiki: "Ulfs - my user level filesystem"

ulfs - my User Level FileSystem.

wiki: http://wiki.andy.glew.ca/wiki/Ulfs_-_my_user_level_filesystem

blog: ...

Have wanted to code this up for a long time.
Never could, because anything I did belonged to my employer.
Want it open source, so that I can use it for my personal stuff - wikis, version control.
With my current employment agreement I can work on open source, so starting.

Today got the most trivial put, get, and delete coded & tested.
Not much, but its a start.

I want a user level filesystem, with at least two bindings
* UNIX command line
* Perl
I imagine that I will eventually want python, javascript, etc.
TBD: generate such via swig?

I want command line, because I'm a UNIX old fart. I still think of the shell as a good way to compose operations, e.g. via pipes.

I want scripting languages like Perl, because much code will be in there.
But mainly because I will implement at least first version in Perl.

Storage types:
* ordinary UNIX (or Windows) hierarchical filesystem tree
* archive formats, like tar
* possibly VC tools, like git, hg, bzr, ...
** delta storage
* possibly databases, even relational
* possibly hybrid filesystem/database

: any of the above may have flavors - e.g. will have metadata in the ordinary filesystem

Filesystem Names
* use the name specified by the user
* allocate and return a name
** sequentially
** content based hashing
*** handling collisions
* metadata queries
:: all of the above, both explicit, implicit, and content based
* versioning

Command Line Interface:

ulfs -mkfs test.fs -type ordinary
ulfs -mkfs test.fs.tar -type tar

my $ulfs = ULFS_Ordinary::mkfs("test.fs");
my $ulfs = ULFS_Tar::mkfs("test.tar.fs");

my $ulfs = ULFS::openfs("test.fs");

basic operations
ulfs -fs test.fs -to testdata1-in-fs < testdata1 -put
ulfs -fs test.fs -from testdata1-in-fs > testdata1.retrieved -get
ulfs -fs test.fs -from testdata1-in-fs -delete


other operations

As is my wont, some thoughts:

If I try to use this as a net or web based filesystem, I may want to have predicates specified by the client
executed by the server. E.g. compare-and-swap: put this file, but only if the old version's content is ...

Operations like rmdir are really simple predicates: only remove the name if that name is a directory.

Friday, September 03, 2010

apenwarr - Business is Programming

apenwarr - Business is Programming: "The sad evolution of wikis"


Wikipedia is not wiki.

But... I'm contributing. I want comp-arch.net to be a controlled wiki - in part because I've already been spammed, but also because I eventually want to make it into a book. Or at least the wiki equivalent of one.

Also because i have specific contractual requirements. Certain things i can write now, but not publish for a few years. Because I'm lazy, I want to write them on the wiki now, and have them go public later.

Hence my half-assed semipublic/public comp-arch.net. And my dreams of a more complete solution.

The original poster said:

How do you create a vibrant community, but allow for private topics and discussion, but allow for public topics and discussion, and allow me to work for more than one company at a time with multiple private discussions, and have my WikiWords always end up pointing where they're supposed to?

I have no idea.

I actually have quite a few ideas along this direction.

I actually have arranged, in my employment contract, the legal right to work on such, even while employed.

Unfortunately, getting the time to work on this is my problem. What time I have, I tend to spend on comp.arch or comp-arch.net. I.e. content. Improving infrastructure requires blocks of time I only get on vacation.