Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, October 07, 2010

Accumulator ISA vs Uarch

wiki at: http://semipublic.comp-arch.net/wiki/Accumulator_ISA_vs_Uarch

Older article on accumulators:
http://semipublic.comp-arch.net/wiki/Accumulators

I'm at home sick today, so I've got to be careful: last time I posted on comp.arch when I was home sick, I set the agenda for 4+ years of work.

First: 22+ gates may well fit in a pipestage. People are headed past 28 FO4, in an effort to tolerate device variation (the more gates / pipestage, the higher your yield).

Of course, when you do this you usually end up trying to cram several pipestages' worth of work into one stage - as, one might say, redundant bypassing does.

---

As for accumulator architectures and microarchitectures: it's a love/hate relationship.

First, we must distinguish instruction set architecture from microarchitecture.

= Accumulator ISA =

Accumulator ISAs save instruction bits, and may lead to very small code. When we have the ability to reorganize instructions over long distances, accumulator ISAs need not hurt performance.

By the way, some might say that "2 operand" ISAs are accumulator based:

rD += rS

Myself, although that is more accumulator-like than the three-operand form (I hate these terms, but must use them)

rD := rS1 + rS2

I tend to be a stickler, and say that either there is a single accumulator

acc += rS

or maybe a small number of accumulators, acc0-3 versus r0-31

acc0-3 += rS0-31
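
To put rough numbers on the bit savings: with 32 general registers, the three-operand form needs 3 x 5 = 15 bits of register specifiers per instruction, the two-operand form needs 10, a single-accumulator form needs only 5, and a 4-accumulator form needs 5 + 2 = 7.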

== Accumulator versus Stack ==

By the way, for similar instruction size issues, I sometimes look fondly on stack ISAs. Even fewer bits than accumulators.
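
In the limit, a pure stack ISA needs zero register-specifier bits for an ADD: both operands come implicitly from the top of the stack, and the result goes back there.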

= Accumulator Microarchitecture =

Now, let's transition into microarchitecture. I call it an accumulator microarchitecture when the accumulator (or accumulators) are "close to" one or more ALUs.

I.e. you may have a small number, say 4, of accumulators close to each individual ALU. E.g. you might have 3 ALUs, with 4 accumulators each.

Or you may have a small number of accumulators shared between a group of ALUs.

Physically, the term accumulator may be associated with a different circuit design: e.g. the accumulators might be flops, whereas the registers might be in an SRAM array.

I don't call it an accumulator microarchitecture when you have, say, 2 clusters of 4 ALUs, each tightly bound to 16 registers. I call that a partitioned register file. But I might be willing to call it an accumulator if there were only 4 such tightly bound registers next to the ALUs. Obviously, this is a matter of degree.

I might call it an accumulator architecture if there were 2 such ALU clusters, each with 4 accumulator registers, and a big shared pool of 64 registers equally accessible to all. But I would require that there be instructions of the form

cluster.accum_i += global_reg_j

If the global registers could not be used as operands of at least the usual ADD instruction, but were restricted to moving to and from the accumulators, I would call it a two level register file, not an accumulator microarchitecture.
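
A rough structural sketch of the distinction, in C used purely as a description language - the cluster, accumulator, and register counts are made up for illustration:

#include <stdint.h>

/* Hypothetical sketch: a few ALU clusters, each with a handful of
   private, flop-based accumulators, plus a large shared register file
   (e.g. an SRAM array) readable by all clusters. */

#define NUM_CLUSTERS     2
#define ACCS_PER_CLUSTER 4
#define NUM_GLOBAL_REGS  64

typedef struct {
    uint64_t acc[ACCS_PER_CLUSTER];   /* "close to" this cluster's ALU */
} alu_cluster;

typedef struct {
    alu_cluster cluster[NUM_CLUSTERS];
    uint64_t    global_reg[NUM_GLOBAL_REGS];
} datapath;

/* The operation required above for this to count as an accumulator
   microarchitecture rather than a two-level register file:
   cluster.accum_i += global_reg_j, with the global register usable
   as an ordinary ADD operand. */
void acc_add_global(datapath *d, int c, int i, int j)
{
    d->cluster[c].acc[i] += d->global_reg[j];
}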

== OOO implementation of Accumulator ISA ==

Accumulator ISAs implemented on accumulator microarchitectures are natural for in-order machines.

On an out-of-order machine, it is trivial to rename the accumulators of the ISA to a single PRF. In which case, the hypothetical advantages of the accumulator ISA are lost.
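
A minimal sketch of that renaming, assuming a hypothetical ISA with 4 accumulators and a 64-entry physical register file (names and sizes are illustrative only):

#include <stdint.h>

/* Rename an architectural accumulator write onto a flat physical
   register file, as an ordinary OOO renamer would. After renaming,
   the "accumulators" are just physical registers like any others. */

#define NUM_ACCS  4
#define NUM_PREGS 64

typedef struct {
    uint16_t map[NUM_ACCS];          /* arch accumulator -> physical reg */
    uint16_t free_list[NUM_PREGS];
    int      free_count;
} renamer;

/* Rename "acc_d += rS": the destination accumulator is also a source,
   so read its current mapping, then allocate a fresh physical register
   for the new value. (Free-list refill and recovery are omitted.) */
void rename_acc_add(renamer *r, int acc_d,
                    uint16_t *old_preg, uint16_t *new_preg)
{
    *old_preg = r->map[acc_d];                  /* old value of acc_d */
    *new_preg = r->free_list[--r->free_count];  /* new destination */
    r->map[acc_d] = *new_preg;
}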

== OOO implementation of Accumulator Microarchitecture ==

Conversely, it is straightforward to implement a "bypass cache" for a flat register file instruction set, taking advantage of the fact that a value need not be written into the main register file before it is reused. A bypass cache is a generalization of generic bypassing: a bypass cache adds state, and is not just restricted to bypassing from operands in flight.

The bypass cache state is very much like an accumulator microarchitecture. With at least one important difference:

Whether the bypass cache is CAM-indexed by the physical register number in the main RF is one important difference. A second important difference is whether the bypass cache can miss - whether a value you expect to receive from the bypasses or the bypass cache may not be there when you expect it to be, so that you have to fetch it from the main RF. I sometimes call this a priori versus a posteriori caching.
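
A minimal software sketch of the "a posteriori" flavor - a handful of entries CAM-indexed by physical register number, where a lookup may miss and fall back to the main RF. The entry count and replacement policy here are illustrative, not any particular design:

#include <stdint.h>
#include <stdbool.h>

#define BYPASS_ENTRIES 8

typedef struct {
    bool     valid;
    uint16_t preg;      /* physical register number (the CAM tag) */
    uint64_t value;
    uint32_t lru;       /* replacement age */
} bypass_entry;

typedef struct {
    bypass_entry e[BYPASS_ENTRIES];
} bypass_cache;

/* Writeback path: capture a newly produced result. */
void bypass_insert(bypass_cache *bc, uint16_t preg, uint64_t value)
{
    int victim = 0;
    for (int i = 0; i < BYPASS_ENTRIES; i++) {
        if (!bc->e[i].valid) { victim = i; break; }
        if (bc->e[i].lru > bc->e[victim].lru) victim = i;   /* oldest */
    }
    bc->e[victim] = (bypass_entry){ .valid = true, .preg = preg,
                                    .value = value, .lru = 0 };
    for (int i = 0; i < BYPASS_ENTRIES; i++)
        if (i != victim && bc->e[i].valid) bc->e[i].lru++;  /* age others */
}

/* Operand read: hit -> value comes from the bypass cache;
   miss -> caller reads the main register file ("a posteriori"). */
bool bypass_lookup(const bypass_cache *bc, uint16_t preg, uint64_t *value)
{
    for (int i = 0; i < BYPASS_ENTRIES; i++)
        if (bc->e[i].valid && bc->e[i].preg == preg) {
            *value = bc->e[i].value;
            return true;
        }
    return false;    /* miss: fall back to the main RF */
}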

If, however, there is separate renaming for accumulator registers and main registers, then you have, more exactly, an OOO implementation of, say, a flat RF ISA on an accumulator uarch. But you have all sorts of questions about policy: when do you make a copy of a main RF register in an accumulator, and vice versa?

(Hey, here's an idea: in the scheduler, CAM on main RF regnums, but also broadcast when a main RF regnum has been moved to an accumulator. Issue the instruction with non-CAM-indexed accum regnums... How to handle replacement? Probably LRU, modified by overwriting - since so many values are only read once.)

== OOO implementation of Accumulator ISA on Accumulator Uarch ==

What about combining the two? Well, obviously, you could rename from ISA accum into flat RF, and then into uarch accums. But that rather defeats the purpose.

We might rename accums and main regs separately. The ISA accums guide replacement in the physical accums.

Issue: what if ISA accums and physical accums are not the same in number?

Here's an observation: one of the biggest advantages of accumulator ISAs is that they tend to reuse the accumulators quickly. I.e. they tell us when values are dead, and hence can be more quickly recycled. You might get the same effect by a "kill register" bit.

= Summary =

Accumulators, whether ISA or uarch, are part of our tool kit.

I do not think that there is a uniform way of saying one is better than another. Certainly, at the moment flat RFs preponderate.

Much of the benefit of accumulators can be obtained by bypass caching.

I would hesitate to add true accumulators to a new ISA unless (a) codesize really matters, or (b) it is an embedded system, with the ability to change uarch and ISA every generation.

(But then, that's what they always say.)

On the other hand, I am scared of implementing accumulators, whether in ISA or uarch.

== Accumulators for Extended Precision ==

I forgot to mention one of the biggest advantages of accumulators: extended intermediate precision.

Many DSPs have, for example, 24 bit accumulators to add chains of 16 bit numbers.
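
A toy C illustration of the same idea, with a 32-bit accumulator standing in for the DSP's wider accumulator:

#include <stdint.h>

/* Summing even a few hundred 16-bit samples can overflow 16 bits,
   but a wider accumulator holds every intermediate sum exactly, with
   no wraparound or rounding until the final store. */
int32_t sum_samples(const int16_t *x, int n)
{
    int32_t acc = 0;          /* wide accumulator */
    for (int i = 0; i < n; i++)
        acc += x[i];          /* each addend is only 16 bits */
    return acc;
}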

One might consider FMA or FMAC, floating point multiply-add without intermediate rounding, to be a hidden internal accumulator. Except that it isn't - it flows. If you exposed that intermediate value...
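
The no-intermediate-rounding behavior is visible from C via the standard fma() function; a small example, with values chosen so that rounding the product separately loses the low bits (a compiler that contracts a*b + c into an FMA on its own would need contraction disabled, e.g. -ffp-contract=off):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-30;    /* 1 + 2^-30 */
    double b = 1.0 - 0x1p-30;    /* 1 - 2^-30 */
    double c = -1.0;

    /* a*b = 1 - 2^-60 exactly, which rounds back to 1.0 in double,
       so the separately rounded expression gives 0. */
    printf("a*b + c    = %g\n", a * b + c);
    /* fma() feeds the unrounded product into the add: -2^-60. */
    printf("fma(a,b,c) = %g\n", fma(a, b, c));
    return 0;
}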

I keep waiting for a nice formulation of exact, infinite precision, vector sum reductions, dot products, etc. Trivially, that can be done via a [[superaccumulator]] - but although I like superaccumulators, they are pretty expensive. What, 1K or 2K bits?

An ISA-level accumulator architecture of the form

superacc := 0
for all i superacc += a[i]

is nice in that you could imagine mapping it to the infinite internal precision versions of 4-way dot product, etc., that one sees in some instruction sets. (E.g. MicroUnity.)
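
For a feel of the cost, here is a minimal software sketch of a superaccumulator; the 40-word (2560-bit) layout is my own choice, and it assumes non-negative, finite inputs, omitting sign and special-value handling:

#include <stdint.h>
#include <string.h>
#include <math.h>

/* A superaccumulator: a fixed-point register wide enough that any IEEE
   double's mantissa lands in it exactly, so sums are exact regardless of
   order. Bit 0 of w[0] represents 2^-1126; 40 64-bit words cover the full
   double exponent range with headroom for carries out the top. */
#define SUPERACC_WORDS 40

typedef struct { uint64_t w[SUPERACC_WORDS]; } superacc;

static void superacc_clear(superacc *s) { memset(s, 0, sizeof *s); }

/* Add x exactly. Assumes x >= 0 and finite; a real design also handles
   sign (e.g. a two's-complement accumulator) and special values. */
static void superacc_add(superacc *s, double x)
{
    int e;
    double m = frexp(x, &e);            /* x = m * 2^e, 0.5 <= m < 1 (or 0) */
    uint64_t mant = (uint64_t)(m * 9007199254740992.0);   /* m * 2^53 */
    int pos  = e + 1073;                /* bit position of mant's bit 0 */
    int word = pos / 64, shift = pos % 64;
    uint64_t lo = shift ? mant << shift : mant;
    uint64_t hi = shift ? mant >> (64 - shift) : 0;

    s->w[word] += lo;
    uint64_t carry = (s->w[word] < lo);             /* wrapped? */
    for (int i = word + 1; i < SUPERACC_WORDS && (hi | carry); i++) {
        uint64_t add = hi + carry;
        carry = (add < hi);
        s->w[i] += add;
        carry  += (s->w[i] < add);
        hi = 0;
    }
}

/* Approximate read-back: rounds once per nonzero word. */
static double superacc_value(const superacc *s)
{
    double r = 0.0;
    for (int i = 0; i < SUPERACC_WORDS; i++)
        r += ldexp((double)s->w[i], 64 * i - 1126);
    return r;
}

The ISA-level loop above then maps onto superacc_clear() followed by repeated superacc_add() calls; a hardware version would keep the wide register in the datapath rather than in memory.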

I'd like to see lower cost but still infinite precision ways of doing such FP vector reductions.