The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, April 18, 2011

Wide operand cache

= By Microunity =

The [[wide operand cache]] is a concept originally by [[Microunity]] =

* you want to design a RISC machine, with "naturally 32 or 64 bit regissters.
** or even 1024 bit registers - no matter how wide you go, there will nearly always be a reason to have significantly wider operands.
* but you also want to support "simple" instructions that just happen to have very wide operands.

See [[#Examples of instructions with wide operands]].
For the purposes of this discussion, we will consider [[BMM (bit matrix multiply)]].
BMM wants to have one, say, 64 bit operand, and a second operand, the matrix, that is 64x64 bits in size - 4Kib.

Defining an instruction such as
BMM destreg64b := srcreg64b * srcreg4x64b
could be possible - but it has many issues:
* how many 64x64=4Kib operand registers do you want to define? 1? 2? 4?
* you probably do not want to copy such a wide operand around - instead, you might want it to live inside its execution unit

The [[wide operand cache]] approach is as follows: define an instruction with a memory operand
BMM destreg64b := srcreg64b * M[...]
* You may spercify the memory operand simply as register direct, M[reg], or you may define it using a normal addressing mode such as M[basereg+offset], or even using scaled indexing.

Conceptually, the wide operand is loaded before every use:
BMM destreg64b := srcreg64b * M[addr]
tmp64x64b := load M[addr]
destreg64b := srcreg64b * tmpreg64x64b

However, we may avoid unnecessary memory traffic by caching the wide operand.

This leads to the basic issue: [[#TLB semantics versus coherent|]].

= TLB semantics versus coherent =

The [[wide operand cache]] concept admits a basic question:
* do you snoop the wide operand cache, keeping it coherent with main memory, so that if somebody writes to a cached wide operand it is reflected in the next execution of the instruction?
* or do you use "[[TLB semantics]]", i.e. making it a [[noncoherent cache]], requiring the user to explicitly invalidate it (requiring an [[invalidate wide operand cache(s) instruction]].

If coherent then an implementation can vary the number of wide operand cache entries transparently.

If noncoherent then not only can the number of entries be detected, but also you run the risk of getting different answers depending on the context switching rate.

[[Glew opinion]]: I prefer coherent, but would live with noncoherent if necessary.

= So Close... =

Frustrating anecdote: at AMD I was working, with Alex Klaiber, out how to support instructions like with very wide operands, such
as [[BMM]]. My whiteboard was full of scribblings, with the basic question of [[#TLB semantics versus coherent|]].

I then went to a meeting where Microunity presented their patents.

So close... Actually, probably off by 5-10 years. But I was following a path that I did not know had been trailblazed...

= Examples of instructions with wide operands =
Any instruction that has an operand that is significantly wider than some of its inputs,
and which either tends to be constant, or which tends to be modified in place,
is a candidate for a [[wide operand in memory]] implemented via a [[wide operand cache]].
For that matter, one could have wide operand instructions whose operands are all pseudo-regosters in a wide operand cache.
* [[BMM (bit matrix multiply)]] - 64b X 64x64b
* [[vector-matrix instructions]] - N X NxN
* [[permutation index vector instruction]] - Nbits*log2(Nbits)
* [[permutation bit matrix instruction]] - really a form of [[BMM]]
* [[superaccumulator]] - 32 bit - thousands ...
* [[regex instructions]] - with large regex operands that can be compiled
* [[LUT (lookup table) instructions]]
* [[texture sampling]]
* [[interpolation instructions]]
* [[CAM instructions]]


No comments: