The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, April 18, 2011

Potential customers for tagged memory


There are many potential customers for [[tagged memory]] or [[memory metadata]].
I have long maintained a list (or, rather, lists, since I have had to recreate these lists on 3 occasions)
of uses for tagged memory.
Here is a list that is by no means complete (previous versions of this list have exceeded 20 items).

Which should be allocated the tags?
Obviously, my favorite idea...

Glew opinion: tagged memory should be microarchitecture, not macroarchitecture. If there are physical tags, it is a good idea to be able to use them to speed things up. But there should always be a fallback proposal that does not depend on having main memory tags. See Milo Martin's Hardbound for an example of how to have metadata in memory, without tags.

* Tagged memory to identify data types - integer, floating point ...
:* Problem: there is an infinite number of types ...
:* 1..n bits. Per qword? Maybe per byte?
* Tagged memory to aid in [[garbage collection]]. [[Read and write barriers]].
:* 1 bit (or more) per minimum allocation unit, 1-4 words
* Tagged memory to aid security, e.g. to create [[non-forgeable pointers]] as in the IBM AS400
:* This may well be my favorite use of only a single bit of tagged memory.
:* 1 bit per pointer - 64 or 128 bits
* Tagged memory for [[taint or poison propagation]], as in [[DIFT (Dynamic Information Flow Tracking)]]. E.g. Raksha.
:* 1 bit per ... word? byte?
:* Raksha uses multiple bits
* Tagged memory for [[transactional memory]] support
:* 1 bit per ... many TM systems use cache line granularity, although smaller better
* Tagged memory for [[debugging]]
:* 1 bit per ... can deal with coarse granularity, e.g. by trapping, determining not interested, and going on.
* Tagged memory for [[performance monitoring]] - e.g. to mark all memory accessed in a time interval.
:* 1 bit per cache line
* Tagged memory for [[uninitialized memory]]
:* 1 bit per ... byte? word?
* Tagged emory for [[synchronization]] of parallel programs. E.g. [[full/empty bits]].
* Fine grain [[copy-on-write]] data structures
** [[Myrias style parallelism]]
:* 1 bit per ... byte? word? ...

Heck, I suppose that conventional width oriented ECC can be considered a form of tagged memory. And although ECC is common, at least at the time of writing it is not in 99% of all PCs. It causes problems enough to warrant the invention of [[Poor Man's ECC]], a non-width oriented implementation of ECC that avoids physical tagged memory.

Wide operand cache

= By Microunity =

The [[wide operand cache]] is a concept originally by [[Microunity]] =

* you want to design a RISC machine, with "naturally 32 or 64 bit regissters.
** or even 1024 bit registers - no matter how wide you go, there will nearly always be a reason to have significantly wider operands.
* but you also want to support "simple" instructions that just happen to have very wide operands.

See [[#Examples of instructions with wide operands]].
For the purposes of this discussion, we will consider [[BMM (bit matrix multiply)]].
BMM wants to have one, say, 64 bit operand, and a second operand, the matrix, that is 64x64 bits in size - 4Kib.

Defining an instruction such as
BMM destreg64b := srcreg64b * srcreg4x64b
could be possible - but it has many issues:
* how many 64x64=4Kib operand registers do you want to define? 1? 2? 4?
* you probably do not want to copy such a wide operand around - instead, you might want it to live inside its execution unit

The [[wide operand cache]] approach is as follows: define an instruction with a memory operand
BMM destreg64b := srcreg64b * M[...]
* You may spercify the memory operand simply as register direct, M[reg], or you may define it using a normal addressing mode such as M[basereg+offset], or even using scaled indexing.

Conceptually, the wide operand is loaded before every use:
BMM destreg64b := srcreg64b * M[addr]
tmp64x64b := load M[addr]
destreg64b := srcreg64b * tmpreg64x64b

However, we may avoid unnecessary memory traffic by caching the wide operand.

This leads to the basic issue: [[#TLB semantics versus coherent|]].

= TLB semantics versus coherent =

The [[wide operand cache]] concept admits a basic question:
* do you snoop the wide operand cache, keeping it coherent with main memory, so that if somebody writes to a cached wide operand it is reflected in the next execution of the instruction?
* or do you use "[[TLB semantics]]", i.e. making it a [[noncoherent cache]], requiring the user to explicitly invalidate it (requiring an [[invalidate wide operand cache(s) instruction]].

If coherent then an implementation can vary the number of wide operand cache entries transparently.

If noncoherent then not only can the number of entries be detected, but also you run the risk of getting different answers depending on the context switching rate.

[[Glew opinion]]: I prefer coherent, but would live with noncoherent if necessary.

= So Close... =

Frustrating anecdote: at AMD I was working, with Alex Klaiber, out how to support instructions like with very wide operands, such
as [[BMM]]. My whiteboard was full of scribblings, with the basic question of [[#TLB semantics versus coherent|]].

I then went to a meeting where Microunity presented their patents.

So close... Actually, probably off by 5-10 years. But I was following a path that I did not know had been trailblazed...

= Examples of instructions with wide operands =
Any instruction that has an operand that is significantly wider than some of its inputs,
and which either tends to be constant, or which tends to be modified in place,
is a candidate for a [[wide operand in memory]] implemented via a [[wide operand cache]].
For that matter, one could have wide operand instructions whose operands are all pseudo-regosters in a wide operand cache.
* [[BMM (bit matrix multiply)]] - 64b X 64x64b
* [[vector-matrix instructions]] - N X NxN
* [[permutation index vector instruction]] - Nbits*log2(Nbits)
* [[permutation bit matrix instruction]] - really a form of [[BMM]]
* [[superaccumulator]] - 32 bit - thousands ...
* [[regex instructions]] - with large regex operands that can be compiled
* [[LUT (lookup table) instructions]]
* [[texture sampling]]
* [[interpolation instructions]]
* [[CAM instructions]]