Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, August 05, 2010

Chrome/Adobe PDF reader errors

I love the Chrome browser, but in combination with Adobe Reader it frequently hangs when reading PDFs: "There is an error reading the document"; click OK, and the same error comes right back.

Probably an Adobe error, but nevertheless the fact that it works with Firefox and IE but not Chrome looks bad.

Table indexing - CompArch

Table indexing - CompArch: "Table indexing
When I say that 'a table is indexed by a value V', I do not mean that all of the bits of V are used to form an address. Although that could be done, for something like a 64 bit address it would require a very large table.
Rather, I mean that the value V is used to determine which entry in the table is to be used. Some bits of V may be used as a number, to index the table. I call that RAM indexing. A hash of V may also be used.
Some or all of the bits of V may be matched against addresses already in the table. This is called CAM indexing.
RAM indexing and CAM indexing may be combined. E.g. RAM indexing may determine a set containing N tags, which are compared against the other bits of V to determine which entry within the set (often called which way of the set) should match.
For that matter, I also call it CAM indexing if bits of V are used to determine a 1-of-N code, and that 1-of-N code is compared to multiple entries in a table. See Decoded CAM versus Encoded CAM.
Thus, the term 'indexing' is general, not limited to the specific form of RAM indexing"
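
To make the combined case above concrete, here is a minimal C sketch of a RAM-indexed, CAM-tag-matched table. This is my own illustration, not part of the wiki page; the table geometry, field names, and function names are made up:

 #include <stddef.h>
 #include <stdint.h>
 #include <stdbool.h>

 #define NUM_SETS 16   /* RAM-indexed dimension (illustrative size) */
 #define NUM_WAYS 4    /* CAM-matched dimension (illustrative size) */

 struct entry {
     bool     valid;
     uint64_t tag;     /* the bits of V that are matched CAM-style */
     uint64_t data;
 };

 static struct entry table[NUM_SETS][NUM_WAYS];

 /* RAM indexing: some bits of V are used as a number to pick a set;
  * a hash of V could be used here instead. */
 static unsigned ram_index(uint64_t v)
 {
     return (unsigned)(v % NUM_SETS);
 }

 /* CAM indexing: the remaining bits of V are compared against the tags
  * already stored in the selected set, one comparison per way. */
 static struct entry *lookup(uint64_t v)
 {
     unsigned set = ram_index(v);
     uint64_t tag = v / NUM_SETS;

     for (int way = 0; way < NUM_WAYS; way++) {
         struct entry *e = &table[set][way];
         if (e->valid && e->tag == tag)
             return e;   /* hit: this way of the set matched */
     }
     return NULL;        /* miss */
 }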

Cache Ways and Sets: opposing definitions

I have observed the following confusion in terminology wrt the meanings of "way" and "set" in a cache. I am most familiar with the terms as used at Intel, and earlier than that at Motorola and Gould.
I have encountered the alternate definition mainly with people from [[DEC]] backgrounds,
although I am not sure that it is endemic.

First off, both camps agree on what is a 4-way associative cache.

E.g. let us discuss a 4-way associative, 256KiB cache, with 64B cache lines
and a 64 bit physical address, where
* bits 0-5 are the *offset* within the cacheline
* bits 6-15 are the *index* within the cache
* the remaining bits, 16-63, are the tag

= My Terminology =

In the Intel terminology, the 10 index bits 6-15 select one of 1024 (2^10) "sets" within the cache.

Each set holds 4 cache lines. The tag bits of the requested address are compared to the tags of the 4 ways within that set. If there is a match, the data for that way is accessed.

I.e. there are 1024 sets.  Each set holds up to 4 of the 2^48 possible cachelines that map to that set, i.e. the cachelines that have that set's particular index field.

A way is one of the 4 elements within a set.
Although a set is an unordered collection,
we number the ways for reference:
 { set[index].way0, set[index].way1, set[index].way2, set[index].way3 }

A way can also refer to the collection of all of the ways of a particular number,
for all sets of all indexes.
It might be more correct to refer to this as a "way array"
e.g.
 way0_array = { set[i].way0 } for all i = 0..1023

but the term "way array" is seldom used.
Usually, way is disambiguated by context:
e.g.
* "the processor allows you to lock down an entire way" (of the entire cache, across all sets of the cache)
* versus "address X is in way0 of set Y of the cache"

= Alternate Notation =

The alternate terminology, which I have observed mainly from DEC folk,
is to say that a 4-way associative cache has 4 sets (corresponding to "way array" above),
and that the i-th way is the collection made up of the i-th element of each set,
where i is the index field of the address.

I prefer the other terminology, where "set" is unordered, as in mathematics.

Wednesday, August 04, 2010

Alphabet_Soup:_a_Collection_of_Microarchitectures#Basic_Pipelines

Attempting to describe some microarchitectures in text-only email to a friend inspires me to post this. It would be much better to have lots of drawings illustrating precisely what I am talking about; but this textual notation is fairly compact, can be understood by some people, and in some ways, because it underspecifies certain configurations, allows us to talk without getting bogged down in multiple different ways of interconnecting the blocks. I.e. sometimes drawing is TOO specific.

= Basic Pipelines =

Let's start out with in-order pipelines
 InO
and out-of-order pipelines
 OoO

A typical in-order pipeline might look like
 InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
with pipestage names
* BP = branch prediction
* I$ = instruction cache
* D = decode
* RD = register read
* X = execute
* L1$ = L1 data cache
* WB = writeback

A prototypical out-of-order pipeline might look like
 OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
with additional pipestages
* RR = register rename
* S = schedule
* IW = instruction window - ROB, etc.

Where are the PRF (physical register file) reads?

On a P6-style read-before-schedule & capture machine:
 OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW

On a HaRRM (1987) or Wmt-style read-after-schedule machine:
 OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW

I am not going to go into details about this, or the ROB/PRF, because that's not my topic of discussion here.

Actually, the execution units may be superscalar, or they may share the same single port.
I could contrive ways of drawing this, but again it's not my real point.
Similarly, the L1$ is often accessed in parallel with the first few cycles of X.
I might depict this as
  X || L1$
or just
  X L1$
rather than
  X -> L1$

In fact, eliding the arrows altogether makes things less cluttered.
The pipelines might then look like
 InO = BP I$ D RD X L1$ WB
 OoO = BP I$ D RR S X RD L1$ IW

and it allows us to throw other stuff on, such as the L2$
 InO = BP I$ D RD X L1$ WB       L2$
 OoO = BP I$ D RR S X RD L1$ IW    L2$

We can stretch the notation to indicate store queues, etc.; it is nice to use the DEC/AMD term LSU (load/store unit), even though Intel tends to split them up.

= SMT =

SMT, simultaneous multithreading, like Intel Hyper-Threading, runs multiple threads on the same out-of-order pipeline.
 SMT2(OoO) = SMT2(BP I$ D RR S X RD L1$ IW)    L2$
Not indicated: the fact that I$ is shared, but renamer is replicated, etc.

We might also have SMT in-order machines, like Intel's first generation Atom:
 SMT2(InO) = SMT2(BP I$ D RD X L1$ WB)       L2$


By comparison, we can emphasize the single threaded nature of the original pipelines:
 ST(OoO) = ST(BP I$ D RR S X RD L1$ IW)    L2$
 ST(InO) = ST(BP I$ D RD X L1$ WB)       L2$


= Multicluster =

The original multicluster microarchitectures that I am familiar with created wider superscalar machines by replicating the execution units.  They were multicluster in that they did not have uniform bypassing: some might bypass quickly within a cluster, but impose an extra cycle when bypassing to other clusters; some might not permit inter-cluster bypassing at all, except perhaps through special inter-cluster instructions.


We can attempt to draw this in ascii art as

 MC(OoO) = BP I$ D RR S X RD L1$ IW    L2$
                        X

 MC(InO) = BP I$ D RD X L1$ WB         L2$
                      X

but I will instead draw it as


 MC(OoO) = BP I$ D RR S MC2(X) RD L1$ IW    L2$
 MC(InO) = BP I$ D RD MC2(X) L1$ WB         L2$
                  
The question always arises as to how much is replicated in the clusters:
 MC(X)
 MC(X RD)
 MC(S X)
 MC(S X RD)
 MC(S X RD L1$)
This last will be revisited in MCMT.

Execution units can be shared between clusters, like the FPU in AMD Bulldozer.
It's a pain to depict this in ascii art, so I won't bother here.

Should the renamer be inside the clusters?  Depends. It's easier to maintain single thread semantics if the renamer is outside, and similarly for the LSU:

 MC(OoO) = BP I$ D RR MC2(S X RD L1$) LSU IW    L2$

But since the S X LSU loop is as critical as the S X L1$ loop, it is natural to put it inside. But then coordinating the two clusters becomes a pain.  You might, e.g., copy all stores to both clusters' LSUs.
I don't know how to draw that in ascii art.

 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU) IW    L2$

You start down the slippery slope of complexity if you have an L1 LSU inside the cluster for speed,
and an L2 LSU outside the cluster to coordinate things.
Note that the LSU2 need not have any data.
 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU1) LSU2 IW    L2$

We'll assume that IW is outside the cluster in the single threaded case.


= MCMT =

Once you have multiple clusters, you can run multiple threads on them.
One thread per cluster is natural. It permits the nice simplification of not having to do bypasses between clusters.

 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$) IW    L2$

With separate threads, an LSU inside the cluster is natural
 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$ LSU) IW    L2$
but the stuff talked about above may apply.

For separate threads, it might be natural to put IW inside the cluster.  But IW, instruction window management such as the ROB, is not in the critical loop.  If you put IW outside, you might be able to share it non-uniformly, e.g. giving simple threads 1/4 of the IW, and complex threads 3/4.
I call one way of doing this a [[segmented sequential instruction window]].


= MCMT and SMT =

You could make each cluster itself SMT:

 MCMT(SMT(OoO)) = BP I$ D RR MCMT2(SMT(S X RD L1$ LSU)) IW    L2$

although now you start driving the ifetch pretty hard.

= Speeding Up Single Threads =

SMT and MCMT are ways of supporting multiple threads. It would be nice if they could speed up single threads.

MC is used for single threads.  However, it is a challenge to do scheduling for truly decoupled MC microarchitectures.  One additional cycle of bypassing can be afforded, but it gets harder the more decoupled the clusters are.

My old, old, name for using multiple threads to speed up a single thread was IMT, implicit multithreading.

SpMT, speculative multithreading, is the form of IMT I am most familiar with; it executes contiguous sections of the instruction stream, such as loop bodies and regions separated by CALLs and RETurns.
There are other, more fine grained, forms of IMT.

IMT and SpMT need to run on a multithreaded substrate.
Above we have described two multithreaded substrates
 SMT
 MCMT
with the obvious addition of
 MP
Any of which can be InO or OoO. MCMT can be built with SMT inside the clusters.
And MP, multiprocessor, separate cores, can be built out of single threaded cores, multithreaded cores, etc.
 MP =
     MP(ST)
     MP(SMT)
     MP(MCMT)
     MP(MCMT(SMT))

You can therefore run, or at least think about running, IMT or SpMT on any of these multithreaded substrates:
   SpMT(SMT)
   SpMT(MCMT)
      SpMT(MCMT(SMT))
   SpMT(MP)
     SpMT(MP(ST))
     SpMT(MP(SMT))
     SpMT(MP(MCMT))
     SpMT(MP(MCMT(SMT)))
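
The nesting in this notation is just function-like composition of wrappers around an inner pipeline description. A throwaway C sketch (the helper and the strings are mine, not a real tool) to make that concrete:

 #include <stdio.h>

 /* A base pipeline, in the arrow-less notation used above. */
 #define OOO "BP I$ D RR S X RD L1$ IW"

 /* Wrap an inner pipeline description in a named wrapper, e.g. SMT2(...). */
 static void wrap(char *out, size_t outsz, const char *wrapper, const char *inner)
 {
     snprintf(out, outsz, "%s(%s)", wrapper, inner);
 }

 int main(void)
 {
     char smt2[64], mcmt2[96], mp[128], spmt[160];

     wrap(smt2,  sizeof smt2,  "SMT2",  OOO);    /* SMT2(OoO)                  */
     wrap(mcmt2, sizeof mcmt2, "MCMT2", smt2);   /* MCMT2(SMT2(OoO))           */
     wrap(mp,    sizeof mp,    "MP",    mcmt2);  /* MP(MCMT2(SMT2(OoO)))       */
     wrap(spmt,  sizeof spmt,  "SpMT",  mp);     /* SpMT(MP(MCMT2(SMT2(OoO)))) */

     printf("%s\n", spmt);
     return 0;
 }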

Overall, SpMT(MP(ST)) is hard.  It is hard to overcome the inter-core latencies when forking threads.

SpMT(SMT) is doable - indeed, it is Haitham Akkary's DMT.
However, SMT is often limited to only 2 threads, which really does not give you much opportunity to
speed things up with SpMT. And the speculative threads tend to interfere with the non-speculative thread. First, do no harm.
::SpMT(SMT2) - low potential

I motivated MCMT as a way of building a multithreaded microarchitecture for SpMT, that might be able to support more threads.  Alternately, I motivated MCMT by itself, without SpMT, as something intermediate between SMT and MP.

SpMT(MCMT(ST)) is okay, but there is still probably overhead to go between clusters, felt on forking.
By the way, SpMT(MCMT(ST)) probably has lots of other stuff added to preserve sequential semantics.
Won't discuss that here, although I will mention that I am fond of log based execution.

SpMT(MCMT(SMT2)) is a potential sweet spot.  I think the normal operating mode would be with one thread per cluster, but the SMT2 allows a "runt thread" to be used to fork a new thread off the non-speculative thread, or off a less speculative thread.  The runt thread could cycle steal transferring state from cluster to cluster, and thus minimize slowing down the non-speculative thread.

SpMT(MP) is hard, as mentioned above.

But SpMT(MP(SMT2)) is another sweet spot - with the runt thread trick again played to help forking.

So, we have two attractive configurations:
 SpMT(MCMT(SMT2))
and
 SpMT(MP(SMT2))
with the inner SMT2 used to speed up forking.

SpMT(MCMT(SMT2)) is probably smaller,
but  SpMT(MP(SMT2))
makes better use of the many, many, cores we have nowadays.

Combining both in
 SpMT(MP(MCMT(SMT2)))
is imaginable,
but it is more complex than doing just
SpMT(MP(SMT2)).
Not worth doing unless there is some big win.
Or unless you want to fill a product line hole with MCMT(SMT2)
- e.g. something larger than a single core ST or SMT2(ST), but smaller than MP2.

= Multilevel Schedulers, etc. =

TBD: S2 S1 schedulers, etc.

TBD: batch scheduling across clusters versus fine grain de-interleaved scheduling.


= See =

Wiki at
http://semipublic.comp-arch.net/wiki/Alphabet_Soup:_a_Collection_of_Microarchitectures#Basic_Pipelines

Saturday, July 31, 2010

Dual Tablet PC

I love my tablet PC.  But...

Actually, I prefer reading articles and academic publications on paper.
But it is a hassle to print papers downloaded from the web.
And there is always the hope that one day decent annotation software will arise.

Anyway, reading papers on my tablet PC is infinitely preferable to reading papers on my desktop system.
The latter is work, and literally a pain in the neck.
The former is a pleasure.

But: when I used to read papers at school in the library,
I would have the journal or the paper on one side of the desk,
and my notebook in the other.

Problem: I only have one tablet PC.  Splitting the screen spatially or temporally between the paper I am reading and my notebook software is a pain.

What I want is a dual tablet PC.  One PC, or at least one interface, that has two tablet surfaces.

Possibly attached.

Possibly one e-paper optimized for reading, and a second dynamic for writing.

Or possibly 2 displays that fold open like a book.  (TBD: I have a link to such a product proposal.)

But even more likely two independent tablet surfaces.  Separately positionable.  I often used to read books with my notebook or notepaper oriented perpendicular.

It might be acceptable, or even desirable, to tether the two displays by a cable.  Just like I like tethering my pen: it prevents them from getting lost or separated.

But it might equally well be desirable to link the different display surfaces by radio or other wireless.

(Funny idea: body area networking, through the user body, for the link.)

They could possibly be separate PCs, but I want them coupled, e.g. so that cut and paste from one tablet to the other can work.

---

Not just a dual tablet PC.  A multi tablet PC.  But dual would be a good start.

---

Blogged at http://andyglew.blogspot.com/2010/07/dual-tablet-pc.html

Thursday, July 29, 2010

Outlook late and repeated reminders

Outlook just flashed a reminder, telling me I had missed several meetings by 2 days to 22 hours.

Second time it has done it today, for the same meetings.

There was a crash in between.

Annoying.