The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, August 30, 2010

Memory Transaction Topology

* wiki: http://semipublic.comp-arch.net/wiki/Memory_Transaction_Topology
* blogged, comp.arch posted, hbc mailing list.

Let us discuss the simple shape or topology of memory transactions,
from the point of view of a simple interface, with a processor on one side,
and memory on the other.

When I have drawn such diagrams in the past, I have often used the vertical axis to indicate the passing of time.
Here, drawing in ascii, I will not bother to do so.

= Simple Memory Accesses =

;Simple read
* processor sends an address to memory
P ---A---> M
* sooner or later, memory responds with the read data
P <---D--- M

;Simple write
* processor sends address and data to memory
P ---AD--> M
Acknowledgement may or may not be necessary; it may be [[store and forget]].

The store address and data may be multiplexed on the same wires, as is roughly indicated above.
Or they may be in parallel.

* processor sends address and data to memory
P -+-A-+-> M
   +-D-+

Or they might even be two separate transactions, joined at the memory

P ---A---> M
P ---D---> M

= Interleaved Memory Accesses =

;Interleaved read

Interleaved reads may be satisfied in the original order
P ---A1---> M
P ---A2---> M
P <---D1--- M
P <---D2--- M

Or out of order. If out of order, then some sort of transaction ID is needed:
P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M

Simple writes, with address and data available simultaneously, do not provide an opportunity for interleaving:
P ---A1D1--> M
P ---A2D2--> M

However, separate address and data transfers do:
P ---A1---> M
P ---A2---> M
P ---D1---> M
P ---D2---> M

or out of order:
P ---A1---> M
P ---A2---> M
P ---D2---> M
P ---D1---> M

We usually assume that the address is sent first, but this need not be the case.
P ---A1---> M
P ---D2---> M
P ---D1---> M
P ---A2---> M
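The need for transaction IDs in the out-of-order case can be sketched in code. This is a toy model, not any real bus protocol: each read request carries an ID, and the processor matches each data reply back to its request by that ID, whatever order replies arrive in.

```python
# Toy sketch (not any real protocol): each read request carries a
# transaction ID; data replies may return in any order, and the
# processor matches a reply to its request by ID.
class Processor:
    def __init__(self):
        self.next_id = 0
        self.pending = {}           # id -> address awaiting data

    def send_read(self, addr):
        tid = self.next_id
        self.next_id += 1
        self.pending[tid] = addr    # P ---A(tid)---> M
        return tid

    def receive_data(self, tid, data):
        addr = self.pending.pop(tid)  # match reply to request by ID
        return addr, data

p = Processor()
t1 = p.send_read(0x100)
t2 = p.send_read(0x200)
# Memory replies out of order: D2 before D1.
assert p.receive_data(t2, "D2") == (0x200, "D2")
assert p.receive_data(t1, "D1") == (0x100, "D1")
```

If replies always returned in request order, the pending dictionary could be a simple FIFO and no ID would need to cross the interconnect.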

Usually, however, interleaving or reordering of stores refers to the order of stores as observed between different processors. TBD.

= Burst Accesses =

Memory transfers often occur in cache line sized bursts, frequently 4 or 8 clock cycles, sometimes more.
A burst read:
P ---A------> M
P <---DDDD--- M

Stores may have address and data near to each other:
P ---ADDDD--> M

Decoupled burst store address and store data is also a possibility, but is not common.

= Multiple Transaction Sizes =
Thus we have several different transaction sizes:

;Address Only
P ---A------> M

;Burst Read Data Reply
P <---DDDD--- M

;Burst Store
P ---ADDDD--> M

;Various non-burst or partial
in 8b, 16b, 32b, 64b, 128b, ... sizes
as well as possibly misaligned transfers
P <---D--- M

= Read-Modify-Write =

The following read/write transaction is usually associated with atomic read-modify-write operations (e.g. the GUPS benchmark):
P ---AopD---> M
store new data value
return old
P <---D------ M

Note that this can also be done in combination with a cached value:
P $ ------------AopD---> M
     op                   op
     store                store new data value
     ignore (or check)    return old
P $ <-----------D------- M

== Modify Write ==

It is possible to have a modify-write transaction, where an operation is performed at memory, but the old value is not returned:
P ---AopD---> M
store new data value
X <-- M

A read-modify without the write is even less useful; it really amounts to having an implicit operand that does not need to be transferred.

= Masked Write, including Burst Masked Write =

It is common to have a bitmask specifying which bytes should be written:
P ---AmD---> M
The mask is so small that it is often part of the command.

Often a read-modify-write is necessary at the receiving end, e.g. for ECC. Nevertheless, that RMW need not cross the interconnect.
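A minimal sketch of why the receiving end may need a read-modify-write: only the masked bytes cross the interconnect, but if the memory keeps an error check code over the whole word (a stand-in checksum here, not a real ECC algorithm), the controller must read the old word, merge the masked bytes, recompute the code, and write back.

```python
# Hypothetical sketch: a masked write updates only the bytes selected
# by the bitmask.  If memory keeps an error code over the whole word
# (ecc() below is a stand-in, not a real ECC), the controller must
# read the old word, merge, and write back -- an RMW at the memory
# end, even though only AmD crossed the interconnect.
def masked_write(old_word: bytes, mask: int, new_word: bytes) -> bytes:
    return bytes(
        new_word[i] if (mask >> i) & 1 else old_word[i]
        for i in range(len(old_word))
    )

def ecc(word: bytes) -> int:        # stand-in for a real ECC code
    return sum(word) & 0xFF

old = bytes([0x11, 0x22, 0x33, 0x44])
new = bytes([0xAA, 0xBB, 0xCC, 0xDD])
merged = masked_write(old, 0b0101, new)   # write bytes 0 and 2 only
assert merged == bytes([0xAA, 0x22, 0xCC, 0x44])
ecc_after = ecc(merged)  # recomputed over the whole merged word
```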

Less commonly there is a masked read, e.g. for side-effectful memory.

There can also be burst masked writes, where the mask may approach and surpass a data chunk width.
P ---AmDDDD---> M

With [[Bitmask consistency]], burst masked writes become the norm.
Reads that indicate which bytes are an absolute requirement,
whether burst masked reads or otherwise, may also be useful.

= Scatter/Gather =

Rather than separate non-burst reads
P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M
... or writes, or RMWs, we might take advantage of burst transfers, even though not necessarily burst memory (DRAM) accesses:

P ---A1.A2.A3.A4---> M
P <---D1.D2.D3.D4--- M

P ---A1D1.A2D2.A3D3.A4D4---> M

P ---A1opD1.A2opD2.A3opD3.A4opD4---> M
P <----------D1.D2.D3.D4------------ M

For the RMW case we might constrain the same op to be used everywhere:
P --op-A1D1.A2D2.A3D3.A4D4---> M
P <----------D1.D2.D3.D4------------ M

= BLUESKY: Reducing the Transaction Types in a Scatter/Gather Memory Subsystem =

; read
P ---A---> M
;gather read
P ---A1.A2.A3.A4.A5.A6---> M

; write
P ---op.A.mask.DDDD---> M
; scatter
P ---op.A1D1.A2D2.A3D3---> M

; reply
P <---id.DDDD--- M

If A and D chunks are the same size,
* the "normal" burst is 6 chunks long (A.mask.DDDD)
* with 6 gather read addresses
* or 3 scatter write or RMW address/data pairs

We might even do away with partial reads (---A--->).
Possibly make all replies look like writes.
; reply
P <---op.A.mask.DDDD--- M

Further, we might compress the values transferred.
Compression does not help fixed size transactions with fixed size memory cache lines,
except for power.
But compression does help fixed size cache lines with variable length transfers.
And compression will allow more scatter/gather requests to fit in a single packet.

In general, we would assume that in a multistage interconnect, s/g packets will be broken up and routed s/g element by element.

Circuit switched, e.g. optical, works against this.
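The packet-size arithmetic above can be sketched directly. This assumes, as the text does, that A and D chunks are the same size and a packet carries 6 chunks; the function names are illustrative.

```python
# Sketch under assumptions from the text: address (A) and data (D)
# chunks are the same size, and a packet carries 6 chunks.  A gather
# read then fits 6 addresses per packet; a scatter write fits 3
# address/data pairs.
CHUNKS_PER_PACKET = 6

def pack_gather(addresses):
    """Split a gather-read address list into packets of 6 addresses."""
    return [addresses[i:i + CHUNKS_PER_PACKET]
            for i in range(0, len(addresses), CHUNKS_PER_PACKET)]

def pack_scatter(pairs):
    """Split (address, data) pairs into packets of 3 pairs (6 chunks)."""
    per = CHUNKS_PER_PACKET // 2
    return [pairs[i:i + per] for i in range(0, len(pairs), per)]

assert pack_gather(list(range(8))) == [[0, 1, 2, 3, 4, 5], [6, 7]]
assert pack_scatter([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]) == \
    [[(0, 'a'), (1, 'b'), (2, 'c')], [(3, 'd')]]
```

Compression of the packed values, as suggested above, would let more than 6 addresses or 3 pairs fit in the same fixed-size packet.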

Thursday, August 26, 2010

New variant of old Windows graphics "window contour trails" bug

This morning, just now, encountered a new variant of an old bug.

I have often run into what I call the Windows contour trail bug: when dragging windows around the screen, they leave a trail of old, partially overwritten, window contents. Looking rather like mouse trails. As if the window was being redrawn every few fractions of a second, each time overwriting the drawing of the window at its old position, and the background was never replacing the overwritten window contents at the old position. Fills the screen with rather pleasant abstract art.

Unfortunately, a reboot usually seems to be necessary.

Today, same Windows contour trail problem - but on only one monitor. I have 4 monitors connected to that machine. The laptop LCD panel, and 3 USB display adapters. Only one of the monitors, connected to only one of the USB display adapters, has gone buggy.

Which gives a few hints: it's not a problem with the overall graphics display infrastructure. Apparently just with one display driver.

... While writing this, apparently fixed itself. Possibly related to going in and out of standby. Possibly related to Windows Explorer desktop shell hanging on a network drive.


All sorts of thoughts about micro-rebooting running through my head.

Wednesday, August 25, 2010

Hint instructions - CompArch

Hint instructions -

: "Hint instructions"


Oftentimes the hardware, the processor (or memory subsystem, or ...) microarchitecture,
wishes to know information that the programmer or compiler might already know.

Similarly, oftentimes the compiler wishes to know something that the programmer might already know.

Conversely, oftentimes the compiler wishes to know something that the hardware might already know,
if the program has already been executed several times. This last is my cute way of trying to fold
[[profiling feedback]] into this discussion.

Such questions might include:

* "Is this conditional branch more likely to be taken, or not?"
* "Is this load likely to forward from an earlier store, and if so, which?"
* "Is this load likely to be a cache miss?"

TBD: create a canonical list of operations for which hints have been proposed.

= Hints to compiler =

Computer architecture folklore, that probably has a basis in fact (TBD, get reference):
the original FORTRAN compiler allowed the programmer to say which alternative (of its 3-way arithmetic IFs) was most likely.
I.e. it allowed the programmer to give a hint to the compiler.
The story goes on to say that the human programmers were usually wrong.

In any case, there are often ways of conveying hints to the compiler. #pragmas are one example.
[[Profiling feedback]] is another.

Having received a compiler hint, the compiler can change code generation, e.g. the order of basic blocks. Or it might choose to implement the [[Hint instructions]] to hardware described here.

GLEW OPINION: hints from human to compiler should allow at least two representations: (1) inline, and (2) separate.


It should be possible to provide such hints in the source code, as close as possible to where it applies:

#pragma IF-THEN probability = 90%

I am agnostic as to whether the hints or #pragmas should be before the statement, or inside the statement

#pragma IF-THEN probability = 90%
#pragma IF-THEN probability = 10%

However, it should be noted that designing a pleasant and unambiguous set of #pragmas is a hard language design issue in the first place.

I am NOT agnostic about the following: there should be at least two representations: (1) inline, and (2) separate.

Sometimes the programmer does not have the ability to modify the source code.
Then it is necessary to be able to specify the hints separately,
e.g. on the command line
cc -pragma "foo.cc line 22 IF 90%" foo.cc
or in a separate file
foo.cc line 22 /if( c1 < c2 )/ probability 90%
bar.cc line 19 /.../ probability 10%

It is, of course, clumsy to specify the location where hints should apply in a separate file.
Line numbers, possibly with regexps to apply when the line numbers change by editing, may be required.

But it is necessary.
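A minimal sketch of how such a separate hint file could be applied. The file format, names, and re-anchoring strategy are all illustrative assumptions, not any real compiler's: each hint names a line number plus a regexp, and the regexp re-anchors the hint when editing has moved the line.

```python
# Hypothetical sketch of a separate hint file: each line names a file,
# a line number, a regexp to re-anchor the hint if line numbers drift,
# and a probability.  The format is illustrative, not any real
# compiler's.
import re

def apply_hints(hint_lines, source_lines):
    """Return {line_number: probability} for each hint; if the stated
    line no longer matches the regexp, search for the new location."""
    hints = {}
    for h in hint_lines:
        m = re.match(r'(\S+) line (\d+) /(.*)/ probability (\d+)%', h)
        lineno, pattern, prob = int(m.group(2)), m.group(3), int(m.group(4))
        if re.search(pattern, source_lines[lineno - 1]):
            hints[lineno] = prob           # line number still accurate
        else:                              # re-anchor by regexp search
            for i, line in enumerate(source_lines, start=1):
                if re.search(pattern, line):
                    hints[i] = prob
                    break
    return hints

src = ["int main() {", "  if( c1 < c2 )", "    f();", "}"]
hints = apply_hints(["foo.cc line 2 /if\\( c1 < c2 \\)/ probability 90%"], src)
assert hints == {2: 90}
```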

= Hint Instruction Set Embeddings =

How can hints be embedded in the instruction set?

* Fields of non-hint instructions
** CON: takes bits away from instruction encodings that are already tight.
** e.g. BRANCH-AND-LINK, with coroutine versus call semantic hints

* Separate HINT instructions
** CON: takes instruction fetch bandwidth
** PRO: more bits available, and does not waste bits if hints are not used

* Separate HINT instruction stream
:: Just as I recommend that compiler hints can be placed inline versus out of line, it has been proposed by some researchers to have hints, and possibly other operations, in a separate instruction stream, paralleling the normal instruction stream. Just as with compiler hints in a separate file, issues arise as to how to establish the correspondence - typically by IP.

* Hint prefixes
On a variable length instruction set like x86, certain prefixes can be applied to existing instructions. If the prefix is originally a NOP, it can be used to provide HINT semantics.
Similarly suffixes, and other modifiers.
* PRO: the hint cost is paid only by the hint user
* CON: variable length instructions and prefixes are a pain to decode

See [[Call skip and hints]] for a proposal to hide hint instructions in the shadow of unconditionally taken branches, to reduce overhead with less complexity than a hint instruction stream.

Another intermediate form is to have hint instructions in the instruction stream, but in a form that allows them to be hoisted out of loops and far away.
E.g. instead of
HINT: next loop branch is taken 32 times
jnz loop
One might do
HINT: loop branch at IP=label is taken 32 times
label: jnz loop

This makes the trigger for the hint more complicated, but reduces the instruction decoding pressure.
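The hoisted form can be sketched as a small hint table consulted at decode time, keyed by the target IP. The table and its trigger are illustrative assumptions, not a description of real hardware:

```python
# Sketch: a hint hoisted out of the loop names its target by IP (here
# a label string), so the front end consults a small hint table at
# decode time instead of re-decoding a HINT every iteration.
hint_table = {}   # target IP -> hint payload

def execute_hint(target_ip, payload):
    hint_table[target_ip] = payload       # e.g. "taken 32 times"

def decode(ip, instruction):
    hint = hint_table.get(ip)             # trigger: IP match at decode
    return instruction, hint

execute_hint("label", {"taken_count": 32})   # hoisted, executes once
insn, hint = decode("label", "jnz loop")     # consulted every iteration
assert hint == {"taken_count": 32}
```

This is the trade the text describes: the one-time hint costs a table lookup at decode, instead of decode bandwidth on every loop iteration.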

= Hint Instruction Set Encodings =

Probably the most important principle in instruction set design is to trap undefined encodings, so that they can be used for new instruction set extensions on new machines, and so that those new instructions can be trapped and emulated on old machines.

However, a slightly less important principle is to, sometimes, reserve multiple flavors of NOP instructions.
Less frequently NOP operands.
Such NOPs can be reserved for hintability in future processors,
but are treated as NOPs by present processors.

    ;Anecdote, or, Giving Credit:
    Although the principle of reserving NOP HINT instruction set encodings was known during Intel P6,
    Ronny Ronen of Intel Israel took the next step:
    he realized that we could take several existing undefined instructions that took a #UD trap,
    and redefine them to be NOPs reserved for future HINTs.
    Although these new NOPs would have to be trapped and hinted,
    several years later most of the marketplace would be full of machines that treated them as NOPs
    - and at THAT future time they could be re-defined as HINT instructions,
    emulated efficiently as NOPs on older machines,
    and only trapped slowly on really old machines no longer common in the marketplace.

Note that it is important that, although they are treated as NOPs by the hardware, the compiler be discouraged from using any but the canonical NOP.
Should the compiler or programmer start using the different flavors of nop,
let's call them NOP_HINT_0 ... NOP_HINT_15 or similar,
to convey information (e.g. information to a human viewing an instruction trace in a debugger),
then it may be difficult to grant them specific hint behavior in the future.
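The reserved-NOP scheme can be sketched as follows. The encoding space is made up for illustration; the point is that one encoding is architecturally a NOP on old machines and a hint on new ones.

```python
# Sketch of the reserved-NOP scheme: an old machine executes
# NOP_HINT_0..15 as plain NOPs; a newer machine interprets the same
# encodings as hints.  Opcode values are hypothetical.
NOP_HINT_BASE = 0x90          # assumed encoding space, 16 flavors

def execute(opcode, machine_has_hints):
    if NOP_HINT_BASE <= opcode < NOP_HINT_BASE + 16:
        if machine_has_hints:
            return ("hint", opcode - NOP_HINT_BASE)  # hint flavor 0..15
        return ("nop", None)     # older machine: architecturally a NOP
    return ("other", opcode)

# Same encoding, different machines -- forward and backward compatible.
assert execute(0x93, machine_has_hints=False) == ("nop", None)
assert execute(0x93, machine_has_hints=True) == ("hint", 3)
```

This also shows why compilers must not assign their own meanings to NOP_HINT_0..15: the hardware's future interpretation of each flavor is already spoken for.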

= Prefetches and Hints =

Are prefetch instructions hints? It depends.

If prefetch instructions can be ignored, yes, then they are hints. You might choose to ignore a prefetch instruction if some other prediction mechanism overrides it.

If prefetch instructions have side effects, such as setting [[access or dirty bits in page tables]], or even taking [[page faults]],
then it is questionable whether prefetch instructions are hints.

If these side effects are optional, not required, then even such a [[prefetch instruction with architecturally visible side effects]] (*) can be considered hints. If these side effects are not hints, then the prefetch instruction really has significant non-hint consequences.

: * Note: WikiWish: I would like to have [[here labels]] - the ability to define a term inline, saying "here is the definition of a term", using some notation like the double curly braces used to define wikilinks. Without all of the formality of creating a wiki page or section. Sometimes the best definitions are given in context, rather than broken out.

= Branch Prediction Hints =

== Call indirect default target hints ==

* [[Call skip and hints]] and [[How_to_use_the_extra_operands_of_indirect_call_instructions#CALL-INDIRECT-WITH-SKIP|the discussion that led to it]]

= Security and Reliability are NOP HINTs =

It is possible to implement security and reliability instruction set extensions as NOP HINTs on older machines.
For example, Milo Martin's HardBound bounds checking that detects buffer overflow bugs.

Think about it: for a correctly written program with no buffer overflow bugs,
it should be possible to disable all such security checks, and the program run correctly.

Only incorrect programs would break if the security checks became NOPs.
(Or, malicious programs would break-in.)

Similar to
Milo Martin's HardBound bounds checking that detects buffer overflow bugs,
which is itself a mechanism for improving software reliability,
various RAS instruction set features can be treated similarly.

= Memory Ordering =

On a sequentially consistent memory ordering model, synchronization fences are hints or nops.

However, on weaker memory ordering models, they are not.

Tuesday, August 24, 2010

Welcome, Andy! | LinkedIn

My LinkedIn status: Hotchips is over. AMD Bulldozer was bittersweet. Godson, from China, looks good, looks like it will give the US a run for the money. IBM Power7 PERCS does proper scatter/gather with their GUPS remote atomic update instruction. Dark Silicon is the new buzzword. Explaining what Intellectual Ventures does 10 times a day strengthened my hope in IV's mission. Interacting with other computer architects feels good. Now back to work.

Editing How to use the extra operands of indirect call instructions - CompArch

Editing How to use the extra operands of indirect call instructions - CompArch: "How to use the extra operands of indirect call instructions"


= ISA Encoding Issues and Branches =

Instruction sets often support standard encodings with fixed fields for operands such as register sources and destinations, or [[immediate literal offset constant]] operands:
* OP: destreg := src1reg OP src2reg
* OP: destreg := src1reg OP immediate
* OP: destreg :OP= src1reg
* OP: destreg :OP= immediate
Control flow instructions usually don't fit this pattern. I.e. control flow instructions such as jumps don't really have a destination register, unless you count the program counter. Moreover, control flow instructions want to have a large scope - i.e. they want the immediate constant to be as broad as possible.

E.g while the following reuse of a standard format "fits":
* BRANCH pcreg += offset
and allows other register+=constant arithmetic to benefit, even on machines that normally have a three operand format (dest,src1,src2 reg or constant)
* ADD destreg += large-constant
it is even better to fold the bits used for destreg into the offset, and let the branch instructions have an implicit PC destination, or really src/dest for PC-relative branches.
* BRANCH even-larger-offset
:: pc += even-larger-offset

(This even-larger-offset is most easily used for unconditional branches. Conditional branches need some bits to specify either the register operand containing the condition, or the nature of the condition on a condition code ISA.)

This discussion of even-larger-offsets applies to "near" or "local" branches and CALLs - usually PC-relative. Indirect branches and indirect calls do not necessarily need an even-larger-constant operand. So the question arises: what instruction format should they have?

= Indirect Branch =

Probably simplest is to have a single register field format. For indirect branch:
:: pcreg := reg

We won't talk much about indirect branch after this. Single register field is good enough, except for possibilities such as
:: pcreg := offset (of a jump table) + reg
:: OR pcreg := offset (of a jump table) + reg<<scale
versus
:: tmpreg := load( offset-of-jump-table + reg )
:: pcreg := tmpreg
Note that these two flavors are especially suited to switch tables.

The former, INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE, computes an address and jumps to it. Typically the table is a table of jump instructions.
The scale factor allows more than a single instruction to be in each entry in the table.
* PRO: This table can be embedded in the instruction stream without issues of mixing code and data, since it contains instructions. It can be PC-relative.
* CON: consumes extra entries in the branch predictor. Less efficient use of bits, since there will be jump opcodes.

The latter, INDIRECT-BRANCH-TO-MEMORY-TARGET, isn't really a RISC instruction, since it does a load and a temporary-register indirect branch. You can always synthesize it by separate instructions. As for INDIRECT-BRANCH-TO-MEMORY-TARGET:
* PRO: smaller than separate instructions
* PRO: more efficient in use of space than INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE
* CON: not RISCy
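The two switch-table flavors can be contrasted in a toy model. Table contents and addresses are illustrative; the point is what each instruction indexes into.

```python
# Sketch contrasting the two switch-table flavors from the text.
# INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE jumps into a table of jump
# instructions; INDIRECT-BRANCH-TO-MEMORY-TARGET loads a target
# address from a data table and branches to it.
jump_instruction_table = ["jmp case0", "jmp case1", "jmp case2"]
target_address_table = [0x1000, 0x2000, 0x3000]

def branch_to_jump_table(base, reg, scale=0):
    # pcreg := base + reg<<scale ; lands on a jump instruction
    return jump_instruction_table[(base + (reg << scale)) - base]

def branch_to_memory_target(reg):
    # tmpreg := load(table + reg) ; pcreg := tmpreg
    return target_address_table[reg]

assert branch_to_jump_table(0, 2) == "jmp case2"
assert branch_to_memory_target(1) == 0x2000
```

The first lands on an instruction (so the table can live in the code segment); the second lands on data (smaller, but two architectural steps fused into one non-RISCy instruction).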

= Indirect CALLs =

Branch-and-link, a common RISCy form of a register indirect call, has two register fields:
* BRANCH-AND-LINK: linkreg := current-pc; pc := srcreg

However, calling conventions usually mandate a single fixed register for the link register. So it is suboptimal to waste bits specifying the link register:
* BRANCH-AND-LINK: fixed-link-register := current-pc; pc := srcreg

So now we have branch and link wanting to have a single register field. Similarly for a CALL instruction that pushes the old instruction pointer on a stack:
:: M[sp] := current pc
:: pc := srcreg
:: sp -= sizeof(pc)

(Note that such a call is transactional: you may not want to have modified the stack if the call fails, etc. Considerable ingenuity goes into creating CALL instructions with one, or the fewest possible, microinstructions.)

If you want to support a monadic register operand instruction, great. You might want to use it for other instructions as well.

However, there may be ways of using extra operands, particularly literal operands, for CALL indirect.

For example, many indirect CALLs are to the member functions of object classes. These are obtained by a vtable lookup:
* CALL reg+offset
:: tmpreg := load M[reg + small-offset]
:: pc := tmpreg

This is like INDIRECT-BRANCH-TO-MEMORY-TARGET. Difference: with branch indirect used for switch tables, the register tends to be a small offset into the table, whereas for method calls the register tends to be a full range address of a heap object.

The usual pros and cons:
* PRO: smaller than separate instructions
* CON: not RISCy
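The vtable lookup that a CALL reg+offset would fuse can be sketched with a toy object model. All names here are illustrative, not any particular ABI:

```python
# Sketch of the vtable lookup that CALL reg+offset fuses: reg holds
# the object address, small-offset selects the method slot.  A toy
# object model; names are illustrative.
class Obj:
    def __init__(self, vtable):
        self.vtable = vtable      # loaded via the object's header

def call_indirect(obj, slot):
    # tmpreg := load M[reg + small-offset] ; pc := tmpreg
    target = obj.vtable[slot]
    return target()

shape_vtable = [lambda: "draw", lambda: "area"]
o = Obj(shape_vtable)
assert call_indirect(o, 0) == "draw"
assert call_indirect(o, 1) == "area"
```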

Q: what else can we use extra operands on CALL indirect for?

= Default Indirect Target =

One of the biggest problems with indirect function calls (and to a lesser extent indirect branches) is target prediction.

On a machine with a short pipeline, you might only need a [[branch direction predictor]], taken/not-taken.
You might be able to get away without a [[BTB]], if the decoder arrives soon enough after instruction fetch,
since for most control flow transfers the branch target (or call target) is a constant, embedded in the instruction stream.

Not so for indirect transfers: the indirect branch or call target is not in the instruction stream.
Even if it is already known, it is hard to access.

On machines with deeper pipelines, [[BTB]]s (branch target buffers) are needed so that instruction fetch can be resteered even though the branch target from the instruction stream has not been decoded. But there is still an advantage to having the branch or call target in the instruction stream: the decoder can detect a BTB miss, and resteer, without waiting for execution.

Trouble is, every indirect branch or call needs a BTB entry.

Most instruction sets admit PC-relative or CS-relative or absolute function addresses.
Things are made even worse on instruction sets, such as the old IBM 360, where most CALLs are base register relative
- implying nearly all branches need a BTB entry.

But the transition to object oriented code makes this worse. Method calls, virtual function calls, are indirect, and hence require BTB entries.

What is especially frustrating is that oftentimes we know what the most probable indirect branch or CALL target is.
So we will proceed to discuss how to convey this knowledge across the instruction set to the branch prediction hardware.

== Hint ==

The most obvious way is to use HINT instructions. E.g.
:HINT: call target = F00

The hint instruction can be scheduled in advance.

== Front-end Registers and Prepare-to-Branch ==

Some instruction sets try to give the compiler explicit control by providing a set of front-end registers. You move an indirect branch or call target to such a register, and then branch. Can be separated. Unfortunately, often hard to separate: consider a pointer chasing loop calling a virtual function.

== Other Tricks ==

In the P6 (Pentium Pro) Code Optimization Handbook I recommended that switch table code be generated as follows:

:JMP (reg)
:default-target follows jump instruction

less ideally:
:JMP (reg)
:JMP default-target

First, because there was no HINT instruction in x86 at that time.

Second, more important, because at least instruction fetch would have prefetched the default target in the former case,
while decoder shortstop branch prediction might do the right thing in the latter case.

Arguably, these are better than HINT instructions, since they are not executed if branch prediction is working.
They only kick in if branch prediction is not working.

Unfortunately, these techniques cannot be used for indirect CALL.
It is impractical to rearrange function code to fall inline after the call instruction, especially if separately compiled.
And you cannot do

:CALL (reg)
:JMP default-target


:CALL (reg)
:CALL default-target

because the function return will return to the instruction after the original CALL.

It is tempting to try to cram a default target into the CALL instruction as an extra operand. But that's like a hint: it wastes bits if predicting correctly.

Here's a slightly less wasteful alternative: give CALL the ability to say that the next few instructions should be skipped on return:

:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ])
:CALL default-target
or


:BRANCH-AND-LINK: (reg),skip
:: linkreg := pc + skip
:: pc := reg (or ...)
:SOMETHING-WITH default-target
:: could be a hint, or a branch, or ...

This is not like x86 RETn: that specifies a stack offset. This is a code offset.

It is better to do this at CALL site than RETurn site, since it doesn't affect calling conventions.

Let's call this [[CALL-INDIRECT-WITH-SKIP]]. For that matter, we could have [[BRANCH-AND-LINK-INDIRECT-WITH-SKIP]], although the latter is not really necessary.

The skip is really probably only a few instructions. I.e. it is not a very big extra operand. Maybe it should be full width: this gives us a sort of threaded code style.
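A small simulation makes the return-address arithmetic concrete. This is a sketch under the assumptions above: the pushed return address is the call's next PC plus the skip, so the default-target code in the call shadow is never reached on the architectural return path.

```python
# Simulation sketch of CALL-INDIRECT-WITH-SKIP: the pushed return
# address is past the call shadow, so the default-target instructions
# sitting there execute only on misprediction, never on return.
def call_indirect_with_skip(state, target_reg, skip):
    state["stack"].append(state["pc"] + 1 + skip)  # return past shadow
    state["pc"] = state[target_reg]                # pc := reg

def ret(state):
    state["pc"] = state["stack"].pop()

state = {"pc": 10, "r1": 100, "stack": []}
call_indirect_with_skip(state, "r1", skip=2)   # 2 shadow instructions
assert state["pc"] == 100                      # now in the callee
ret(state)
assert state["pc"] == 13    # returned past the shadow (10 + 1 + 2)
```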

:META-COMMENT: what we are seeing here is that, in some situations, it is better to move HINT instructions out of the code. In this case, in the code stream, but in a place where it would not normally be executed except by mis-speculation. Actually, in all cases of correct prediction, hint instructions are wasteful. There have been proposals to move hint and other meta-instructions and directives completely out of line, to a code stream paralleling the original code stream. But this has complexity. Hiding hints after unconditionally taken branches in this manner is easier to manage.

== Object Class Indication ==

Here's another way of using the extra operands for indirect CALLs.

Oftentimes, objects with virtual functions go to a single common place, that can be predicted early - perhaps at compile time.
However, with separate compilation and linking, the virtual functions may not really be known until link time.
Still, we can talk about having linkers patch the code to provide the possible hints.

But in some situations the virtual functions are not known at link time.
I.e. object inheritance is genuinely being used:
polymorphic objects with different virtual functions for the same method may be in use.

If the objects really are randomly and dynamically polymorphic, there isn't much we can do.

But oftentimes, polymorphism is being used, but is restricted or semi-constant. E.g. you might use a factory to create different polymorphic objects for an interface, but thereafter all instances of that object would be the same. I.e. the default virtual function target is not known at compile or link time, but is constant for the life of the program for objects of that class.

BTBs handle this, but at the cost of one BTB entry per virtual function call.

Here's a slightly offbeat idea, slightly new to me: it may be advantageous for a compiler to inject a level of indirection:

Instead of
:CALL (reg)offset // using virtual function table
:CALL (reg)offset // using virtual function table
:CALL (reg)offset // using virtual function table

using a separate BTB entry for each

it may be better to do:

:JMP local-virtual-function-proxy-offset // using virtual function table
:JMP local-virtual-function-proxy-offset // using virtual function table

:JMP local-virtual-function-proxy-offset // using virtual function table

:local-virtual-function-proxy-offset: // using virtual function table
:: CALL (reg)offset

* PRO: a single BTB entry for the actual CALL
* CON: BTB entries for the jumps
** PRO: the jumps can be decoder predicted if the BTB misses
* CON: would have to arrange for all indirect sites to use same register

It may be possible to use the extra operand of an indirect CALL to accomplish the same thing.

Basically, we are trying to establish equivalence classes that might be used to reduce BTB pressure.

The most obvious thing is to put an "equivalence class number" into the indirect CALL, and use that to index an "equivalence class BTB" at decode time.

However, [[CALL-INDIRECT-WITH-SKIP]] accomplishes this more flexibly:

:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ])
:HINT: use same branch (target) prediction as some other instruction

Two new ideas: CALL-INDIRECT-WITH-SKIP and hint to use same prediction as some other instruction.
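The equivalence-class idea can be sketched as a predictor table keyed by class number instead of call-site IP. The structures are illustrative, not any real predictor:

```python
# Sketch of the "equivalence class BTB" idea: each indirect CALL
# carries a class number; all sites in a class share one predicted
# target, so one BTB-style entry covers every call site in the class.
class_btb = {}                     # class number -> predicted target

def predict(call_class):
    return class_btb.get(call_class)

def update(call_class, actual_target):
    class_btb[call_class] = actual_target

# Three virtual-call sites, same class: one entry serves all three.
update(7, 0x4000)                  # first resolution trains the class
assert predict(7) == 0x4000        # every other site in class 7 hits
assert len(class_btb) == 1
```

With a per-site BTB, the same three call sites would cost three entries; the class number is what establishes the sharing.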

Saturday, August 21, 2010

Dell Streak gets leaked Android 2.1 update in the UK, but still the same ol' 1.6 in the US -- Engadget

Dell Streak gets leaked Android 2.1 update in the UK, but still the same ol' 1.6 in the US -- Engadget: "Dell Streak gets leaked Android 2.1 update in the UK, but still the same ol' 1.6 in the US"

I haven't been so excited about a device in years.

But, not so excited that I don't want to hold off, either until the 2.1 update, or until I can actually try the Streak. The complaints about its lousy performance -- 3 seconds to refresh a screen :-( -- are quite offputting.

Monday, August 16, 2010

Editing Email user interfaces - Andy Glew's Wiki

= A User Interface for Just a Few Tags =

Here's what I think I need in order to process email efficiently. Rather than Gmail's single checkbox per message, I want a small number of checkboxes, typically associated with user-configurable labels:

(Mockup of the message list omitted; each row had several checkbox columns, e.g. "Keep (Important)".)
I would drag my usual labels or folders into the rightmost columns. I can handle a few such at a time, via checkboxes.
I would scan the full list, clicking checkboxes as I go,
and then hit "Perform".

Sure, there might still be less frequently used labels and folders. But the most important half dozen are immediately visible.

= Sorting, not Searching =

Much as I like using the pen, it is still too annoying to click through my email in time order.

To really catch up on my email,
I gave up on Gmail, and fell back to Thunderbird's IMAP client.
Nice that it allows me to handle my email, and then switch back to gmail.

The key to catching up on such an overflowing Inbox is to sort my messages.
E.g. sort by "From:". And then quickly go through all of the email from frequent flyer programs, and file it as pseudo-spam.
Similarly for certain subjects.

Basically, such sorting amounts to simple, ad-hoc, filtering.
This is rather like an extension of one of my favorite sayings:
    The most important thing is to be able to automatically filter your email.
    The next most important thing is to NOT have to write the automatic filters. :-me 2010

Actually, alphabetic sorting is not really what I want. Important single item topics get sandwiched between big groups of pseudo-spam.

Instead, what I want in the user interface is
* GROUP BY something like "From:" or "subject:"
* SORT GROUPS BY number of items in the group
* sort within the group by date, etc.
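The requested query can be sketched in a few lines. The message schema here is an illustrative assumption:

```python
# Sketch of the requested UI query: GROUP BY From:, order groups by
# size (big pseudo-spam clumps first), then by date within each group.
from collections import defaultdict

def group_inbox(messages):
    groups = defaultdict(list)
    for msg in messages:
        groups[msg["from"]].append(msg)       # GROUP BY From:
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [sorted(g, key=lambda m: m["date"]) for g in ordered]

inbox = [
    {"from": "airline", "date": 3}, {"from": "boss", "date": 1},
    {"from": "airline", "date": 2}, {"from": "airline", "date": 5},
]
g = group_inbox(inbox)
assert [m["from"] for m in g[0]] == ["airline"] * 3   # biggest group first
assert [m["date"] for m in g[0]] == [2, 3, 5]          # date order within
assert g[1] == [{"from": "boss", "date": 1}]
```

The single important message from "boss" surfaces as its own small group rather than being sandwiched between frequent-flyer pseudo-spam.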

Naming and not naming

Mottoes and Sayings - Andy Glew's Wiki: "The most important thing is to be able to give things names.
The next most important thing is to not be required to give things names.
-me 2010 (actually, originally probably 1995)"

Monday, August 09, 2010

Apple holds a conference of shame - The Inquirer

Apple holds a conference of shame - The Inquirer: "Next up Jobs lied and said that other phones had similar problems. He even put up a slide comparing the Iphone 4 with other smartphones that work, such as the Blackberry."

I was poking around The Inquirer, clicking through to this somewhat old link, when I realized:

I have a Blackberry.

I have been plagued with random errors when I hold it in my hand. My LEFT hand. Errors that don't occur when I hold it in my right hand, or leave it on a desk to use as a speaker phone. Errors that manifest as the phone going blank, as if there was a short or a loose battery connection. Errors that basically make this particular phone - hmph, I can't easily tell what model it is - useless as a handset. Errors that IT cannot easily reproduce - I wonder now if it is because they are right handed.

(I am right handed, but I do many things left handed. Including holding my cell phone.)

Most likely I really do just have a loose battery or case. Can't tell if it is a systematic design flaw without gathering much more data.

But would it not be funny if Jobs was right?

Sunday, August 08, 2010

Graffiti For Android Gets Updated But Has a Bug | Gear Diary

Graffiti For Android Gets Updated But Has a Bug | Gear Diary: "Graffiti for Android"

I have long talked about

* how fond I am of my various pen based systems - W4P Compaq Concerto, Toshiba Portege 3500 and M400, various Palm PDAs, even my old Motorola Ming A1200 and/or Microsoft AT&T<>

* how I miss Palm graffiti

* how I am interested in the Dell streak

* how I am interested in Android

These things I want may be coming together: Graffiti for Android is available.

I'm not sure whether "handwriting" with a fingertip will be less pleasant than with a pen/stylus. But I am looking forward to trying.

Thursday, August 05, 2010

Chrome/Adobe PDF reader errors

I love the Chrome browser, but, in combination with Adobe Reader, it frequently hangs reading PDFs: "There is an error reading the document", click OK, and the same error again.

Probably an Adobe error, but nevertheless the fact that it works with Firefox and IE but not Chrome looks bad.

Table indexing - CompArch

Table indexing - CompArch: "Table indexing
When I say that 'a table is indexed by a value V', I do not mean that all of the bits of V are used to form an address. Although that could be done, for something like a 64 bit address it would require a very large table.
Rather, I mean that the value V is used to determine what entry in the table is to be used. Some bits of V may be used as a number, to index the table. I call that RAM indexing. It may also use a hash of V.
Some or all of the bits of V may be matched against addresses already in the table. This is called CAM addressing.
RAM indexing and CAM indexing may be combined. E.g. RAM indexing may determine a set containing N tags, which are compared against the other bits of V to determine which entry within the set (often called which way of the set) should match.
For that matter, I also call it CAM indexing if bits of V are used to determine 1 of N bits, and that 1 of N code is compared to multiple entries in a table. See Decoded CAM versus Encoded CAM.
Thus, the term 'indexing' is general, not limited to the specific form of RAM indexing"
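To make the distinction concrete, here is a minimal Python sketch of the indexing styles described above, assuming toy table sizes and field widths of my own choosing:

```python
def ram_index(v, table, index_bits=3):
    """RAM indexing: some bits of V directly number a table entry."""
    return table[v & ((1 << index_bits) - 1)]

def cam_index(v, entries):
    """CAM indexing: V is matched against the tags already stored in
    the table; every entry is compared."""
    for tag, data in entries:
        if tag == v:
            return data
    return None                     # no entry matched

def set_assoc_index(v, sets, index_bits=3):
    """Combined: RAM-index a set, then CAM-match the remaining bits
    (the tag) against the entries (ways) of that set."""
    index = v & ((1 << index_bits) - 1)
    tag = v >> index_bits
    return cam_index(tag, sets[index])

# Hypothetical 8-set table with a single valid entry.
sets = [[] for _ in range(8)]
sets[5].append((0x12, "hello"))                  # V = (0x12 << 3) | 5
print(set_assoc_index((0x12 << 3) | 5, sets))    # -> hello
```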

Cache Ways and Sets: opposing definitions

I have observed the following confusion in terminology wrt the meanings of "way" and "set" in a cache. I am most familiar with the terms as used at Intel, and earlier than that at Motorola and Gould.
I have encountered the alternate definition mainly from people with [[DEC]] backgrounds,
although I am not sure that it is endemic.

First off, both camps agree on what is a 4-way associative cache.

E.g. let us discuss a 4-way associative, 64KiB cache, with 64B cache lines
and a 64 bit physical address, where
* bits 0-5 are the *offset* within the cacheline
* bits 6-15 are the *index* within the cache
* the remaining bits, 16-63, are the tag

= My Terminology =

In the Intel terminology, the 10 index bits 6-15 select one of 1024 (2^10) "sets" within the cache.

Each set holds 4 cache lines. The tag bits of the address requested are compared to the tag bits of the 4 ways within the cache. If there is a match, the data for that way is accessed.

I.e. there are 1024 sets.  Each set holds an arbitrary subset of the 2^48 possible cachelines that map to that set, because they have that set's particular index field.

A way is one of the 4 elements within a set.
Although a set is an unordered collection,
we number the ways for reference:
 { set[index].way0, set[index].way1, set[index].way2, set[index].way3 }

A way can also refer to the collection of all of the ways of a particular number,
for all sets of all indexes.
It might be more correct to refer to this as a "way array"
 way0_array = { set[i].way0 } for all i = 0..2^10-1

but the term "way array" is seldom used.
Usually, way is disambiguated by context:
* "the processor allows you to lock down an entire way" (of the entire cache, across all sets of the cache)
* versus "address X is in way0 of set Y of the cache"
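The example above can be modeled in a few lines of Python, using the Intel-style terminology (1024 sets of 4 ways); this is a sketch of the terminology, not of any real hardware implementation:

```python
OFFSET_BITS, INDEX_BITS, NUM_WAYS = 6, 10, 4
NUM_SETS = 1 << INDEX_BITS          # 1024 sets

def split_address(pa):
    """Decompose a physical address into (tag, index, offset)."""
    offset = pa & ((1 << OFFSET_BITS) - 1)        # bits 0-5
    index = (pa >> OFFSET_BITS) & (NUM_SETS - 1)  # bits 6-15
    tag = pa >> (OFFSET_BITS + INDEX_BITS)        # bits 16-63
    return tag, index, offset

# Each set is an unordered collection of (tag, data) entries; the way
# number is merely the position we assign for reference.
cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

def lookup(pa):
    tag, index, _ = split_address(pa)
    for way, entry in enumerate(cache[index]):
        if entry is not None and entry[0] == tag:
            return way, entry[1]    # hit: which way of the set matched
    return None                     # miss
```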

= Alternate Notation =

The alternate terminology, which I have observed mainly from DEC folk,
is to say that a 4-way associative cache has 4 sets (corresponding to "way array" above),
and that the i-th way is the collection of the i-th element from each such set,
where i is the index field of the address.

I prefer the other terminology, where "set" is unordered, as in mathematics.

Wednesday, August 04, 2010


Attempting to describe some microarchitectures in text-only email to a friend inspires me to post this. It would be much better to have lots of drawings illustrating precisely what I am talking about; but this textual notation is fairly compact, can be understood by some people, and in some ways, because it underspecifies certain configurations, allows us to talk without getting bogged down in multiple different ways of interconnecting the blocks. I.e. sometimes drawing is TOO specific.

= Basic Pipelines =

Let's start out with in-order pipelines
and out-of-order pipelines

A typical in-order pipeline might look like
 InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
with pipestage names
* BP = branch prediction
* I$ = instruction cache
* D = decode
* RD = register read
* X = execute
* L1$ = L1 data cache
* WB = writeback

A prototypical out-of-order pipeline might look like
 OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
with additional pipestages
* RR = register rename
* S = schedule
* IW = instruction window - ROB, etc.

Where are the PRF reads?

On a P6-style read-before-schedule & capture machine:
 OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW

On a HaRRM (1987) or Wmt-style read-after-schedule machine:
 OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW

I am not going to go into details about this, or the ROB/PRF, because that's not my topic of discussion here.

Actually, the execution units may be superscalar, or they may share the same single port.
I could contrive ways of drawing this, but again it's not my real point.
Similarly, the L1$ is often accessed in parallel with the first few cycles of X.
I might depict this as
  X || L1$
or just
  X L1$
rather than
  X -> L1$

In fact, eliding the arrows altogether makes things less cluttered.
A typical in-order pipeline might look like
 InO = BP I$ D RD X L1$ WB
 OoO = BP I$ D RR S X RD L1$ IW

and it allows us to throw other stuff on, such as the L2$
 InO = BP I$ D RD X L1$ WB       L2$
 OoO = BP I$ D RR S X RD L1$ IW    L2$

We can stretch the notation to indicate store queues, etc. It is nice to use the DEC/AMD term LSU (load-store unit), even though Intel tends to split them up.

= SMT =

SMT, simultaneous multithreading, like Intel Hyperthreading, runs multiple threads on the same out-of-order pipeline.
 SMT2(OoO) = SMT2(BP I$ D RR S X RD L1$ IW)    L2$
Not indicated: the fact that I$ is shared, but renamer is replicated, etc.

We might also have SMT in-order machines, like Intel's first generation Atom:
 SMT4(InO) = SMT4(BP I$ D RD X L1$ WB)       L2$

By comparison, we can emphasize the single threaded nature of the original pipelines:
 ST(OoO) = ST(BP I$ D RR S X RD L1$ IW)    L2$
 ST(InO) = ST(BP I$ D RD X L1$ WB)       L2$
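Since the notation composes, it can even be generated mechanically; in this Python sketch the helper names `pipe` and `wrap` are my own invention, while the stage names are the ones used above:

```python
def pipe(*stages):
    """A pipeline is just an ordered sequence of stage names."""
    return " ".join(stages)

def wrap(label, inner):
    """Wrap a pipeline (or part of one) in a threading/cluster label."""
    return f"{label}({inner})"

InO = pipe("BP", "I$", "D", "RD", "X", "L1$", "WB")
OoO = pipe("BP", "I$", "D", "RR", "S", "X", "RD", "L1$", "IW")

print(wrap("SMT2", OoO))   # SMT2(BP I$ D RR S X RD L1$ IW)
print(wrap("SMT4", InO))   # SMT4(BP I$ D RD X L1$ WB)
```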

= Multicluster =

The original multicluster microarchitectures that I am familiar with created wider superscalar machines by replicating the execution units.  They were multicluster in that they did not have uniform bypassing: some might bypass quickly within a cluster, but impose an extra cycle bypassing to other clusters; some might not permit bypassing at all, except perhaps through special inter-cluster instructions.

We can attempt to draw this in ascii art as

 MC(OoO) = BP I$ D RR S X RD L1$ IW    L2$

 MC(InO) = BP I$ D RD X L1$ WB         L2$

but I will instead draw it as

 MC(OoO) = BP I$ D RR S MC2(X) RD L1$ IW    L2$
 MC(InO) = BP I$ D RD MC2(X) L1$ WB         L2$
The question always arises as to how much is replicated in the clusters:
 MC(S X)
 MC(S X RD L1$)
This last will be revisited in MCMT.

Execution units can be shared between clusters, like the FPU in AMD Bulldozer.
It's a pain to depict this in ascii art, so I won't bother here.

Should the renamer be inside the clusters?  Depends. It's easier to maintain single thread semantics if the renamer is outside, and similarly the LSU:

 MC(OoO) = BP I$ D RR MC2(S X RD L1$) LSU IW    L2$

But since the S X LSU loop is as critical as the S X L1$ loop, it is natural to put the LSU inside. But then coordinating the two clusters becomes a pain.  You might, e.g., copy all stores to both clusters' LSUs.
I don't know how to draw that in ascii art.

 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU) IW    L2$

You start down the slippery slope of complexity if you have an L1 LSU inside the cluster for speed,
and an L2 LSU outside the cluster to coordinate things.
Note that the LSU2 need not have any data.
 MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU1) LSU2 IW    L2$

We'll assume that IW is outside the cluster for single thread semantics.

= MCMT =

Once you have multiple clusters, you can run multiple threads on them.
One thread per cluster is natural. It permits the nice simplification of not having to do bypasses between clusters.

 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$) IW    L2$

With separate threads, an LSU inside the cluster is natural
 MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$ LSU) IW    L2$
but the stuff talked about above may apply.

For separate threads, it might be natural to put IW inside the cluster.  But IW, instruction window management such as ROB, is not in the critical loop.  If you put IW outside, you might be able to share non-uniformly, e.g. giving simple threads 1/4 of the IW, and complex 3/4.
I call one way of doing this a [[segmented sequential instruction window]].

= MCMT and SMT =

You could make each cluster itself SMT:

 MCMT2(SMT2(OoO)) = BP I$ D RR MCMT2(SMT2(S X RD L1$ LSU)) IW    L2$

although now you start driving the ifetch pretty hard.

= Speeding Up Single Threads =

SMT and MCMT are ways of supporting multiple threads. It would be nice if they could speed up single threads.

MC is used for single threads.  However, it is a challenge to do scheduling for truly decoupled MC microarchitectures.  One additional cycle of bypassing can be afforded, but it gets harder the more decoupled the clusters are.

My old, old, name for using multiple threads to speed up a single thread was IMT, implicit multithreading.

SpMT, speculative multithreading, is the form of IMT I am most familiar with.
There are other forms of IMT - more fine grained than SpMT - which execute contiguous sections of the instruction stream, such as loop bodies and regions separated by CALLs and RETurns.

IMT and SpMT need to run on a multithreaded substrate.
Above we have described two multithreaded substrates, SMT and MCMT,
with the obvious addition of MP.
Any of which can be InO or OoO. MCMT can be built with SMT inside the clusters.
And MP, multiprocessor, separate cores, can be built out of single threaded cores, multithreaded cores, etc.
 MP2(ST(OoO)) = 2 x ST(OoO)

You can therefore run, or at least think about running, IMT or SpMT on any of these multithreaded substrates:
* SpMT(SMT)
* SpMT(MCMT)
* SpMT(MP)
or combinations thereof, such as SpMT(MP(SMT2)).
Overall, SpMT(MP(ST)) is hard.  It is hard to overcome the inter-core latencies when forking threads.

SpMT(SMT) is doable - indeed, it is Haitham Akkary's DMT.
However, SMT is often limited to only 2 threads, which really does not give you much opportunity to
speed things up with SpMT. And the speculative threads tend to interfere with the non-speculative thread. First, do no harm.
::SpMT(SMT2) - low potential

I motivated MCMT as a way of building a multithreaded microarchitecture for SpMT, that might be able to support more threads.  Alternately, I motivated MCMT by itself, without SpMT, as something intermediate between SMT and MP.

SpMT(MCMT(ST)) is okay, but there is still probably overhead to go between clusters, felt on forking.
By the way, SpMT(MCMT(ST)) probably has lots of other stuff added to preserve sequential semantics.
Won't discuss that here, although I will mention that I am fond of log based execution.

SpMT(MCMT(SMT2)) is a potential sweet spot.  I think the normal operating mode would be with one thread per cluster, but the SMT2 allows a "runt thread" to be used to fork off from the non-speculative thread, or from a less speculative thread.  The runt thread could cycle steal transferring state from cluster to cluster, and thus minimize slowing down the non-speculative thread.

SpMT(MP) is hard, as mentioned above.

But SpMT(MP(SMT2)) is another sweet spot - with the runt thread trick again played to help forking.

So, we have two attractive configurations:
 SpMT(MCMT(SMT2))
 SpMT(MP(SMT2))
with the inner SMT2 used to speed up forking.

SpMT(MCMT(SMT2)) is probably smaller,
but  SpMT(MP(SMT2))
makes better use of the many, many, cores we have nowadays.

Combining both in SpMT(MP(MCMT(SMT2))) is imaginable,
but it is more complex than doing just SpMT(MP(SMT2)).
Not worth doing unless there is some big win.
Or unless you want to fill a product line hole with MCMT(SMT2)
- e.g. something larger than a single core ST or SMT2(ST), but smaller than MP2.

= Multilevel Schedulers, etc. =

TBD: S2 S1 schedulers, etc.

TBD: batch scheduling across clusters versus fine grain de-interleaved scheduling.

= See =

Wiki at