Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, August 24, 2010

Welcome, Andy! | LinkedIn

My LinkedIn status: Hotchips is over. AMD Bulldozer was bittersweet. Godson, from China, looks good, looks like it will give the US a run for the money. IBM Power7 PERC does propert scatter/gather with their GUPS remote atomic update instruction. Dark Silicon is the new buzzword. Explaining what Intellectual Ventures does 10 times a day strengthened my hope in IV's mission. Interacting with other computer architects feels good. Now back to work

Editing How to use the extra operands of indirect call instructions - CompArch

Editing How to use the extra operands of indirect call instructions - CompArch: "How to use the extra operands of indirect call instructions"


http://semipublic.comp-arch.net/wiki/How_to_use_the_extra_operands_of_indirect_call_instructions

= ISA Encoding Issues and Branches =

Instruction sets often support standard encodings with fixed fields for operands such as register sources and destinations, or [[immediate literal offset constant]] operands:
* OP: destreg := src1reg OP src2reg
* OP: destreg := src1reg OP immediate
* OP: destreg :OP= src1reg
* OP: destreg :OP= immediate
Control flow instructions usually don't fit this pattern. I.e. control flow instructions such as jumps don't really have a destination register, unless you count the program counter. Moreover, control flow instructions want to have a large scope - i.e. they want the immediate constant to be as broad as possible.

E.g while the following reuse of a standard format "fits":
* BRANCH pcreg += offset
and allows other register+=constant arithmetic to benefit, even on machines that normally have a three operand format (dest,src1,src2 reg or constant)
* ADD destreg += large-constant
it is even better to fold the bits used for destreg into the offset, and let the branch instructions have an implicit PC destination, or really src/dest for PC-relative branches.
* BRANCH even-larger-offset
:: pc += even-larger-offset

(This even-larger-offset is most easily used for unconditional branches. Conditional branches need sme bits to specify either the register operand containing the condition, or the nature pf the condition on a condition code ISA.

This discussion of even-larger-offsets applies to "near" or "local" branches and CALLs - usually PC-relative. Indirect branches and indirect calls do not necessarily need an even-larger-constant operand. So the question aises as to what instruction format should they have?

= Indirect Branch =

Probably simplest is to have a single register field format. For indirect branch:
* INDIRECT-BRANCH: reg
:: pcreg := reg

We won't talk any more much about indirect branch much after this. Single register field is good enough, except for possibilities such as
* INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE: reg, offset
:: pcreg := offset (of a jump table) + reg
:: OR pcreg := offset (of a jump table) + reg<versus
* INDIRECT-BRANCH-TO-MEMORY-TARGET: reg, offset
:: tmpreg := load( offset-of-jump-table + reg )
:: pcreg := tmpreg
Note that these two flavors are especially suited to switch tables.

The former, INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE, computes an address and jumps to it. Typically the table is a table of jump instructions.
The scale factor allows more than a single instruction to be in each entry in the table.
* PRO: This table can be embedded in the instruction stream without issues of mixing code and data, since it contains instructions. It can be PC-relative.
* CON: consumes extra entries in the branch predictor. Less efficient user of bits, since there will be jump opcodes.

The latter, INDIRECT-BRANCH-TO-MEMORY-TARGET, isn't really a RISC instruction, since it does a load and a temporary-register indirect branch. You can always synthesize it by separate instructions. As for INDIRECT-BRANCH-TO-MEMORY-TARGET
* PRO: smaller than separate instructions
* PRO: more efficient in use of space than INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE
* CON: not RISCy

= Indirect CALLs =

Branch-and-link, a common RISCy form of a register indirect call has two register fields.:
* BRANCH-AND-LINK: linkreg := current-pc; pc := srcreg

However, calling conventions usually mandate a single fixed register for the link register. So it is suboptimal to waste bits specifying the link register:
* BRANCH-AND-LINK: fixed-link-register := current-pc; pc := srcreg

So now we have branch and link wanting to have a single register field. Similarly for a CALL instruction that pushes the old instruction pointer on a stack:
* BRANCH-AND-LINK: srcreg
:: M[sp] := current pc
:: pc := srcreg
:: sp -= sizeof(pc)

(Note that such a call is transactional: you may not want to have modified the stack if the calructions as well.l fails, etc. Considerable ingenuity goes into creating CALL instructions with one, or the fewest possible, microinstructions.)

If you want to support a monadic register operand instruction, great. You might want to use it for other instructions as well.

However, there may be ways of using extra operands, particularly literal operands, for CALL indirect.

For example, many indirect CALLs are to the member functions of object classes. These are obtained by a vtable lookup:
* CALL reg+offset
:: tmpreg := load M[reg + small-offset]
:: pc := tmpreg

This is like BRANCH-INDIRECT-TO-MEMORY. Difference: with branch indirect used for switch tables, the register tends to be a small offset into the table, whereas for method calls the register tends to be a full range address of a heap object.

The usual pros and cons:
* PRO: smaller than separate instructions
* CON: not RISCy



Q: what else can we use extra operands on CALL indirect for?



= Default Indirect Target =

One of the biggest problems with indirect function calls (and to a lesser extent indirect branches) is target prediction.

On a machine with a short pipeline, you might only need a [[branch direction predictor]], taken/not-taken.
You might be able to get away without a [[BTB]], if the decoder arrives soon enough after instruction fetch,
since for most control flow transfers the branch target (or call target) is a constant, embedded in the instruction stream.

Not so for indirect transfers: the indirect branch or call target is not in the instruction stream.
Even if it is already known, it is hard to access.

On machines with deeper pipelines, [[BTB]]s (branch target buffers) are needed so that instruction fetch can be resteered even though the branch target from the instruction stream has not been decoded. But there is still an advantage to having the branch or call target in the instruction stream: the decoder can detect a BTB miss, and resteer, without waiting for execution.

Trouble is, every indirect branch or call needs a BTB entry.

Most instruction sets admit PC-relative or CS-relative or absolute function addresses.
Things are made even worse on instruction sets, such as the old IBM 360, where most CALLs are base register relative
- implying nearly all branches need a BTB entry.

But the transition to object oriented code makes this worse. Method calls, virtual function calls, are indirect, and hence require BTB entries.

What is especially frustrating is that oftentimes we know what the most probable indirect branch or CALL target is.
So we will proceed to discuss how to convey this knowledge across the instruction set to the branch prediction hardware.

== Hint ==

The most obvious way is to use HINT instructions. E.g.
:HINT: call target = F00
:BRANCH-AND-LINK: (reg)

The hint instruction can be scheduled in advance.

== Front-end Registers and Prepare-to-Branch ==

Some instruction sets try to give the compiler explicit control by providing a set of front-end registers. You move an indirect branch or call target to such a register, and then branch. Can be separated. Unfortunately, often hard to separate: consider a pointer chasing loop calling a virtual function.

== Other Tricks ==

In the P6 (Pentium Pro) Code Optimization Handbook I recommended that switch table code be generated as follows:

ideally:
:JMP (reg)
:default-target follows jump instruction

less ideally:
:JMP (reg)
:JMP default-target



First, because there was no HINT instruction in x86 at that time.

Second, more important, because at least instruction fetch would have prefetched the default target in the former case,
while decoder shortstop branch prediction might do the right thing in the latter case.

Arguably, these are better than HINT instructions, since they are not executed if branch prediction is working.
They only kick in if branch prediction is not working.


Unfortunately, these techniques cannot be used for indirect CALL.
It is impractical to rearrange function code to fall inline after the call instruction. especially if separately compiled.
And you cannot do

:CALL (reg)
:JMP default-target

or

:CALL (reg)
:CALL default-target

because the function return will return to the instruction after the original CALL.


It is tempting to try to cram a default target into the CALL instruction as an extra operand. But that's like a hint: it wastes bits if predicting correctly.


Here's a slightly less wasteful alternative: give CALL the ability to say that the next few instructions should be skipped on return:

:CALL-INDIRECT-WITH-SKIP (reg), skip
:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ]
:CALL default-target
:skip:

or

:BRANCH-AND-LINK: (reg),skip
:: linkreg := pc + skip
:: pc := reg (or ...)
:SOMETHING-WITH default-target
:: could be a hint, or a branch, or ...
:skip:

This is not like x86 RETn: that specifies an stack offset. This is a code offset.

It is better to do this at CALL site than RETurn site, since it doesn't affect calling conventions.

Let's call this [[CALL-INDIRECT-WITH-SKIP]]. or that matter, could have [[BRANCH-AND-LINK-INDIRECT-WITH-SKIP]], although the latter is not really necessary.

The skip is really probably only a few instructions. I.e. it is not a very big extra operand. Maybe it should be full width: this gives us a sort of threaded code style.

:META-COMMENT: what we are seeing here is that, in some situations, it is better to move HINT instructions out of the code. In this case, in the code stream, but in a place where it would not normally be executed except by mis-speculation. Actually, in all cases of correct prediction, hint instructions are wasteful. There have been proposals to move hint and other meta-instructions ad directives completely out of line, to a code stream paralleling the original code stream. But this has complexity. Hiding hints after unconditionally take branches in this manner is easier to manage.




== Object Class Indication ==

Here's another way of using the extra operands for indirect CALLs.

Oftentimes, objects with virtual functions go to a single common place, that can be predicted early - perhaps at compile time.
However, with separate compilation and linking, the virtual functions may not really be known until link time.
Still, we can talk about having linkers patch the code to provide the possible hints.

But in some situations the virtual functions are not known at link time.
I.e. object inheritance is genuinely being used:
polymorphic objects with different virtual functions for the same method may be in use.

If the objects really are randomly and dynamically polymorphic, there isn't much we can do.

But oftentimes, polymorphism is being used, but is restricted or semi-constant. E.g. you might use a factory to create different polymorphic objects for an interface, but thereafter all instances of that object would be the same. I.e. the default virtual function target is not known at compile or link time, but is constant for the life of the program for objects of that class.

BTBs handle this, but at the cost of one BTB entry per virtual function call.

Here's a slight;y offbeat idea, slightly new to me: It may be advantageous for a compiler to inject a level of indirection:

Instead of
:...
:CALL (reg)offset // using virtual function table
:...
:CALL (reg)offset // using virtual function table
:...
:CALL (reg)offset // using virtual function table
...

using a separate BTB entry for each


it may be better to do:


:...
:JMP local-virtual-function-proxy-offset // using virtual function table
:...
:JMP local-virtual-function-proxy-offset // using virtual function table

:JMP local-virtual-function-proxy-offset // using virtual function table
:...

with
:local-virtual-function-proxy-offset: // using virtual function table
:: CALL (reg)offset

* PRO: a single BTB entry for the actual CALL
* CON: BTB entries for the jumps
** PRO: the jumps can be decoder predicted if BTB missing
* CON: would have to arrange for all indirect sites to use same register

It may be possible to use the extra operand of an indirect CALL to accomplish the same thing.

Basically, we are trying to establish equivalence classes that might be used to reduce BTB pressure.

The most obvious thing is to put an "equivalence class number" into the indiect CALL, and use that to index an "equibalence class BTB" at decde time.

However [[CALL-INDIRECT-WITH-SKIP]] accomplishes this more flexibly

:CALL-INDIRECT-WITH-SKIP (reg), skip
:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ]
:HINT: use same branch (target) prediction as some other instruction
:skip:

Two new ideas: CALL-INDIRECT-WITH-SKIP and hint to use same prediction as some other instruction.