ulfs - my User Level FileSystem.
wiki: http://wiki.andy.glew.ca/wiki/Ulfs_-_my_user_level_filesystem
blog: ...
Have wanted to code this up for a long time.
Never could, because anything I did belonged to my employer.
Want it open source, so that I can use it for my personal stuff - wikis, version control.
With my current employment agreement I can work on open source, so starting.
Today got the most trivial put, get, and delete coded & tested.
Not much, but it's a start.
I want a user level filesystem, with at least two bindings
* UNIX command line
* Perl
I imagine that I will eventually want python, javascript, etc.
TBD: generate such via swig?
I want command line, because I'm a UNIX old fart. I still think of the shell as a good way to compose operations, e.g. via pipes.
I want scripting languages like Perl, because much code will be in there.
But mainly because I will implement at least first version in Perl.
Storage types:
* ordinary UNIX (or Windows) hierarchical filesystem tree
* archive formats, like tar
* possibly VC tools, like git, hg, bzr, ...
** delta storage
* possibly databases, even relational
* possibly hybrid filesystem/database
: any of the above may have flavors - e.g. will have metadata in the ordinary filesystem
Filesystem Names
* use the name specified by the user
* allocate and return a name
** sequentially
** content based hashing
*** handling collisions
* metadata queries
:: all of the above, whether explicit, implicit, or content based
* versioning
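The content-based naming scheme above, with collision handling, can be sketched as follows. This is a hypothetical illustration only, not the actual ulfs code; the class and method names are made up. Identical content gets the same name; a genuine collision (distinct content, same hash prefix) gets a sequence-number suffix:

```python
import hashlib

class ContentNamer:
    """Hypothetical sketch: allocate filesystem names from a hash
    of the content. Identical content maps to the same name; a
    collision (different content, same hash prefix) is resolved by
    appending a sequence number."""
    def __init__(self, prefix_len=8):
        self.prefix_len = prefix_len
        self.names = {}  # name -> content

    def allocate(self, content: bytes) -> str:
        base = hashlib.sha256(content).hexdigest()[:self.prefix_len]
        name, seq = base, 0
        # Walk past names already taken by *different* content.
        while name in self.names and self.names[name] != content:
            seq += 1
            name = f"{base}-{seq}"
        self.names[name] = content
        return name
```

With a full SHA-256 the collision branch should never fire in practice; a short prefix, as here, makes the collision-handling path actually testable.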
Command Line Interface:
mkfs:
ulfs -mkfs test.fs -type ordinary
ulfs -mkfs test.fs.tar -type tar
my $ulfs = ULFS_Ordinary::mkfs("test.fs");
my $ulfs = ULFS_Tar::mkfs("test.tar.fs");
my $ulfs = ULFS::openfs("test.fs");
basic operations
ulfs -fs test.fs -to testdata1-in-fs < testdata1 -put
ulfs -fs test.fs -from testdata1-in-fs > testdata1.retrieved -get
ulfs -fs test.fs -from testdata1-in-fs -delete
$ulfs->put(-to=>"testdata1-in-fs", -input=>"testdata1");
$ulfs->get(-from=>"testdata1-in-fs", -output=>"testdata1.retrieved");
$ulfs->delete(-from=>"testdata1-in-fs");
other operations
copy
move/rename
mkdir
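The operation set above can be sketched as a toy, dict-backed filesystem. This is a hypothetical illustration of the API surface, not the Perl implementation; the real thing would back these operations with a directory tree, tar archive, or database:

```python
class ToyUlfs:
    """Hypothetical in-memory sketch of the ulfs operations:
    put, get, delete, copy, and move/rename."""
    def __init__(self):
        self.files = {}  # name -> content bytes

    def put(self, to, data):
        self.files[to] = data

    def get(self, frm):
        return self.files[frm]

    def delete(self, frm):
        del self.files[frm]

    def copy(self, frm, to):
        self.files[to] = self.files[frm]

    def rename(self, frm, to):
        # pop removes the old name and rebinds the content
        self.files[to] = self.files.pop(frm)
```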
As is my wont, some thoughts:
;Predicates
If I try to use this as a net or web based filesystem, I may want to have predicates specified by the client
executed by the server. E.g. compare-and-swap: put this file, but only if the old version's content is ...
Operations like rmdir are really simple predicates: only remove the name if that name refers to an (empty) directory.
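A compare-and-swap put of the kind described might look like the following sketch. The interface is hypothetical; the point is that the predicate (old content hashes to the expected value) is evaluated at the server before the write is applied:

```python
import hashlib

def cas_put(store: dict, name: str, new_data: bytes,
            expected_old_hash: str) -> bool:
    """Hypothetical compare-and-swap put: write new_data under name
    only if the current content's SHA-256 matches expected_old_hash.
    A missing file is treated as empty content."""
    old = store.get(name, b"")
    if hashlib.sha256(old).hexdigest() != expected_old_hash:
        return False  # predicate failed: the file changed under us
    store[name] = new_data
    return True
```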
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
Monday, September 06, 2010
Friday, September 03, 2010
apenwarr - Business is Programming: "The sad evolution of wikis"
Amen.
Wikipedia is not wiki.
But... I'm contributing. I want comp-arch.net to be a controlled wiki - in part because I've already been spammed, but also because I eventually want to make it into a book. Or at least the wiki equivalent of one.
Also because I have specific contractual requirements. Certain things I can write now, but not publish for a few years. Because I'm lazy, I want to write them on the wiki now, and have them go public later.
Hence my half-assed semipublic/public comp-arch.net. And my dreams of a more complete solution.
The original poster said:
How do you create a vibrant community, but allow for private topics and discussion, but allow for public topics and discussion, and allow me to work for more than one company at a time with multiple private discussions, and have my WikiWords always end up pointing where they're supposed to?
I have no idea.
I actually have quite a few ideas along this direction.
I actually have arranged, in my employment contract, the legal right to work on such, even while employed.
Unfortunately, getting the time to work on this is my problem. What time I have, I tend to spend on comp.arch or comp-arch.net. I.e. content. Improving infrastructure requires blocks of time I only get on vacation.
Monday, August 30, 2010
Memory Transaction Topology
* wiki: http://semipublic.comp-arch.net/wiki/Memory_Transaction_Topology
* blogged, comp.arch posted, hbc mailing list.
Let us discuss the simple shape or topology of memory transactions,
from the point of view of a simple interface, with a processor on one side,
and memory on the other.
When I have drawn such diagrams in the past, I have often used the vertical axis to indicate the passing of time.
Here, drawing in ASCII, I will not bother to do so.
= Simple Memory Accesses =
;Simple read
* processor sends an address to memory
P ---A---> M
* sooner or later, memory responds with the read data
P <---D--- M

;Simple Write
* processor sends address and data to memory
P ---AD--> M
Acknowledgement may or may not be necessary; it may be [[store and forget]].
The store address and data may be multiplexed on the same wires, as is roughly indicated above.
Or they may be in parallel.
* processor sends address and data to memory
P -+-A-+-> M
+-D-+ M
Or they might even be two separate transactions, joined at the memory
P ---A---> M
P ---D---> M
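Joining two separate transactions at the memory end, as above, implies memory-side bookkeeping: whichever of the address and data arrives first must wait for its partner. A hypothetical sketch (the tag-matching scheme is an assumption, not any particular protocol):

```python
class WriteJoiner:
    """Hypothetical memory-side joining of separately sent store
    address and store data transactions. Assumes the two halves of
    a store carry a common tag; whichever half arrives first is
    buffered until the other shows up, then commit() is called."""
    def __init__(self):
        self.addr = {}  # tag -> buffered address
        self.data = {}  # tag -> buffered data

    def on_address(self, tag, addr, commit):
        if tag in self.data:
            commit(addr, self.data.pop(tag))
        else:
            self.addr[tag] = addr

    def on_data(self, tag, data, commit):
        if tag in self.addr:
            commit(self.addr.pop(tag), data)
        else:
            self.data[tag] = data
```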
= Interleaved Memory Accesses =
;Interleaved read
Interleaved reads may be satisfied in the original order
P ---A1---> M
P ---A2---> M
P <---D1--- M
P <---D2--- M

Or out of order. If out of order, then some sort of transaction ID is needed:

P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M

Simple writes with address and data available simultaneously do not provide opportunity for interleaving:

P ---A1D1--> M
P ---A2D2--> M
However, separate address and data transfers do:
P ---A1---> M
P ---A2---> M
P ---D1---> M
P ---D2---> M
or out of order:
P ---A1---> M
P ---A2---> M
P ---D2---> M
P ---D1---> M
We usually assume that the address is sent first, but this need not be the case.
P ---A1---> M
P ---D2---> M
P ---D1---> M
P ---A2---> M
Usually, however, interleaving or reordering of stores refers to the order of stores as observed between different processors. TBD.
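The out-of-order reply cases above are why a transaction ID must travel with each reply. A minimal sketch of the requester-side bookkeeping (hypothetical, not any particular bus protocol):

```python
class Requester:
    """Hypothetical sketch: match out-of-order read replies to
    outstanding requests by transaction ID."""
    def __init__(self):
        self.next_id = 0
        self.outstanding = {}  # transaction id -> address

    def send_read(self, addr):
        tid = self.next_id
        self.next_id += 1
        self.outstanding[tid] = addr
        return tid

    def receive_reply(self, tid, data):
        # KeyError on an unknown or duplicate transaction ID
        addr = self.outstanding.pop(tid)
        return addr, data
```

Without the ID, replies would have to come back in request order, as in the in-order diagram.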
= Burst Accesses =
Memory transfers often occur in cache line sized bursts, frequently 4 or 8 clock cycles, sometimes more.
A burst read is satisfied with a single address and a multi-cycle data reply:
P ---A------> M
P <---DDDD--- M

Stores may have address and data near to each other:

P ---ADDDD--> M
Decoupled burst store address and store data is also a possibility, but is not common.
= Multiple Transaction Sizes =
Thus we have several different transaction sizes:
;Address Only
P ---A------> M
; Burst Read Data Reply
P <---DDDD--- M

; Burst Store
P ---ADDDD--> M
;Various non-burst or partial
in 8b, 16b, 32b, 64b, 128b, ... sizes
as well as possibly misaligned transfers
P <---D--- M

= Read-Modify-Write =

The following read/write transaction is usually associated with atomic read-modify-write operations (e.g. the GUPS benchmark):

P ---AopD---> M
              op
              store new data value
              return old
P <---D------ M

Note that this can also be done in combination with a cached value:

P $ ------------AopD---> M
    op                   op
    store                store new data value
    ignore (or check)    return old
P $ <-----------D------- M

== Modify Write ==

It is possible to have a modify-write transaction, where an operation is performed at memory, but the old value is not returned:

P ---AopD---> M
              op
              store new data value
P     X <---- M

A read-modify without the write is even less useful; it really amounts to having an implicit operand that does not need to be transferred.

= Masked Write, including Burst Masked Write =

It is common to have a bitmask specifying which bytes should be written:

P ---AmD---> M
The mask is so small that it is often part of the command.
Often a read-modify-write is necessary at the receiving end, e.g. for ECC.
But nevertheless, not across the interconnect.
Less commonly there is a masked read, e.g. for side-effectful memory.
There can also be burst masked writes, where the mask may approach and surpass a data chunk width.
P ---AmDDDD---> M
With [[Bitmask consistency]], burst masked writes become the norm.
Reads that indicate which bytes are an absolute requirement,
whether burst masked reads or otherwise, may also be useful.
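A byte-granularity masked write, performed as a read-modify-write at the memory end (e.g. so ECC can be regenerated over the whole word without the RMW crossing the interconnect), can be sketched as follows. The function is a hypothetical illustration:

```python
def masked_write(memory: bytearray, addr: int, data: bytes, mask: int) -> None:
    """Hypothetical sketch of a byte-masked write: bit i of mask set
    means byte i of data is written; other bytes keep their old value.
    At an ECC-protected memory this becomes a read-modify-write at
    the receiving end: the untouched bytes are read back so ECC can
    be recomputed over the full word."""
    for i in range(len(data)):
        if (mask >> i) & 1:
            memory[addr + i] = data[i]
```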
= Scatter/Gather =
Rather than separate non-burst reads

P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M

... or writes, or RMWs, we might take advantage of burst transfers, even though not necessarily burst memory (DRAM) accesses:

P ---A1.A2.A3.A4---> M
P <---D1.D2.D3.D4--- M

P ---A1D1.A2D2.A3D3.A4D4---> M

P ---A1opD1.A2opD2.A3opD3.A4opD4---> M
P <----------D1.D2.D3.D4------------ M

For the RMW case we might constrain the same op to be used everywhere:

P ---op.A1D1.A2D2.A3D3.A4D4---> M
P <----------D1.D2.D3.D4------- M

= BLUESKY: Reducing the Transaction Types in a Scatter/Gather Memory Subsystem =

; read
P ---A---> M
; gather read
P ---A1.A2.A3.A4.A5.A6---> M
; write
P ---op.A.mask.DDDD---> M
; scatter
P ---op.A1D1.A2D2.A3D3---> M
; reply
P <---id.DDDD--- M

If A and D chunks are the same size:
* the "normal" burst is 6 chunks long (A.mask.DDDD)
* with 6 gather read addresses
* or 3 scatter write or RMW address/data pairs

We might even do away with partial reads (---A--->).
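Packing several gather addresses into burst-sized request packets, as in the A1.A2.A3.A4 transaction above, might look like this hypothetical sketch (a real interconnect would add command and ID fields to each packet):

```python
def pack_gather(addresses, per_packet=4):
    """Hypothetical framing: split a gather read's address list into
    burst-sized request packets of at most per_packet addresses each.
    The last packet may be short."""
    return [addresses[i:i + per_packet]
            for i in range(0, len(addresses), per_packet)]
```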
Possibly make all replies look like writes.
; reply
P <---op.A.mask.DDDD--- M
Further, we might compress the values transferred.
Compression does not help fixed size transactions with fixed size memory cache lines,
except for power.
But compression does help fixed size cache lines with variable length transfers.
And compression will allow more scatter/gather requests to fit in a single packet.
In general, we would assume that in a multistage interconnect, s/g packets will be broken up and routed s/g element by element.
Circuit switched, e.g. optical, works against this.
Thursday, August 26, 2010
New variant of old Windows graphics "window contour trails" bug
This morning, just now, encountered a new variant of an old bug.
I have often run into what I call the Windows contour trail bug: when dragging windows around the screen, they leave a trail of old, partially overwritten window contents, looking rather like mouse trails. It is as if the window were being redrawn every few fractions of a second, each time overwriting the drawing of the window at its old position, while the background never replaced the overwritten window contents at the old position. It fills the screen with rather pleasant abstract art.
Unfortunately, a reboot usually seems to be necessary.
Today, same Windows contour trail problem - but on only one monitor. I have 4 monitors connected to that machine. The laptop LCD panel, and 3 USB display adapters. Only one of the monitors, connected to only one of the USB display adapters, has gone buggy.
Which gives a few hints: it's not a problem with the overall graphics display infrastructure. Apparently just with one display driver.
... While writing this, apparently fixed itself. Possibly related to going in and out of standby. Possibly related to Windows Explorer desktop shell hanging on a network drive.
---
All sorts of thoughts about micro-rebooting running through my head.
Wednesday, August 25, 2010
Hint instructions - CompArch: "Hint instructions"
http://semipublic.comp-arch.net/wiki/Hint_instructions
Oftentimes the hardware, the processor (or memory subsystem, or ...) microarchitecture,
wishes to know information that the programmer or compiler might already know.
Similarly, oftentimes the compiler wishes to know something that the programmer might already know.
Conversely, oftentimes the compiler wishes to know something that the hardware might already know,
if the program has already been executed several times. This last is my cute way of trying to fold
[[profiling feedback]] into this discussion.
Such questions might include:
* "Is this conditional branch more likely to be taken, or not?"
* "Is this load likely to forward from an earlier store, and if so, which?"
* "Is this load likely to be a cache miss?"
TBD: create a canonical list of operations for which hints have been proposed.
= Hints to compiler =
Computer architecture folklore, that probably has a basis in fact (TBD, get reference):
the original FORTRAN compiler allowed the programmer to say which alternative (of its 3-way arithmetic IFs) was most likely.
I.e. it allowed the programmer to give a hint to the compiler.
The story goes on to say that the human programmers were usually wrong.
In any case, there are often ways of conveying hints to the compiler. #pragmas are one example.
[[Profiling feedback]] is another.
Having received a compiler hint, the compiler can change code generation, e.g. the order of basic blocks. Or it might choose to implement the [[Hint instructions]] to hardware described here.
GLEW OPINION: hints from human to compiler should allow at least two representations: (1) inline, and (2) separate.
;Inline:
It should be possible to provide such hints in the source code, as close as possible to where it applies:
...
#pragma IF-THEN probability = 90%
IF ... THEN
...
END IF
...
I am agnostic as to whether the hints or #pragmas should be before the statement, or inside the statement
...
IF ... THEN
#pragma IF-THEN probability = 90%
...
ELSE
#pragma IF-THEN probability = 10%
...
END IF
...
However, it should be noted that designing a pleasant and unambiguous set of #pragmas is a hard language design issue in the first place.
I am NOT agnostic about the following: there should be at least two representations: (1) inline, and (2) separate.
;Separate
Sometimes the programmer does not have the ability to modify the source code.
Then it is necessary to be able to specify the hints separately,
e.g. on the command line
cc -pragma "foo.cc line 22 IF 90%" foo.cc
or in a separate file
foo.cc line 22 /if( c1 < c2 )/ probability 90%
bar.cc line 19 /.../ probability 10%
It is, of course, clumsy to specify the location where hints should apply in a separate file.
Line numbers, possibly with regexps to apply when the line numbers change by editing, may be required.
But it is necessary.
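Matching a separately specified hint back to the source, by line number with a regexp fallback for when editing shifts the lines, could be sketched as follows. The hint-file format is the hypothetical one from the example above, and the function name is made up:

```python
import re

def find_hint_line(source_lines, line_no, pattern):
    """Hypothetical sketch: locate the source line a hint applies to.
    Trust the recorded 1-based line_no if that line still matches the
    regexp; otherwise search the file for the pattern, since line
    numbers drift as the file is edited. Returns a 1-based line
    number, or None if the pattern is gone."""
    idx = line_no - 1
    if 0 <= idx < len(source_lines) and re.search(pattern, source_lines[idx]):
        return line_no
    for i, text in enumerate(source_lines):
        if re.search(pattern, text):
            return i + 1
    return None
```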
= Hint Instruction Set Embeddings =
How can hints be embedded in the instruction set?
* Fields of non-hint instructions
** CON: takes bits away from instruction encodings that are already tight.
** e.g. BRANCH-AND-LINK, with coroutine versus call semantic hints
* Separate HINT instructions
** CON: takes instruction fetch bandwidth
** PRO: more bits available, and does not waste bits if hints are not used
* Separate HINT instruction stream
:: Just as I recommend that compiler hints can be placed inline versus out of line, it has been proposed by some researchers to have hints, and possibly other operations, in a separate instruction stream, paralleling the normal instruction stream. Just as with compiler hints in a separate file, issues arise as to how to establish the correspondence - typically by IP.
* Hint prefixes
On a variable length instruction set like x86, certain prefixes can be applied to existing instructions. If the prefix is originally a NOP, it can be used to provide HINT semantics.
Similarly suffixes, and other modifiers.
* PRO: the hint cost is paid only by the hint user
* CON: variable length instructions and prefixes are a pain to decode
See [[Call skip and hints]] for a proposal to hide hint instructions in the shadow of unconditionally taken branches, to reduce overhead with less complexity than a hint instruction stream.
Another intermediate form is to have hint instructions in the instruction stream, but in a form that allows them to be hoisted out of loops and far away.
E.g. instead of
...
loop:
...
HINT: next loop branch is taken 32 times
jnz loop
...
One might do
...
HINT: loop branch at IP=label is taken 32 times
...
loop:
...
label: jnz loop
...
This makes the trigger for the hint more complicated, but reduces the instruction decoding pressure.
= Hint Instruction Set Encodings =
Probably the most important principle in instruction set design is to trap undefined encodings, so that they can be used for new instruction set extensions on new machines, and so that those new instructions can be trapped and emulated on old machines.
However, a slightly less important principle is to, sometimes, reserve multiple flavors of NOP instructions.
Less frequently NOP operands.
Such NOPs can be reserved for hintability in future processors,
but are treated as NOPs by present processors.
;Anecdote, or, Giving Credit:
Although the principle of reserving NOP HINT instruction set encodings was known during Intel P6,
Ronny Ronen of Intel Israel took the next step:
he realized that we could take several existing undefined instructions that took a #UD trap,
and redefine them to be NOPs reserved for future HINTs.
Although these new NOPs would at first have to be trapped and emulated,
several years later most of the marketplace would be full of machines that treated them as NOPs
- and at THAT future time they could be re-defined as HINT instructions,
executed efficiently as NOPs on newer machines,
and only trapped slowly on really old machines no longer common in the marketplace.
Note that it is important that, although they are treated as NOPs by the hardware, the compiler be discouraged from using any but the canonical NOP.
Should the compiler or programmer start using the different flavors of NOP,
let's call them NOP_HINT_0 ... NOP_HINT_15 or similar,
to convey information (e.g. information to a human viewing an instruction trace in a debugger),
then it may be difficult to grant them specific hint behavior in the future.
= Prefetches and Hints =
Are prefetch instructions hints? It depends.
If prefetch instructions can be ignored, yes, then they are hints. You might choose to ignore a prefetch instruction if some other prediction mechanism overrides it.
If prefetch instructions have side effects, such as setting [[access or dirty bits in page tables]], or even taking [[page faults]],
then it is questionable whether prefetch instructions are hints.
If these side effects are optional, not required, then even such a [[prefetch instruction with architecturally visible side effects]] (*) can be considered hints. If these side effects are not hints, then the prefetch instruction really has significant non-hint consequences.
: * Note: WikiWish: I would like to have [[here labels]] - the ability to define a term inline, saying "here is the definition of a term", using some notation like the double curly braces used to define wikilinks, without all of the formality of creating a wiki page or section. Sometimes the best definitions are given in context, rather than broken out.
= Branch Prediction Hints =
== Call indirect default target hints ==
* [[Call skip and hints]] and [[How_to_use_the_extra_operands_of_indirect_call_instructions#CALL-INDIRECT-WITH-SKIP the discussion that led to it]]
= Security and Reliability are NOP HINTs =
It is possible to implement security and reliability instruction set extensions as NOP HINTs on older machines.
For example, Milo Martin's HardBound bounds checking that detects buffer overflow bugs.
Think about it: for a correctly written program with no buffer overflow bugs,
it should be possible to disable all such security checks, and the program run correctly.
Only incorrect programs would break if the security checks became NOPs.
(Or, malicious programs would break-in.)
Milo Martin's HardBound bounds checking is itself a mechanism for improving software reliability,
and various RAS instruction set features can be treated the same way,
such as "SCRUB ECC FROM CACHE LINE".
= Memory Ordering =
On a sequentially consistent memory ordering model, synchronization fences are hints or nops.
However, on weaker memory ordering models, they are not.
CompArch: "Hint instructions"
http://semipublic.comp-arch.net/wiki/Hint_instructions
Oftentimes the hardware, the processor (or memory subsystem, or ...) microarchitecture,
wishes to know information that the programmer or compiler might already know.
Similarly, oftentimes the compiler wishes to know something that the programmer might already know.
Conversely, oftentimes the compiler wishes to know something that the hardware might already know,
if the program has already been executed several times. This last is my cute way of trying to fold
[[profiling feedback]] into this discussion.
Such questions might include:
* "Is this conditional branch more likely to be taken, or not?"
* "Is this load likely to forward from an earlier store, and if so, which?"
* "Is this load likely to be a cache miss?"
TBD: create a canonical list of operations for which hints have been proposed.
= Hints to compiler =
Computer architecture folklore, that probably has a basis in fact (TBD, get reference):
the original FORTRAN compiler allowed the programmer to say which alternative (of its 3-way arithmetic IFs) was most likely.
I.e. it allowed the programmer to give a hint to the compiler.
The story goes on to say that the human programmers were usually wrong.
In any case, there are often ways of conveying hints to the compiler. #pragmas are one example.
[[Profiling feedback]] is another.
Having received a compiler hint, the compiler can change code generation, e.g. the order of basic blocks. Or it might choose to implement the [[Hint instructions]] to hardware described here.
GLEW OPINION: hints from human to compiler should allow at least two representations: (1) inline, and (2) separate.
;Inline:
It should be possible to provide such hints in the source code, as close as possible to where it applies:
...
#pragma IF-THEN probablility = 90%
IF ... THEN
...
END IF
...
I am agnostic as to whether the hints or #pragmas should be before the statement, or inside the statement
...
IF ... THEN
#pragma IF-THEN probablility = 90%
...
ELSE
#pragma IF-THEN probablility = 10%
...
END IF
...
However, it should be noted that designing a pleasant and unambiguous set of #pragmas is a hard language desgin issue in the first place.
I am NOT agnostic about the following: there should at least two representations: (1) inline, and (2) separate.
;Separate
Sometimes the programmer does not have the ability to modify the source code.
Then it is necessary to be able to specify the hints separately,
e.g. on the command line
cc -pragma "foo.cc line 22 IF 90%" foo.cc
or in a separate file
foo.cc line 22 /if( c1 < c2 )/ probability 90%
bar.cc line 19 /.../ probability 10%
It is, of course, clumsy to specify the location where hints should apply in a separate file.
Line numbers, possibly with regexps to apply when the line numbers change by editing, may be required.
But it is necessary.
= Hint Instruction Set Embeddings =
How can hints be embedded in the instruction set?
* Fields of non-hint instructions
** CON: takes bits away from instruction encodings that are already tight.
** e.g. BRANCH-AND-LINK, with coroutine versus call semantic hints
* Separate HINT instructions
** CON: takes instruction fetch bandwidth
** PRO: more bits available, and does not waste bits if hints are not used
* Separate HINT instruction stream
:: Just as I recommend that compiler hints can be placed inline versus out of line, it has been proposed by some researchers to have hints, and possibly other operations, in a separate instruction stream, paralleling the normal instruction stream. Just as with compiler hints in a separate file, issues arise as to how to establish the correspondence - typically by IP.
* Hint prefixes
On a variable length instruction set like x86, certain prefixes can be applied to existing instructions. If the prefix is oroginally a NOP, it can be used to provide HINT semantics.
Similarly suffixes, and other modifiers.
* PRO: the hint cost is paid only by the hint user
* CON: variable length instructions ad prefixes are a pain to decode
See [[Call skip and hints]] for a proposal to hide hint instructions in the shadow of unconditionally take branches, to reduce overhead with less complexity than a hint instruction stream.
Another intermediate form is to have hint instructions in the instruction stream, but in a form that allows them to be hoisted out of loops and far away.
E.g. instead of
...
loop:
...
HINT: next loop branch is taken 32 times
jnz loop
...
One might do
...
HINT: loop branch at IP=label is taken 32 times
...
loop:
...
label: jnz loop
...
This makes the trigger for the hint more complicated, but reduces the instruction decoding pressure.
= Hint Instruction Set Encodings =
Probably the most important principle in instruction set design is to trap undefined encodings, so that they can be used for new instruction set extensions on new machines, and so that those new instructions can be trapped and emulated on old machines.
However, a slightly less important principle is to, sometimes, reserve multiple flavors of NOP instructions.
Less frequently, NOP operands are reserved as well.
Such NOPs can be reserved for hintability in future processors,
but are treated as NOPs by present processors.
;Anecdote, or, Giving Credit:
Although the principle of reserving NOP HINT instruction set encodings was known during Intel P6,
Ronny Ronen of Intel Israel took the next step:
he realized that we could take several existing undefined instructions that took a #UD trap,
and redefine them to be NOPs reserved for future HINTs.
Although these new NOPs would have to be trapped and emulated at first,
several years later most of the marketplace would be full of machines that treated them as NOPs
- and at THAT future time they could be re-defined as HINT instructions,
executed efficiently as NOPs on older machines,
and only trapped slowly on really old machines no longer common in the marketplace.
Note that it is important that, although they are treated as NOPs by the hardware,
the compiler be discouraged from using any but the canonical NOP.
Should the compiler or programmer start using the different flavors of nop,
let's call them NOP_HINT_0 ... NOP_HINT_15 or similar,
to convey information (e.g. information to a human viewing an instruction trace in a debugger),
then it may be difficult to grant them specific hint behavior in the future.
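A minimal sketch of the reserved-NOP scheme, with invented opcode values (not any real ISA's encodings), showing the same encoding decoding as a plain NOP on an old core and as a numbered hint on a new one:

```python
# Illustrative sketch of the reserved-NOP scheme described above. Opcode
# values are invented; the point is that the same encoding decodes as a
# plain NOP on an old core and as a hint on a new one.

CANONICAL_NOP = 0x90                          # the one NOP compilers may emit
NOP_HINT = {0xE0 + i: i for i in range(16)}   # NOP_HINT_0 .. NOP_HINT_15

def decode_old(opcode):
    # Older core: all reserved encodings are architecturally NOPs.
    if opcode == CANONICAL_NOP or opcode in NOP_HINT:
        return ("nop", None)
    raise ValueError("#UD")

def decode_new(opcode):
    # Newer core: the same encodings now carry hint numbers.
    if opcode == CANONICAL_NOP:
        return ("nop", None)
    if opcode in NOP_HINT:
        return ("hint", NOP_HINT[opcode])
    raise ValueError("#UD")

assert decode_old(0xE3) == ("nop", None)    # silently ignored on old hardware
assert decode_new(0xE3) == ("hint", 3)      # meaningful on new hardware
```

This is exactly why the compiler must stick to the canonical NOP: if software already relies on NOP_HINT_3 meaning something, the new core's interpretation breaks it.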
= Prefetches and Hints =
Are prefetch instructions hints? It depends.
If prefetch instructions can be ignored, yes, then they are hints. You might choose to ignore a prefetch instruction if some other prediction mechanism overrides it.
If prefetch instructions have side effects, such as setting [[access or dirty bits in page tables]], or even taking [[page faults]],
then it is questionable whether prefetch instructions are hints.
If these side effects are optional, not required, then even such a [[prefetch instruction with architecturally visible side effects]] (*) can be considered a hint. If these side effects are required, then the prefetch instruction really has significant non-hint consequences.
: * Note: WikiWish: I would like to have [[here labels]] - the ability to define a term inline, saying "here is the definition of a term", using some notation like double curly braces used to define wikilinks. Without all of the formality of creating a wiki page or section. Sometimes the best definitions are given in context, rather than broken out.
= Branch Prediction Hints =
== Call indirect default target hints ==
* [[Call skip and hints]] and [[How_to_use_the_extra_operands_of_indirect_call_instructions#CALL-INDIRECT-WITH-SKIP the discussion that led to it]]
= Security and Reliability are NOP HINTs =
It is possible to implement security and reliability instruction set extensions as NOP HINTs on older machines.
For example, Milo Martin's HardBound bounds checking that detects buffer overflow bugs.
Think about it: for a correctly written program with no buffer overflow bugs,
it should be possible to disable all such security checks, and the program run correctly.
Only incorrect programs would break if the security checks became NOPs.
(Or, malicious programs would break-in.)
Similar to Milo Martin's HardBound bounds checking, which is itself a mechanism for improving software reliability, various RAS instruction set features can be treated the same way.
Such as "SCRUB ECC FROM CACHE LINE".
= Memory Ordering =
On a sequentially consistent memory ordering model, synchronization fences are hints or nops.
However, on weaker memory ordering models, they are not.
Tuesday, August 24, 2010
My LinkedIn status: Hotchips is over. AMD Bulldozer was bittersweet. Godson, from China, looks good, looks like it will give the US a run for the money. IBM Power7 PERC does proper scatter/gather with their GUPS remote atomic update instruction. Dark Silicon is the new buzzword. Explaining what Intellectual Ventures does 10 times a day strengthened my hope in IV's mission. Interacting with other computer architects feels good. Now back to work.
How to use the extra operands of indirect call instructions - CompArch: "How to use the extra operands of indirect call instructions"
http://semipublic.comp-arch.net/wiki/How_to_use_the_extra_operands_of_indirect_call_instructions
= ISA Encoding Issues and Branches =
Instruction sets often support standard encodings with fixed fields for operands such as register sources and destinations, or [[immediate literal offset constant]] operands:
* OP: destreg := src1reg OP src2reg
* OP: destreg := src1reg OP immediate
* OP: destreg :OP= src1reg
* OP: destreg :OP= immediate
Control flow instructions usually don't fit this pattern. I.e. control flow instructions such as jumps don't really have a destination register, unless you count the program counter. Moreover, control flow instructions want to have a large scope - i.e. they want the immediate constant to be as broad as possible.
E.g. while the following reuse of a standard format "fits":
* BRANCH pcreg += offset
and allows other register+=constant arithmetic to benefit, even on machines that normally have a three operand format (dest,src1,src2 reg or constant)
* ADD destreg += large-constant
it is even better to fold the bits used for destreg into the offset, and let the branch instructions have an implicit PC destination, or really src/dest for PC-relative branches.
* BRANCH even-larger-offset
:: pc += even-larger-offset
(This even-larger-offset is most easily used for unconditional branches. Conditional branches need some bits to specify either the register operand containing the condition, or the nature of the condition on a condition code ISA.)
This discussion of even-larger-offsets applies to "near" or "local" branches and CALLs - usually PC-relative. Indirect branches and indirect calls do not necessarily need an even-larger-constant operand. So the question arises: what instruction format should they have?
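To illustrate the payoff of folding the destination register bits into the offset, a back-of-envelope sketch with illustrative field widths (a hypothetical 16-bit offset and 5-bit register field, not any real ISA):

```python
# Back-of-envelope sketch of the point above: folding a hypothetical
# 5-bit destination-register field into the branch offset widens the
# reachable range 32x. Field widths are illustrative, not any real ISA.

def reach(offset_bits, instr_bytes=4):
    # Signed offset, scaled by instruction size: +/- 2^(bits-1) instructions.
    return (1 << (offset_bits - 1)) * instr_bytes

plain = reach(16)        # BRANCH pcreg += offset, 16-bit offset field
folded = reach(16 + 5)   # 5-bit destreg field folded into the offset

assert plain == 131072        # +/- 128 KiB of reach
assert folded == 32 * plain   # +/- 4 MiB of reach
```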
= Indirect Branch =
Probably simplest is to have a single register field format. For indirect branch:
* INDIRECT-BRANCH: reg
:: pcreg := reg
We won't talk much about indirect branch after this. A single register field is good enough, except for possibilities such as
* INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE: reg, offset
:: pcreg := offset (of a jump table) + reg
:: OR pcreg := offset (of a jump table) + reg * scale
versus
* INDIRECT-BRANCH-TO-MEMORY-TARGET: reg, offset
:: tmpreg := load( offset-of-jump-table + reg )
:: pcreg := tmpreg
Note that these two flavors are especially suited to switch tables.
The former, INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE, computes an address and jumps to it. Typically the table is a table of jump instructions.
The scale factor allows more than a single instruction to be in each entry in the table.
* PRO: This table can be embedded in the instruction stream without issues of mixing code and data, since it contains instructions. It can be PC-relative.
* CON: consumes extra entries in the branch predictor. Less efficient use of bits, since there will be jump opcodes.
The latter, INDIRECT-BRANCH-TO-MEMORY-TARGET, isn't really a RISC instruction, since it does a load and a temporary-register indirect branch. You can always synthesize it by separate instructions. As for INDIRECT-BRANCH-TO-MEMORY-TARGET:
* PRO: smaller than separate instructions
* PRO: more efficient in use of space than INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE
* CON: not RISCy
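A small sketch contrasting the semantics of the two flavors, with invented addresses and an illustrative scale factor:

```python
# Sketch contrasting the two switch-table flavors above. The "jump
# instruction table" variant computes a target inside a table of jump
# instructions; the "memory target" variant loads the target address
# from a data table. All addresses are invented for illustration.

SCALE = 4  # bytes per table entry (one jump instruction, say)

def branch_to_jump_table(table_base, reg):
    # INDIRECT-BRANCH-TO-JUMP-INSTRUCTION-TABLE: jump *into* code.
    return table_base + reg * SCALE          # new pc: a jump instruction

def branch_to_memory_target(mem, table_base, reg):
    # INDIRECT-BRANCH-TO-MEMORY-TARGET: load the target, then jump.
    return mem[table_base + reg * SCALE]     # new pc: the loaded address

# A data table of four case-handler addresses, one per switch case.
target_table = {0x2000 + i * SCALE: 0x3000 + i * 0x100 for i in range(4)}

assert branch_to_jump_table(0x1000, 2) == 0x1008       # lands on a JMP
assert branch_to_memory_target(target_table, 0x2000, 2) == 0x3200
```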
= Indirect CALLs =
Branch-and-link, a common RISCy form of a register indirect call, has two register fields:
* BRANCH-AND-LINK: linkreg := current-pc; pc := srcreg
However, calling conventions usually mandate a single fixed register for the link register. So it is suboptimal to waste bits specifying the link register:
* BRANCH-AND-LINK: fixed-link-register := current-pc; pc := srcreg
So now we have branch and link wanting to have a single register field. Similarly for a CALL instruction that pushes the old instruction pointer on a stack:
* BRANCH-AND-LINK: srcreg
:: M[sp] := current pc
:: pc := srcreg
:: sp -= sizeof(pc)
(Note that such a call is transactional: you may not want to have modified the stack if the call fails, etc. Considerable ingenuity goes into creating CALL instructions with one, or the fewest possible, microinstructions.)
If you want to support a monadic register operand instruction, great. You might want to use it for other instructions as well.
However, there may be ways of using extra operands, particularly literal operands, for CALL indirect.
For example, many indirect CALLs are to the member functions of object classes. These are obtained by a vtable lookup:
* CALL reg+offset
:: tmpreg := load M[reg + small-offset]
:: pc := tmpreg
This is like BRANCH-INDIRECT-TO-MEMORY. Difference: with branch indirect used for switch tables, the register tends to be a small offset into the table, whereas for method calls the register tends to be a full range address of a heap object.
The usual pros and cons:
* PRO: smaller than separate instructions
* CON: not RISCy
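A minimal sketch of these semantics, with invented addresses; the register is assumed to already hold the vtable pointer:

```python
# Sketch of the method-call flavor above: one instruction does the load
# from the vtable plus the indirect transfer. Addresses are invented.

def call_indirect_mem(mem, reg, small_offset):
    # CALL reg+offset:  tmpreg := load M[reg + small-offset];  pc := tmpreg
    return mem[reg + small_offset]

vtable = 0x4000
mem = {vtable + 0: 0x5000,   # slot 0: first virtual method
       vtable + 4: 0x5100}   # slot 1: second virtual method

# reg holds the vtable pointer (loaded earlier from the object header)
assert call_indirect_mem(mem, vtable, 0) == 0x5000
assert call_indirect_mem(mem, vtable, 4) == 0x5100
```

Note that, as the text says, `reg` here is a full-range heap address, unlike the small table index typical of switch tables.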
Q: what else can we use extra operands on CALL indirect for?
= Default Indirect Target =
One of the biggest problems with indirect function calls (and to a lesser extent indirect branches) is target prediction.
On a machine with a short pipeline, you might only need a [[branch direction predictor]], taken/not-taken.
You might be able to get away without a [[BTB]], if the decoder arrives soon enough after instruction fetch,
since for most control flow transfers the branch target (or call target) is a constant, embedded in the instruction stream.
Not so for indirect transfers: the indirect branch or call target is not in the instruction stream.
Even if it is already known, it is hard to access.
On machines with deeper pipelines, [[BTB]]s (branch target buffers) are needed so that instruction fetch can be resteered even though the branch target from the instruction stream has not been decoded. But there is still an advantage to having the branch or call target in the instruction stream: the decoder can detect a BTB miss, and resteer, without waiting for execution.
Trouble is, every indirect branch or call needs a BTB entry.
Most instruction sets admit PC-relative or CS-relative or absolute function addresses.
Things are made even worse on instruction sets, such as the old IBM 360, where most CALLs are base register relative
- implying nearly all branches need a BTB entry.
But the transition to object oriented code makes this worse. Method calls, virtual function calls, are indirect, and hence require BTB entries.
What is especially frustrating is that oftentimes we know what the most probable indirect branch or CALL target is.
So we will proceed to discuss how to convey this knowledge across the instruction set to the branch prediction hardware.
== Hint ==
The most obvious way is to use HINT instructions. E.g.
:HINT: call target = F00
:BRANCH-AND-LINK: (reg)
The hint instruction can be scheduled in advance.
== Front-end Registers and Prepare-to-Branch ==
Some instruction sets try to give the compiler explicit control by providing a set of front-end registers. You move an indirect branch or call target to such a register, and then branch. The move and the branch can be scheduled apart. Unfortunately, they are often hard to separate: consider a pointer-chasing loop calling a virtual function.
== Other Tricks ==
In the P6 (Pentium Pro) Code Optimization Handbook I recommended that switch table code be generated as follows:
ideally:
:JMP (reg)
:default-target follows jump instruction
less ideally:
:JMP (reg)
:JMP default-target
First, because there was no HINT instruction in x86 at that time.
Second, more important, because at least instruction fetch would have prefetched the default target in the former case,
while decoder shortstop branch prediction might do the right thing in the latter case.
Arguably, these are better than HINT instructions, since they are not executed if branch prediction is working.
They only kick in if branch prediction is not working.
Unfortunately, these techniques cannot be used for indirect CALL.
It is impractical to rearrange function code to fall inline after the call instruction, especially if it is separately compiled.
And you cannot do
:CALL (reg)
:JMP default-target
or
:CALL (reg)
:CALL default-target
because the function return will return to the instruction after the original CALL.
It is tempting to try to cram a default target into the CALL instruction as an extra operand. But that's like a hint: it wastes bits if predicting correctly.
Here's a slightly less wasteful alternative: give CALL the ability to say that the next few instructions should be skipped on return:
:CALL-INDIRECT-WITH-SKIP (reg), skip
:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ])
:CALL default-target
:skip:
or
:BRANCH-AND-LINK: (reg),skip
:: linkreg := pc + skip
:: pc := reg (or ...)
:SOMETHING-WITH default-target
:: could be a hint, or a branch, or ...
:skip:
This is not like x86 RETn: that specifies a stack offset. This is a code offset.
It is better to do this at CALL site than RETurn site, since it doesn't affect calling conventions.
Let's call this [[CALL-INDIRECT-WITH-SKIP]]. For that matter, we could have [[BRANCH-AND-LINK-INDIRECT-WITH-SKIP]], although the latter is not really necessary.
The skip is really probably only a few instructions. I.e. it is not a very big extra operand. Maybe it should be full width: this gives us a sort of threaded code style.
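A minimal sketch of the pushed-return-address arithmetic, with invented instruction lengths: the return address points past the skip bytes holding the fallback, so on a correct prediction the fallback is never executed, and the return lands after it either way.

```python
# Sketch of CALL-INDIRECT-WITH-SKIP semantics as pseudocoded above.
# Instruction lengths and addresses are invented for illustration.

def call_indirect_with_skip(stack, pc, call_len, skip, target_reg):
    # push current-pc + call-length + skip;  pc := reg
    stack.append(pc + call_len + skip)
    return target_reg              # new pc: the indirect target

CALL_LEN = 4                       # bytes of the CALL itself, illustrative
SKIP = 4                           # one fallback CALL/JMP hidden in the shadow

stack = []
new_pc = call_indirect_with_skip(stack, pc=0x100, call_len=CALL_LEN,
                                 skip=SKIP, target_reg=0x9000)
assert new_pc == 0x9000            # control goes to the indirect target
assert stack == [0x108]            # return skips the fallback at 0x104
```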
:META-COMMENT: what we are seeing here is that, in some situations, it is better to move HINT instructions out of the code. In this case, in the code stream, but in a place where it would not normally be executed except by mis-speculation. Actually, in all cases of correct prediction, hint instructions are wasteful. There have been proposals to move hint and other meta-instructions and directives completely out of line, to a code stream paralleling the original code stream. But this has complexity. Hiding hints after unconditionally taken branches in this manner is easier to manage.
== Object Class Indication ==
Here's another way of using the extra operands for indirect CALLs.
Oftentimes, objects with virtual functions go to a single common place, that can be predicted early - perhaps at compile time.
However, with separate compilation and linking, the virtual functions may not really be known until link time.
Still, we can talk about having linkers patch the code to provide the possible hints.
But in some situations the virtual functions are not known at link time.
I.e. object inheritance is genuinely being used:
polymorphic objects with different virtual functions for the same method may be in use.
If the objects really are randomly and dynamically polymorphic, there isn't much we can do.
But oftentimes, polymorphism is being used, but is restricted or semi-constant. E.g. you might use a factory to create different polymorphic objects for an interface, but thereafter all instances of that object would be the same. I.e. the default virtual function target is not known at compile or link time, but is constant for the life of the program for objects of that class.
BTBs handle this, but at the cost of one BTB entry per virtual function call.
Here's a slightly offbeat idea, slightly new to me: It may be advantageous for a compiler to inject a level of indirection:
Instead of
:...
:CALL (reg)offset // using virtual function table
:...
:CALL (reg)offset // using virtual function table
:...
:CALL (reg)offset // using virtual function table
...
using a separate BTB entry for each
it may be better to do:
:...
:JMP local-virtual-function-proxy-offset // using virtual function table
:...
:JMP local-virtual-function-proxy-offset // using virtual function table
:JMP local-virtual-function-proxy-offset // using virtual function table
:...
with
:local-virtual-function-proxy-offset: // using virtual function table
:: CALL (reg)offset
* PRO: a single BTB entry for the actual CALL
* CON: BTB entries for the jumps
** PRO: the jumps can be decoder predicted if the BTB misses
* CON: would have to arrange for all indirect sites to use the same register
It may be possible to use the extra operand of an indirect CALL to accomplish the same thing.
Basically, we are trying to establish equivalence classes that might be used to reduce BTB pressure.
The most obvious thing is to put an "equivalence class number" into the indirect CALL, and use that to index an "equivalence class BTB" at decode time.
However, [[CALL-INDIRECT-WITH-SKIP]] accomplishes this more flexibly:
:CALL-INDIRECT-WITH-SKIP (reg), skip
:: push current-pc + skip
:: pc := reg (or pc := load M[ reg + offset ])
:HINT: use same branch (target) prediction as some other instruction
:skip:
Two new ideas: CALL-INDIRECT-WITH-SKIP and hint to use same prediction as some other instruction.
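A minimal sketch of an equivalence-class BTB, with invented structure, showing how several call sites sharing a class number share one predicted target:

```python
# Sketch of the equivalence-class idea above: call sites that share a
# class number share one predicted target, so N polymorphic call sites
# cost one table entry instead of N. Structure is invented for illustration.

class EquivClassBTB:
    def __init__(self):
        self.by_class = {}

    def predict(self, eq_class):
        return self.by_class.get(eq_class)      # shared predicted target

    def update(self, eq_class, actual_target):
        # Any one call site resolving its target trains all its peers.
        self.by_class[eq_class] = actual_target

btb = EquivClassBTB()
call_sites = [0x100, 0x180, 0x220]   # three sites, all with class number 7
btb.update(7, 0x5000)                # one site resolves the target...
for site in call_sites:              # ...all sites now predict it
    assert btb.predict(7) == 0x5000
assert len(btb.by_class) == 1        # one entry, not one per site
```

A conventional BTB keyed by call-site IP would need three entries here; the class-keyed table needs one, which is exactly the pressure reduction the text is after.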