Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, April 05, 2011

Why_saying_"You_need_N-way_associativity_in_the_TLB_to_guarantee_forward_progress"_is_bogus_and_reflective_of_an_in-order_mindset

http://semipublic.comp-arch.net/wiki/Why_saying_%22You_need_N-way_associativity_in_the_TLB_to_guarantee_forward_progress%22_is_bogus_and_reflective_of_an_in-order_mindset
[[Category:Virtual Memory]]

= The Annoying Quote =

One occasionally hears (and reads in computer architecture textbooks) statements such as

"Instruction set XXX can access N memory locations in a single instruction, so therefore requires a TLB with at least N-way associativity."

This statement is wrong in several ways, both as an underestimate and as an overestimate.
It reflects ignorance of out-of-order processors, and even for in-order processors it reflects ignorance of other implementations.

= Basic Assumption =

The basic assumption begind this statement is something like this:

Assume you have an instruction whose operation may be described in pseudocode or microcode as:
tL1 := load(M[A1])
...
tLN := load(M[AN])

tC1 := f1(tL1..tLN)
...
tCN := fN(tL1..tLN)

store( M[A1] := tC1 )
...
store( M[AN] := tCN )

Assume that you have a [[restart versus resume instruction exception|restart and not a resume instruction fault architecture]] - i.e. assume that you must access all of M[A1] .. M[AN], eventually, without a fault or TLB miss.
Then you need N TLB entries to hold all of the [[translation]]s for M[A1]..M[A2].

Or, equivalently, assume that you are not allowed to take a fault or TLB miss in the "[[commit phase]]" of the instruction.
Then, once again, you need N TLB entries to hold all of the [[translation]]s for M[A1]..M[A2].

Sounds simple, eh?

= Out-of-order with Speculative TLB miss handling =

This betrays an in-order mindset. It does not necessarily work on an [[out-of-order]] machine that does not block on a TLB miss,
i.e. which can perform ALU operations, memory references, and TLB misses out-of-order.

It doesn't work because, even though an operation such as load, store, or a [[tickle]] in the pseudocode or microcode for an instruction may load an TLB entry,
this TLB entry may be thrashed out of the TLB by (a) a later operation in same instruction, (b) an earlier operation in the same instruction (remember, out-of-order), or (c) a TLB use from a different instruction (remember, out-of-order and non-blocking: other instructions may be executing at the same time).

In particular, not that there is northing that says that the operations within a single instruction will be performed in-order --- and, indeed, on an out-of-order machine like the Intel P6 where the microinstructions within an instruction were performed out-of-order - so you can't necessarily make assumptions about the order of accesses, how they will affect LRU, etc.

== Kluges to Make It Work ==

* Allow out-of-order between instructions, but impose ordering restrictions within an instruction - e.g. by implementing every instruction with a strictly in-ordr state machine.

* TLB misses in-order, at commit time

= Other Implementations =

== Save Translations ==
Allow a "translation" to be saved in a register.
tT1 := save_translation_and_permission(M[A1])
...
tTN := save_translation_and_permission(M[AN])

tL1 := load_using_saved_translation_and_permission(M[phys_tr=tT1])
...
tLN := load_using_saved_translation_and_permission(M[phys_tr=tTN])

tC1 := f1(tL1..tLN)
...
tCN := fN(tL1..tLN)

store_using_saved_translation_and_permission( M[phys_tr=tT1] := tC1 )
...
store_using_saved_translation_and_permission( M[phys_tr=tTN] := tCN )

Issues:
* such a "saved translation" should contain not only the physical address corresponding to a virtual address, but also permissions.
* it is relatively easy to provide such an operation to microcode. It is harder to make it available to software.
** it obviously cannot be provided to user code
** virtualizing such a saved translation may be a challenge - even the OS may not be allowed to see the true physical address or permissions

== Who Cares? So What? ==

One could take the viewpoint of "Who Cares?": allow multiple TLB misses in the same instruction, don't restart.

Issue:
* there arises the possibility of [[intra-instruction translation inconsistency]] - different accesses to the same virtual address may receive different translations, different physical addresses, or, perhaps worse, a check may pass but a subsequent access fault.

Again, one may say "Who cares? The OS should not be changing translations while an instruction may be in flight."

But
* Saying that the OS should not do something does not always mean that it will
* An OS implemented on top of a VMM may lead to issues: the VMM may not be tracking where the OS keeps its page tables
* While this strategy may be acceptable, it may have lousy performance because of the necessity of stopping multiple processors for a [[TLB shootdown]] while changing a translation, e.g. in a [[page table]] in memory.

= Conclusion =

Saying
"Instruction set XXX can access N memory locations in a single instruction, so therefore requires a TLB with at least N-way associativity."

* underestimates the TLB entries required for an out-of-order processor with non-blocking TLB miss handling

* overestimates the TLB entries required for several reasonable implementation strategies, such as "Save Translations" and "Who Cares?"

More accurately, one might say

"Instruction set XXX can access N memory locations in a single instruction,
and for a certain set of microarchitecture assumptions
may require a TLB with at least N-way associativity."

But there are other ways...