The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 27, 2011

Cache replacement - tweaks and variations

{{Terminology Term}}

[[Cache replacement]] or [[cache victim selection]] is the method one uses to choose which cache line will be overwritten or replaced, i.e. victimized, when a new cache line is pulled in.

Common algorithms include
* [[Random replacement]]
* [[LRU (least recently used)]]
** approximations to [[LRU]]:
*** [[pseudo-LRU]] or [[tree LRU]]
*** [[the clock algorithm]]
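
As an illustration of how cheap an LRU approximation can be, here is a minimal sketch of [[tree LRU]] state for a single 4-way set, using 3 bits instead of a full ordering. The class name `TreePLRU4` and the bit encoding (0 means "victim is on the left") are assumptions made for this example, not a description of any particular implementation.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set (3 bits). Hypothetical sketch."""

    def __init__(self):
        # bits[0] is the root; bits[1] covers ways 0-1; bits[2] covers ways 2-3.
        # A bit value of 0 means "the victim is on the left", 1 "on the right".
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1          # victim side becomes the right half
            self.bits[1] = 1 - way    # way 0 -> point at way 1, way 1 -> way 0
        else:
            self.bits[0] = 0          # victim side becomes the left half
            self.bits[2] = 3 - way    # way 2 -> point at way 3, way 3 -> way 2

    def victim(self):
        """Follow the bits from the root down to the pseudo-LRU way."""
        if self.bits[0] == 0:
            return 0 + self.bits[1]
        return 2 + self.bits[2]
```

Touching ways 0, 1, 2, 3 and then 0 again shows the approximation: true LRU would evict way 1, whereas this tree points at way 2.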

= Minor Tweaks =

== When is the cache [[LRU information]] or [[cache usage information]] updated? ==

E.g. do you update it speculatively, for instructions that may be on a [[branch misprediction wrong path]], thereby corrupting the LRU?

Should you update at retirement? (Probably not, but just in case.)
:By the way, here is a reason to update at retirement: so that you get more [[deterministic behavior]], which eases validation. However, this only works if you do not do [[speculative cache misses]].

Perhaps you should only update it for slightly speculative instructions:
* e.g. do not update for highly speculative [[SpMT]] threads?
* e.g. update only for instructions that have not retired, but for which all earlier branches have been resolved. (Such instructions are still speculative - there may be a page fault or exception or interrupt - but they are only [[slightly speculative]].)

== What requests update the LRU information? ==

Should all requests, reads and writes, update the LRU information equally?

Many have proposed [[non-temporal hint bits]] in instructions, to say "ignore this access".

It is an open issue whether prefetches, whether initiated by [[prefetch instruction]]s or by [[hardware prefetchers]],
should update the LRU.
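
To make the question concrete, here is a minimal sketch of one cache set with true LRU in which each request says whether it is allowed to update the usage information. The class `LRUSet` and the `update_usage` flag are hypothetical names for this example. The choice to fill a non-updating miss at the LRU position is one assumption among several possible policies.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU; all names here are hypothetical."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # ordered from LRU (first) to MRU (last)

    def access(self, tag, update_usage=True):
        """Return True on a hit.

        update_usage=False models a non-temporal or prefetch request that
        is not allowed to promote the line.
        """
        if tag in self.lines:
            if update_usage:
                self.lines.move_to_end(tag)          # promote to MRU
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)           # evict the LRU line
        self.lines[tag] = True                       # fill at the MRU position
        if not update_usage:
            # Assumption: a non-updating fill lands at the LRU position.
            self.lines.move_to_end(tag, last=False)
        return False
```

With this sketch, a line touched only by non-updating requests stays first in line for eviction, which is the intended effect of an "ignore this access" hint.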

In multilevel caches, the outer cache's LRU information could be updated:
* on all accesses
* on misses
* on [[LRU leak-through]] from the inner cache
* only on capacity evictions from the inner cache (suggested in comp.arch by EricP on 2/25/2011)

By the way, the idea of updating the cache usage information only on capacity evictions from the inner cache
exposes several issues:

* Updating the LRU information and advancing the LRU pointer are two different issues.
* Invalidation traffic may result in several coherency cache misses between capacity misses. It would be bad to keep choosing the same victim.
* Oftentimes coherency misses do not need a victim chosen: they fill into the empty or stale or non-present line left behind by the invalidation. (This may not happen in all systems.)
* Dirty writebacks naturally notify the outer cache of a capacity replacement. However, replacing a clean line may not naturally require such notification: i.e. we may have [[silent replacement of clean cache lines]]. This may require [[LRU leak-through]] so that the outer cache can track.

== When is the victim chosen? ==

Should you choose the victim at the time the cache is missed,
or at the time the data to be placed in the cache line has returned?

It may not matter on an in-order machine. However, on an out-of-order machine, choosing the victim early may raise issues, such as what should happen if too many cache misses to the same set occur,
i.e. if the victim itself needs to be victimized before the first victim's replacement has arrived.

One advantage of choosing the victim early is that you may be able to send the data for the [[cache line fill]]
returning directly to where it belongs in the cache - you may not have to stage it through a [[fill buffer]],
and you may be able to avoid fairly expensive [[fill buffer forwarding]] logic.
I.e. you may be able to have [[data-less fill buffers]].

:Half-baked idea: choose the victim early, but also allocate a fill buffer. At [[fill time]], determine if the early victim choice is still accurate. If not, write to the fill buffer. (I call this a half-baked idea because it doesn't really solve the problem of wanting to avoid data fill buffers.)
:Elaboration: choose the victim early. Allocate an [[address and control fill buffer]], but do not allocate a [[data fill buffer]]. If the victim is thrashed, allocate a [[data fill buffer]] (which may simply be a [[spill buffer]] allocated circularly). At fill time, choose.

== Biasing the victim choice ==

* Prefer invalid lines, rather than replacing valid lines.
* Prefer clean lines to dirty lines
:: Thereby avoiding [[dirty writeback]] traffic.

* in multilevel caches, prefer lines that are in none of, or the fewest, inner caches

* prefer to replace lines containing data to lines containing instructions
** or vice versa - although oftentimes [[I$]] misses are more expensive than [[D$]] misses

* extend the above to more data types - integer versus FP (FP is often accessed in a cache thrashing manner)
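
The first two preferences above can be sketched as a ranking function over the lines of one set. This is a minimal sketch under stated assumptions: the `Line` record, its `age` field (larger means less recently used), and the strict ordering "invalid, then clean, then dirty" are all hypothetical choices for illustration. A real design might weigh age against dirtiness rather than ranking strictly.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool
    dirty: bool
    age: int  # hypothetical encoding: larger means less recently used


def choose_victim(lines):
    """Return the index of the way to replace in one set.

    Preference order: invalid lines, then clean lines, then dirty lines.
    Within a class, the oldest line is chosen.
    """
    def rank(indexed):
        i, line = indexed
        if not line.valid:
            cls = 0      # nothing is lost by overwriting an invalid line
        elif not line.dirty:
            cls = 1      # clean line: no writeback traffic
        else:
            cls = 2      # dirty line: costs a writeback
        return (cls, -line.age, i)

    return min(enumerate(lines), key=rank)[0]
```

For example, in a set holding two dirty and two clean lines, this picks the older of the two clean lines even if a dirty line is older still.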

= Better cache replacement algorithms =

There has been much work on trying to adjust these algorithms.

Ideally, we know that [[Belady's algorithm]] is optimal under many assumptions.
It amounts to "replacing the cache line that will be used furthest in the future".
It must be adjusted when there are non-uniform costs.
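
As a reference point, here is a minimal sketch of Belady's choice applied offline to a known trace, for a fully associative cache with uniform miss costs. The function name `belady_misses` and the trace representation (a list of line addresses) are assumptions for this example. It is only usable in simulation, since it looks into the future of the trace.

```python
def belady_misses(trace, capacity):
    """Count misses for a fully associative cache using Belady's choice."""
    cache = set()
    misses = 0
    for pos, addr in enumerate(trace):
        if addr in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(line):
                # Position of the next reference to `line`, or infinity
                # if it is never referenced again.
                for later in range(pos + 1, len(trace)):
                    if trace[later] == line:
                        return later
                return float("inf")

            # Evict the line whose next use is furthest in the future.
            cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return misses
```

On the cyclic trace `[0, 1, 2, 3] * 3` with a capacity of 3 lines, this incurs 6 misses out of 12 accesses.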

There have been attempts to [[approximate Belady cache replacement using speculation and lookahead]]:
* E.g. do not replace a line that is used by an instruction in the instruction window. This works better when you have a really large instruction window, such as might be provided by [[SpMT]]. It also may not be necessary if the [[LRU bits]] or [[cache usage information]] is updated speculatively.
* Others have attempted to unroll memory access pattern predictors, to get a list of predicted accesses, against which a Belady query can be made.

See [[Victim Choice for Multilevel and Shared Caches]] for a discussion of issues in multilevel cache victim selection.
One of the main issues is that in a [[multilevel cache hierarchy]] the LRU bits of the outer caches are not adjusted
by accesses to the inner caches, so choosing a victim based purely on LRU bits updated only by accesses sent to the outer cache
is often not good.
Many proposals use [[leak-through LRU]] information to allow the outer cache to track the inner LRU.

It is well known that certain access patterns are not well suited to LRU cache replacement.
For example, circularly accessing N lines in a cache of M lines, where M < N, is better served by MRU cache replacement than by LRU.
Many have proposed to exploit this, e.g. by [[non-temporal]] hint bits attached to instructions,
or by predictors that attempt to identify such non-temporal cache access patterns.
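
The circular access example can be checked with a small simulation. This is a minimal sketch assuming a fully associative cache. The function `count_hits` and its `policy` argument are names invented for this example.

```python
from collections import OrderedDict


def count_hits(trace, capacity, policy):
    """Simulate a fully associative cache; policy is "lru" or "mru"."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
            continue
        if len(cache) >= capacity:
            # last=False pops the least recently used entry,
            # last=True pops the most recently used entry.
            cache.popitem(last=(policy == "mru"))
        cache[addr] = True
    return hits


trace = [0, 1, 2, 3] * 3                  # N = 4 lines accessed circularly
lru_hits = count_hits(trace, 3, "lru")    # M = 3 lines of cache
mru_hits = count_hits(trace, 3, "mru")
```

On this trace LRU gets no hits at all, because each line is evicted just before it is reused, while MRU gets 6 hits out of 12 accesses.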
