This is an almost complete design to approximate CLOCK-Pro replacement in a zoned VM with a big hash table instead of a non-resident list. Step 2 probably needs some work to filter out correlated references...
if, on page fault time, we fault in a page that was on the non-resident table (ie. recently evicted) and its inter reference distance is smaller than that of the least active page on the active list, we add the page to the active list immediately and increase the number of inactive pages we keep around - see Inter Reference Distance below.
active pages being referenced should also be used as an indication that we are indeed caching the working set, and the active list should not be shrunk too much - this emulates the HANDhot clears out Mtest pages from the CLOCK-Pro algorithm, but probably should not carry as much weight as recycling an occupied slot in the non-resident table
[SongJiang: The operation that HANDhot clears out the LRU non-resident cold pages out of the clock (free an occupied hash table slot) is critical in CLOCK-Pro, which makes sure that the reuse distances of cold pages (including non-resident) are always less than those of hot pages. For this reason, we might have to consider merging the two lists into one clock list. - RikvanRiel: could the mechanism described in Inter Reference Distance take care of this?]
if we end up refaulting too much (ie. too many pages have a reference distance of <2 times the size of memory), and the system load is high, we enable the swap token mechanism to prevent thrashing - we can measure this by looking at the goal for the number of inactive pages that this CLOCK-Pro approximation ends up setting
[SongJiang: The maximum reuse distance test period is 2 times the size of memory. However, the test period can be less than that and is variable. The period is basically determined by HANDtest. However, it is also related to HANDhot, because we ensure HANDtest always stays before HANDhot (i.e HANDhot might clear non-resident cold pages and push HANDtest ahead). Does the "system load" mean working set size? It doesn't make sense to me why the "refaulting" could cause thrashing. ]
[SongJiang: Being adaptive on hot/cold memory allocation makes thing less straightforward. I am going to clarify this tomorrow. I suggest fixing the cold page memory allocation at the first stage of the project in order to better understand the correslation between the algorithm and the measured performance.]
Linux vs CLOCK-Pro translations
Correlated References Filter
In point 2. above, we really only want to promote pages from the inactive list to the active list when they were referenced again after their initial page fault and reference. In short, we need a way to filter out correlated references. One idea would be to have a new pages list that's maybe 5% or 10% the size of the inactive list. New pages (after a page fault of an unknown page) go onto this list, and the reference bit gets cleared when the page leaves the list, to be placed at the head of the inactive list.
Another, possibly simpler, idea is to simply flag new (not refaulted from Mtest, or evicted from the active) pages with a PageNew flag. When we run into a PageNew page that was referenced, instead of promoting that page to the active list, we give it another go-around on the inactive list. If it got referenced again, on its second go-around, it will get promoted to the active list. Pressure from refaulting will make sure the active and inactive lists remain at the right size to give new pages a fair chance at being promoted before getting evicted.
Readahead & Inactive Target Adjustments
Readahead and IO clustering provides a bit of a conundrum. On the one hand, we are faulting these pages into memory, but on the other hand this isn't done because anybody actually needs the page right now. My current thinking is to use the "catch correlated references" logic to activate and count these pages. If the page was referenced when it reaches the end of the new pages list, we treat it as if it were a fresh page fault, clearing the entry in the page's entry in the non-resident table and adjusting the inactive target as needed.
Detecting Readahead Thrashing
When a page is being evicted, we could check to see if its information is still in the recently evicted table. If it is, we know we evicted a page before we got a chance to use it, and could have avoided the IO. If it happens frequently, we should think about reducing the amount of readahead being done, or maybe think about load control...
Inter Reference Distance
The non-resident set of pages consists of hash bucket within which we do FIFO/circular replacement of entries. By restricting the recently_evicted() function to looking at only the last N entries within each circular buffer, we can limit the size of the non-resident list. Let Tfoo be the time a page spends on list foo. If we keep Tcold + Ttest < Thot we can be reasonably sure that when we add a faulted in Mtest page to the active list, that page has a smaller inter-reference distance than the least recently used page on the active list. This assumes that the hash function used spreads the load around evenly between the hash buckets, which a cryptographically strong hash function should do. (This is theory by RikvanRiel and could be wrong, please comment)
Mtest < Mhot + Mcold # limit too low for second level caches ? Shot * (Mhot + Mcold + Mtest) < Scold * Mhot # invariant kept by limiting Mtest Mtest < ((Scold - Shot) / Shot) * Mhot) - Mcold # simplification of the line above Shot + Scold < Mhot + Mcold # Shot and Scold divided by two when limit reached
Note that we need to prejudice against Mtest pages that are being faulted in again, and in favor of the incumbent active pages. It is common for some workloads to stream through a large amount of data and access all those pages equally frequently. The only way to speed up such workloads is to "unfairly" cache part of the data set and not cache the rest. This means that a faulted in Mtest page can only displace an Mhot/active page if we are sure that the inter reference distance for the Mtest page is smaller than the inter reference distance for the least active page on the active list. When in doubt, protect the current working set - the dynamic resizing of the lists should take care of the rest.
It may be an idea to add a variable to the struct page and to the non-resident page to remember a floating average of the inter-reference distance. Every time we see that the page got referenced, we divide the variable by 2 and add half of the current inter-reference distance to it. This way we might be able to better judge the value of pages whose current inter-reference distance is right at the boundary of hot and cold...