This page describes a new page replacement design by Rik van Riel, Lee Schermerhorn and others.

If you spot any conceptual problems or oversights in this design, or have questions about the details, please email or IRC RikvanRiel. Once nobody can find holes, the concepts will probably work :)

== The problem ==

In brief, the current page replacement mechanism in Linux (up to and including 2.6.24) has two problems:
 * Sometimes the kernel evicts the wrong pages, resulting in bad performance.
 * The kernel scans over pages that should not be evicted. This results in increased CPU use on large memory systems (several GB), and in catastrophic CPU use and lock contention on huge systems (>128GB of RAM).

More details on the problem space can be found on ProblemWorkloads and [:PageReplacementRequirements:page replacement requirements].

== High level overview ==

 * Evicts pages with minimal scanning. Systems with 1TB of RAM exist; the VM cannot afford to scan pages that should not be evicted.
 * Most page churn comes from reading large files. Evict those pages first.
 * Evict other pages when we do not have enough memory for readahead, or when we keep reading in the same file data over and over again.
 * Filesystem IO is much more efficient than swap IO. Evict only file cache pages when we can.
 * Use reference and refault data to balance the filesystem cache and anonymous (process) memory.
 * Optimize the normal case, but also optimize the worst case. The VM must be robust at all times, so the worst case has to work well too.

attachment:new26vm.gif

== File cache ==

On typical systems, there is many times more disk space than memory. Most file data is accessed infrequently, but in bulk (lots of streaming IO at a time). A small amount of file data is accessed over and over again, and needs to be protected from the streaming IO.

 * Contains disk and network filesystem data and metadata.
 * Used-once page replacement algorithm (a sketch follows this section):
  * New pages start on the inactive file list.
  * Pages that get used multiple times are promoted to the active file list.
  * Pages that got used once (or never) are reclaimed.
 * When the active file list is larger than the inactive file list, some active file pages are moved to the inactive file list.
 * Sizing the active list could be done better with data on recently evicted pages. However, the simple approach seems to work well enough.
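To make the used-once policy concrete, here is a minimal user space sketch, assuming a simple FIFO list type. All names (`lru_take`, `shrink_file_lists`, ...) are illustrative and are not the kernel's actual identifiers:

{{{
/*
 * Hypothetical sketch of the used-once file cache policy: new pages
 * enter the inactive list, a reference while inactive promotes a page
 * to the active list, pages reaching the end of the inactive list
 * unreferenced are evicted, and the active list is trimmed whenever
 * it outgrows the inactive list.
 */
#include <stdbool.h>
#include <stdio.h>

struct page {
    int id;
    bool referenced;            /* set when the page is accessed */
    struct page *next;
};

/* FIFO list: add new pages at the tail, take old pages from the head. */
struct lru { struct page *head, *tail; int nr; };

static void lru_add(struct lru *l, struct page *p)
{
    p->next = NULL;
    if (l->tail)
        l->tail->next = p;
    else
        l->head = p;
    l->tail = p;
    l->nr++;
}

static struct page *lru_take(struct lru *l)
{
    struct page *p = l->head;
    if (p) {
        l->head = p->next;
        if (!l->head)
            l->tail = NULL;
        l->nr--;
    }
    return p;
}

static struct lru active_file, inactive_file;

/* A newly read file page starts on the inactive list. */
static void add_to_file_cache(struct page *p)
{
    p->referenced = false;
    lru_add(&inactive_file, p);
}

/* Reclaim one file page, rebalancing the two lists first. */
static struct page *shrink_file_lists(void)
{
    struct page *p;

    /* Keep the active list no larger than the inactive list. */
    while (active_file.nr > inactive_file.nr) {
        p = lru_take(&active_file);
        p->referenced = false;  /* gets a second chance while inactive */
        lru_add(&inactive_file, p);
    }

    while ((p = lru_take(&inactive_file)) != NULL) {
        if (p->referenced) {
            /* Used more than once: promote to the active list. */
            p->referenced = false;
            lru_add(&active_file, p);
        } else {
            return p;           /* used once (or never): evict */
        }
    }
    return NULL;
}

int main(void)
{
    struct page pages[4] = { {1}, {2}, {3}, {4} };
    for (int i = 0; i < 4; i++)
        add_to_file_cache(&pages[i]);
    pages[1].referenced = true; /* page 2 is touched a second time */

    struct page *victim = shrink_file_lists();
    printf("evicted page %d\n", victim ? victim->id : -1); /* page 1 */
    return 0;
}
}}}

Note how a page needs a reference while it sits on the inactive list to reach the active list, so a single streaming pass over a large file can never displace the frequently used pages.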
== Anonymous memory ==

 * Contains memory from processes, shared memory segments and the swap backed filesystem (tmpfs).
 * A relatively small amount of data, comparable to the amount of memory in the system.
 * Anonymous memory tends to get accessed relatively frequently.
 * Need to minimize page scanning:
  * Typically consumes a majority of memory.
  * Only swapped out occasionally on many systems.
  * Every anonymous page starts out referenced.
 * Solution: SEQ replacement (see the first sketch below).
  * Always move pages from the end of the active anon list to the inactive anon list, regardless of whether or not the page was referenced.
  * A page that gets referenced while on the inactive anon list gets moved back to the active anon list.
  * A page that reaches the end of the inactive list without being referenced gets evicted.
  * This limits the maximum number of pages that need to be scanned to twice the size of the inactive list.
 * The inactive list should be a reasonable size (see the second sketch below):
  * Large enough to give pages a chance to be reused.
   * Needs some background aging, to avoid starting off with FIFO replacement.
  * Small enough to limit the maximum amount of work the VM needs to do.
  * Maybe 30% of anonymous pages on a 1GB system, but 1% of anonymous pages on a 1TB system?
 * Most of the time the VM evicts file pages only; anonymous pages get scanned rarely.
 * When there is no free swap space, do not scan anonymous pages at all.
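A minimal sketch of the SEQ policy, using the same illustrative FIFO list type as the file cache sketch above; the function names are assumptions, not kernel identifiers:

{{{
/*
 * Hypothetical sketch of SEQ replacement for the anonymous lists.
 * Pages move from the active to the inactive list unconditionally;
 * only on the inactive list does a reference count for anything.
 */
#include <stdbool.h>
#include <stddef.h>

struct page {
    bool referenced;
    struct page *next;
};

struct lru { struct page *head, *tail; int nr; };

static void lru_add(struct lru *l, struct page *p)
{
    p->next = NULL;
    if (l->tail)
        l->tail->next = p;
    else
        l->head = p;
    l->tail = p;
    l->nr++;
}

static struct page *lru_take(struct lru *l)
{
    struct page *p = l->head;
    if (p) {
        l->head = p->next;
        if (!l->head)
            l->tail = NULL;
        l->nr--;
    }
    return p;
}

/*
 * Refill the inactive list from the active list.  The referenced bit
 * on active pages is deliberately ignored (SEQ): this is what keeps
 * the scan rate bounded on workloads where every page looks used.
 */
static void refill_inactive_anon(struct lru *active, struct lru *inactive,
                                 int nr_to_move)
{
    struct page *p;

    while (nr_to_move-- > 0 && (p = lru_take(active)) != NULL) {
        p->referenced = false;  /* must prove itself while inactive */
        lru_add(inactive, p);
    }
}

/*
 * Scan the inactive list for a victim.  Referenced pages are rescued
 * back to the active list; the first unreferenced page is evicted.
 */
static struct page *shrink_inactive_anon(struct lru *active,
                                         struct lru *inactive)
{
    struct page *p;

    while ((p = lru_take(inactive)) != NULL) {
        if (!p->referenced)
            return p;           /* victim: swap this page out */
        p->referenced = false;
        lru_add(active, p);     /* referenced while inactive: promote */
    }
    return NULL;                /* caller should refill and retry */
}
}}}

The point of ignoring the referenced bit on the active list is the scan bound: each reclaimed page costs at most one pass over the inactive list plus the pages moved to refill it, i.e. roughly twice the inactive list size, no matter how large the active list grows.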
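The two sizing endpoints above (roughly 30% of anonymous memory on a 1GB machine, 1% on a 1TB machine) can be matched by letting the active:inactive ratio grow with the square root of memory size; mainline kernels later adopted a similar square root heuristic for the anonymous inactive ratio. The formula below is one possible interpolation, shown purely as an illustration:

{{{
/* Hypothetical sketch: sizing the inactive anon list so that it is
 * ~25-30% of anonymous memory on a 1GB machine but ~1% on 1TB. */
#include <math.h>
#include <stdio.h>

/* The active:inactive ratio grows with the square root of memory. */
static unsigned long inactive_anon_target(unsigned long anon_pages,
                                          unsigned long total_gb)
{
    unsigned long ratio = (unsigned long)sqrt(10.0 * total_gb);

    if (ratio < 1)
        ratio = 1;
    /* The inactive list gets 1 / (1 + ratio) of anonymous memory. */
    return anon_pages / (1 + ratio);
}

int main(void)
{
    /* With 4kB pages: 1GB = 262144 pages, 1TB = 268435456 pages. */
    printf("1GB: %.1f%% inactive\n",
           100.0 * inactive_anon_target(262144, 1) / 262144);
    printf("1TB: %.1f%% inactive\n",
           100.0 * inactive_anon_target(268435456, 1024) / 268435456);
    return 0;
}
}}}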
== Non reclaimable ==

 * Contains pages from mlocked process memory and mlocked shared memory segments, as well as ramfs and ramdisk pages.
 * When a page gets mlocked, it is added to the non reclaimable pool.
 * At munlock time, it is moved back to either the file or anonymous lists.
 * Non reclaimable pages are never scanned.

== Recently evicted / nonresident ==

 * Not implemented in the current 2.6 kernel, since the VM seems to work well without it.
 * Would only be used for additional feedback in placing faulted-in pages and sizing the lists.
 * Would not contain actual pages, only the information that a page was recently evicted.
 * Can be implemented with very low space overhead; the information can live in the page cache radix tree.

== OOM killer ==

 * Currently, Linux has a hard time figuring out exactly when it has run out of memory.
 * Knowing exactly how much memory is filesystem backed and how much is swap/ram backed will help us.
 * When swap is full and the amount of filesystem backed memory is really low, memory is genuinely full and we need to OOM kill something.

== Merge plan ==

 * The split LRU VM got merged in 2.6.28, with minor fixes and refinements ongoing.
 * Split up into a patch series (done, will need some reshuffling):
  * LRU array code.
  * Non reclaimable pool.
  * Split into file/anon with trivial balancing code.
  * Non-resident (recently evicted) tracking.
  * SEQ replacement for anonymous pages.
  * Clockpro-like replacement for file pages.
  * Make the OOM killer faster, using the test above.
 * Andrew and Linus can apply the least controversial bits first, with no need to get everything applied at once.
 * Improve the VM step by step, fixing more corner cases at each merge.
 * Makes incremental testing by a large community possible.

= Old page contents =

== Design tenets ==

 * File IO is fundamentally more efficient than swap IO. This has a number of reasons:
  * Pages are swapped out in an LRU-like fashion, while file content usually is already on disk; we can drop a clean file page without doing any IO.
  * Multiple rounds of malloc and free can mix up application memory, while file contents are usually related, so we can do efficient readahead.
  * Swap administration in Linux is very simple (which also means low overhead).
 * We have to deal with systems where swap is insignificantly small, e.g. a database server with 128GB RAM, 2GB swap and an 80GB shared memory segment.
  * We cannot waste our time scanning 100GB of anonymous memory to get at the 8GB of freeable page cache!
  * We need separate pageout selection lists for anonymous and file backed pages.
 * Belady's MIN "algorithm" needs to be modified. The primary goal of a page replacement algorithm is not to minimize the number of page cache and anonymous memory misses, but to minimize the number of IO operations required.
  * If we keep some statistics, we can measure exactly how much more efficient file IO is than swap IO for the workload the system is currently running.
  * Using those statistics, in combination with other information, we can efficiently size the "LRU" pools for anonymous and file backed memory.
  * If there is no swap space, we do not try to shrink the anonymous memory pool.
 * Since the basic split is on "IO cost", memory mapped pages (except shared memory segments) go into the file backed pool.
 * We need a scan resistant algorithm (see AdvancedPageReplacement) to select which pages to free.

== Design details ==

 * One set of pageout selection lists (LRU, CLOCK-Pro, ...) for anonymous pages and another one for file backed pages.
 * We balance the size of the anonymous and file backed pools according to these criteria:
  * The IO cost of replacing and refilling pages from each pool.
  * How actively used the pages in each pool are.
  * I.e. if we find that page cache IO is twice as expensive as swap IO under the current workload, the use of the pages in the file backed pool would have to be twice as high as that of the anonymous pool in order for anonymous pages to be evicted.
 * In order to not have IO mess up the LRU (or CLOCK-Pro, ...) order in each pool, we could have a separate ''out'' list for pages that are about to be evicted. This would also take care of potential "this disk is congested by file IO, so we never do swapout" bugs.

== IO cost ==

The cost we want to measure is the number of disk seeks required for evicting and bringing back in each type of page (anonymous vs file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, i.e. multiple useful pages per disk seek. A sketch combining this bookkeeping with the pool sizing decision follows the next section.

 * Cost of swapping out anonymous pages:
  * Average number of pages clustered in a single swapout IO request, or at least the average number of contiguous allocations.
 * Cost of swapping in anonymous pages:
  * Average number of pages read in by swapin clustering.
  * Fraction of the swapped in pages that get used, vs. the ones that get evicted unused.
 * Cost of evicting file backed pages:
  * Fraction of file backed pages that need to be written out; chances are most will be clean already.
  * No IO clustering is done yet for file writes through the pageout path. That will need to be fixed.
 * Cost of rereading file backed pages:
  * Average number of pages read in at once. In many workloads the readahead code will make this number high.
  * Fraction of the read pages that actually get used, vs. the ones that get evicted unused.

The IO cost of each kind of page is the sum of the "in" and the "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics.

Because most anonymous pages get instantiated "out of nowhere", the initial IO cost for anonymous pages will have to be initialized to something reasonable. Probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max.

== Pool sizing ==

When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:
 * The size of each pool.
 * The rate at which we scan the pages in each pool.
 * The fraction of pages that were referenced when they reached the end of the pageout list(s) in each pool.
  * If file backed pages do not get referenced after their first use, the pool is not grown!
 * The number and distance of refaults (with the /proc/refaults infrastructure) incurred for each pool.
 * If there is no free swap space left, don't bother trying to shrink the anonymous memory pool.

The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools. The per page activity in each pool can be calculated as ''(fraction referenced * scan rate) / pool size'' (see the sketch below).
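To tie the two previous sections together, here is a hedged sketch of how the measured IO costs and the per page pressure could be combined into a balancing decision. The structure fields, the 7/8 floating-average weight and the sample numbers are all assumptions for illustration; only the (pressure * IO cost) comparison itself comes from the text above:

{{{
/*
 * Hypothetical sketch of pool balancing: estimate the per page IO
 * cost of each pool as a floating average of out + in costs, compute
 * the per page pressure (fraction referenced * scan rate / pool
 * size), and shrink the pool with the lower pressure * IO cost
 * product.  Not kernel code; all numbers are made up.
 */
#include <stdio.h>

struct pool_stats {
    const char *name;
    double pages;           /* pool size in pages */
    double scan_rate;       /* pages scanned per second */
    double referenced_frac; /* fraction found referenced at the tail */
    double cost_out;        /* avg seeks per page evicted (floating avg) */
    double cost_in;         /* avg seeks per page brought back in */
};

/* Floating average: new samples get 1/8 weight, the old estimate 7/8. */
static void update_cost(double *avg, double sample)
{
    *avg = (7.0 * *avg + sample) / 8.0;
}

/* Per page pressure: (fraction referenced * scan rate) / pool size. */
static double pressure(const struct pool_stats *p)
{
    return p->referenced_frac * p->scan_rate / p->pages;
}

/* Total IO cost of evicting and refaulting a page from this pool. */
static double io_cost(const struct pool_stats *p)
{
    return p->cost_out + p->cost_in;
}

int main(void)
{
    struct pool_stats file = { "file", 800000, 5000, 0.10, 0.05, 0.2 };
    struct pool_stats anon = { "anon", 200000,  500, 0.60, 0.50, 1.0 };

    /* E.g. a swapout cluster of 8 pages costs ~1/8 seek per page. */
    update_cost(&anon.cost_out, 1.0 / 8.0);

    double pf = pressure(&file) * io_cost(&file);
    double pa = pressure(&anon) * io_cost(&anon);

    /* Shrink the pool whose pressure * IO cost product is lower. */
    printf("shrink the %s pool (file %.2e vs anon %.2e)\n",
           pf < pa ? file.name : anon.name, pf, pa);
    return 0;
}
}}}

Evicting from the pool with the lower product raises that pool's pressure, pushing the two products toward equality, which is the stated balancing goal.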
== Pageout selection ==

Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later, leaving the current Linux pageout algorithm in place at first and moving to a more scan resistant algorithm in the future - if at all.

= Incremental implementation =

To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself. Things can be implemented in this order:

 1. Measure page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete in /proc/vmstat. Maybe a decaying average or two will need to be added?
 1. Split the pageout queues into anonymous vs. file backed. (Assume swap IO is 4x as expensive as file IO?)
 1. Actually measure the IO cost of each, and use that.
 1. Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads?
 1. Experiment with other replacement algorithms - do any of them give a significant benefit on some workloads?

----
CategoryAdvancedPageReplacement