This page describes a new page replacement design by Rik van Riel, Lee Schermerhorn and others. This design should meet the most important [:PageReplacementRequirements:page replacement requirements] as well as fix the VM behaviour in certain ProblemWorkloads. If you spot any conceptual problems or oversights in this design, or have questions on the details, please email or IRC RikvanRiel. Once nobody can find holes in the concepts, they will probably work :)

== High level overview ==

 * Evict pages with minimal scanning. Systems with 1TB of RAM exist; the VM cannot afford to scan pages that should not be evicted.
 * Most page churn comes from reading large files. Evict those pages first.
 * Evict other pages only when we do not have enough memory for readahead, or when we keep reading in the same file data over and over again.
 * Filesystem IO is much more efficient than swap IO. Evict file cache rather than anonymous memory whenever we can.
 * Use reference and refault data to balance the filesystem cache and anonymous (process) memory.

attachment:new26vm.gif

== File cache ==

 * Contains disk and network filesystem data and metadata.
 * Page replacement is very similar to the current 2.6 kernel algorithm:
  * New pages start on the inactive list.
  * Pages that get used multiple times are promoted to the active list.
  * Pages that got used once (or never) are reclaimed.
 * Sizing the active list could be done better with data on recently evicted non-resident pages, which we want for other purposes anyway.

== Anonymous memory ==

 * Contains memory from processes, shared memory segments and the swap backed filesystem (tmpfs).
 * Need to minimize page scanning:
  * Typically consumes a majority of memory.
  * Only swapped out occasionally on many systems.
  * Every anonymous page starts out referenced.
 * Solution: SEQ replacement.
  * Always move pages from the end of the active list to the inactive list and clear the referenced bit. Referenced pages get deactivated too.
  * This limits the maximum number of pages that need to be scanned to twice the size of the inactive list.
 * The inactive list should be a reasonable size:
  * Large enough to give pages a chance to be reused.
  * Needs some background aging, to avoid starting off with FIFO replacement.
  * Small enough to limit the maximum amount of work the VM needs to do.
  * Maybe 30% of anonymous pages on a 1GB system, but 1% of anonymous pages on a 1TB system?
 * Most of the time we evict file pages; the anonymous pages get scanned rarely.
 * When there is no free swap space, do not scan at all.

== Non reclaimable ==

 * Contains mlocked process memory and locked shared memory segments, as well as ramfs and ramdisk pages.
 * When a page gets mlocked, it is added to the non reclaimable pool.
 * At munlock time, it is moved to either the file or anonymous lists.
 * Non reclaimable pages are never scanned.

== Recently evicted / nonresident ==

 * Does not contain actual pages, only the information that a page was recently evicted.
 * Implemented with very low space overhead; the information can live in the page cache radix tree. A sketch of the idea follows below.
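To illustrate how recently evicted pages can be remembered without keeping the pages themselves around, here is a minimal, self-contained C sketch. It assumes a plain fixed-size hash table keyed on a hash of the (mapping, offset) page identity; the names nonres_remember() and nonres_refault_distance() are made up for illustration and are not a kernel API.

{{{
#include <stdint.h>

/*
 * Illustrative sketch only, not the kernel implementation.
 * Remember recently evicted pages in a small hash table, storing
 * a hash of the page identity plus the global eviction "clock" at
 * eviction time, so a later refault can estimate how long ago the
 * page was evicted.
 */

#define NONRES_BUCKETS (1u << 16)

struct nonres_entry {
	uint32_t cookie;	/* hash of (mapping, offset) */
	uint32_t evict_clock;	/* eviction counter at evict time */
};

static struct nonres_entry nonres_table[NONRES_BUCKETS];
static uint32_t evictions;	/* incremented on every eviction */

static uint32_t page_cookie(uint64_t mapping, uint64_t offset)
{
	uint64_t h = mapping * 0x9e3779b97f4a7c15ull ^ offset;
	return (uint32_t)(h >> 32) | 1;	/* never 0, so 0 means empty */
}

/* Called when a page is evicted from the cache. */
void nonres_remember(uint64_t mapping, uint64_t offset)
{
	uint32_t cookie = page_cookie(mapping, offset);
	struct nonres_entry *e = &nonres_table[cookie % NONRES_BUCKETS];

	e->cookie = cookie;	/* may overwrite an older entry */
	e->evict_clock = ++evictions;
}

/*
 * Called on a cache miss.  Returns the approximate number of
 * evictions since this page was pushed out (the refault distance),
 * or -1 if the page was not recently evicted or its slot was reused.
 */
long nonres_refault_distance(uint64_t mapping, uint64_t offset)
{
	uint32_t cookie = page_cookie(mapping, offset);
	struct nonres_entry *e = &nonres_table[cookie % NONRES_BUCKETS];

	if (e->cookie != cookie)
		return -1;
	return (long)(uint32_t)(evictions - e->evict_clock);
}
}}}

A refault that hits a still-valid entry yields the approximate number of evictions since the page was pushed out; a small distance suggests the page was evicted too early and the list it came from may need to grow. In the design above, the same few bytes of information would be packed into otherwise empty page cache radix tree slots rather than a separate table.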
== Merge plan ==

 * Split up into a patch series:
  * LRU array code.
  * Non reclaimable pool.
  * Split into file/anon with trivial balancing code.
  * Non-resident (recently evicted) tracking.
  * SEQ replacement for anonymous pages.
  * CLOCK-Pro like replacement for file pages.
 * Andrew and Linus can apply the least controversial bits first, with no need to get everything applied at once.
 * Improve the VM step by step, fixing more corner cases at each merge.
 * This makes incremental testing by a large community possible.

= Old page contents =

== Design tenets ==

 * File IO is fundamentally more efficient than swap IO, for a number of reasons:
  * Pages are swapped out in an LRU-like fashion, while file content usually is already on disk; clean file pages can be dropped without any IO.
  * Multiple rounds of malloc and free can mix up application memory, while file contents are usually related, so we can do efficient readahead on files.
  * Swap administration in Linux is very simple (which also means low overhead).
 * We have to deal with systems where swap is insignificantly small, e.g. a database server with 128GB RAM, 2GB swap and an 80GB shared memory segment.
  * We cannot waste our time scanning 100GB of anonymous memory to get at the 8GB of freeable page cache!
  * We need separate pageout selection lists for anonymous and file backed pages.
 * Belady's MIN "algorithm" needs to be modified: the primary goal of a page replacement algorithm is not to minimize the number of page cache and anonymous memory misses, but to minimize the number of IO operations required.
  * If we keep some statistics, we can measure exactly how much more efficient file IO is than swap IO, for the workload that the system is currently running.
  * Using those statistics, in combination with other information, we can efficiently size the "LRU" pools for anonymous and file backed memory.
  * If there is no swap space, we do not try to shrink the anonymous memory pool.
 * Since the basic split is on "IO cost", memory mapped pages (except shared memory segments) go into the file backed pool.
 * We need a scan resistant algorithm (see AdvancedPageReplacement) to select which pages to free.

== Design details ==

 * One set of pageout selection lists (LRU, CLOCK-Pro, ...) for anonymous pages and another one for file backed pages.
 * We balance the size of the anonymous and file backed pools according to these criteria:
  * The IO cost to replace and refill pages from each pool.
  * How actively used the pages in each pool are.
  * I.e. if we find that swap IO is twice as expensive as file IO under the current workload, the pages in the file backed pool would have to be used twice as actively as those in the anonymous pool, in order for anonymous pages to be evicted.
 * In order to not have IO mess up the LRU (or CLOCK-Pro, ...) order in each pool, we could have a separate ''out'' list for pages that are about to be evicted. This would also take care of potential "this disk is congested by file IO, so we never do swapout" bugs.

== IO cost ==

The cost we want to measure is the number of disk seeks required for evicting and bringing back in each type of page (anonymous vs file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, i.e. multiple useful pages per disk seek.

 * Cost of swapping out anonymous pages:
  * Average number of pages clustered in a single swapout IO request, or at least the average number of contiguous allocations.
 * Cost of swapping in anonymous pages:
  * Average number of pages read in by swapin clustering.
  * Fraction of the swapped in pages that get used, vs. the ones that get evicted unused.
 * Cost of evicting file backed pages:
  * Fraction of file backed pages that need to be written out. Chances are most will be clean already.
  * No IO clustering is done yet for file writes through the pageout path. That will need to be fixed.
 * Cost of rereading file backed pages:
  * Average number of pages read in at once. In many workloads the readahead code will make this number high.
  * Fraction of the read pages that actually get used, vs. the ones that get evicted unused.

The IO cost of each kind of page is the sum of the "in" and the "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics. Because most anonymous pages get instantiated "out of nowhere", the IO cost for anonymous pages will have to be initialized to something reasonable; probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max.

== Pool sizing ==

When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:

 * The size of each pool.
 * The rate at which we scan the pages in each pool.
 * The fraction of pages that were referenced when they reached the end of the pageout list(s) in each pool.
  * If file backed pages do not get referenced after their first reference, the pool is not grown!
 * The number and distance of refaults (with the /proc/refaults infrastructure) incurred by each pool.
 * If there is no free swap space left, don't bother trying to shrink the anonymous memory pool.

The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools. The per-page activity in each pool can be calculated as ''(fraction referenced * scan rate) / pool size''; the sketch below shows how this balancing decision could look.
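As a concrete illustration of the balancing rule above, here is a minimal, self-contained C sketch. The structure and function names (pool_stats, pool_pressure(), shrink_anon()) and the example numbers are made up for illustration; this is only the arithmetic, not a proposed kernel implementation.

{{{
#include <stdio.h>

/*
 * Illustrative sketch only: balance the anonymous and file backed
 * pools by comparing (per-page pressure * IO cost), where pressure
 * is (fraction referenced * scan rate) / pool size.
 */
struct pool_stats {
	double pages;		/* pool size in pages */
	double scan_rate;	/* pages scanned per second */
	double referenced;	/* fraction of scanned pages referenced, 0..1 */
	double io_cost;		/* measured seeks per evicted + refilled page */
	double free_swap;	/* only meaningful for the anon pool */
};

static double pool_pressure(const struct pool_stats *p)
{
	if (p->pages == 0)
		return 0.0;
	return (p->referenced * p->scan_rate) / p->pages;
}

/* Returns 1 if the anon pool should be shrunk, 0 for the file pool. */
int shrink_anon(const struct pool_stats *anon, const struct pool_stats *file)
{
	/* Never try to shrink anonymous memory without free swap space. */
	if (anon->free_swap == 0)
		return 0;

	/* Evict from the pool whose weighted pressure is lower. */
	return pool_pressure(anon) * anon->io_cost <
	       pool_pressure(file) * file->io_cost;
}

int main(void)
{
	/* 100GB of anon (swap IO assumed 4x as expensive), 8GB of cache. */
	struct pool_stats anon = { 25e6, 10.0, 0.9, 4.0, 5e5 };
	struct pool_stats file = { 2e6, 500.0, 0.3, 1.0, 0 };

	printf("shrink the %s pool\n", shrink_anon(&anon, &file) ? "anon" : "file");
	return 0;
}
}}}

Evicting from the pool with the lower weighted pressure relieves the harder-pressed pool, which drives the two (pressure * IO cost) products toward equality over time.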
== Pageout selection ==

Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later: the current Linux pageout algorithm can be left in place at first, with a move to a more scan resistant algorithm in the future - if at all.

= Incremental implementation =

To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself. Things can be implemented in this order:

 1. Measure the page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete in /proc/vmstat. Maybe a decaying average or two will need to be added?
 1. Split the pageout queues into anonymous vs. file backed. (Assume swap IO is 4x as expensive as file IO?)
 1. Actually measure the IO cost of each, and use that.
 1. Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads?
 1. Experiment with other replacement algorithms - do any of them give us a significant benefit on some workloads?

----
CategoryAdvancedPageReplacement