This page describes a new page replacement design by Rik van Riel, Lee Schermerhorn and others.
If you spot any conceptual problems or oversights in this design, or have questions on the details, please email or IRC RikvanRiel. Once nobody can find holes, the concepts will probably work
The problem
In brief, the current page replacement mechanism in Linux (up to and including 2.6.24) has two problems:
Sometimes the kernel evicts the wrong pages, resulting in bad performance.
The kernel scans over pages that should not be evicted. This results in increased CPU use on large memory systems (several GB), but results in catastrophic CPU use and lock contention on huge systems (>128GB of RAM).
More details on the problem space can be found on ProblemWorkloads and page replacement requirements.
High level overview
Evicts pages with minimal scanning. Systems with 1TB RAM exist, the VM cannot afford to scan pages that should not be evicted.
Most page churn comes from reading large files. Evict those pages first.
Evict other pages when we do not have enough memory for readahead or we keep reading in the same file data over and over again.
Filesystem IO is much more efficient than swap IO. Only evict file cache if we can.
Use reference and refault data to balance filesystem cache and anonymous (process) memory.
Optimize the normal case; also optimize the worst case. The VM must be robust at all times, so the worst case has to work well too.
File cache
Page replacement very similar to the current 2.6 kernel algorithm.
Sizing the active list could be done better with data on recently evicted non-resident pages, which we want for other purposes anyway.
Anonymous memory
Contains memory from processes, shared memory segments and the swap backed filesystem (tmpfs).
Most of the time we evict file pages, the anonymous pages get scanned rarely.
When there is no free swap space, do not scan.
Non reclaimable
Contains mlocked processes and mlocked shared memory segments, as well as ramfs and ramdisk pages.
When a page gets mlocked, it is added to the non reclaimable pool.
At munlock time, it is moved to either the file or anonymous lists.
Non reclaimable pages are never scanned.
Recently evicted / nonresident
Does not contain actual pages, only information that the page was recently evicted.
Implemented with very low space overhead. Can live in the page cache radix tree.
OOM killer
Currently, Linux has a hard time figuring out exactly when it has run out of memory.
Knowing exactly how much memory is filesystem backed and how much is swap/ram backed will help us.
Merge plan
Split up into a patch series (done, will need some reshuffling).
Andrew and Linus can apply the least controversial bits first, with no need to get everything applied at once.
Old page contents
Design tenets
File IO is fundamentally more efficient than swap IO. This has a number of reasons:
Pages are swapped out in an LRU-like fashion. File content usually is already on disk; we can drop the page without IO.
Multiple rounds of malloc and free can mix up application memory. File contents are usually related, so we can do efficient readahead.
Swap administration in Linux is very simple (also low overhead).
We have to deal with systems where swap is insignificantly small, eg. a database server with 128GB RAM, 2GB swap and an 80GB shared memory segment.
Belady's MIN "algorithm" needs to be modified. A page replacement algorithm does not have as its primary goal to minimize the number of page cache and anonymous memory misses. Instead, the goal is to minimize the number of IO operations required.
If we keep some statistics, we can measure exactly how much more efficient file IO and swap IO are, for the workload that the system is currently running.
Using those statistics, in combination with other information, we can efficiently size the "LRU" pools for anonymous and file backed memory.
If there is no swap space we do not try to shrink the anonymous memory pool.
Since the basic split is on "IO cost", memory mapped pages (except shared memory segments) go into the file backed pool.
We need a scan resistant algorithm (see AdvancedPageReplacement) to select which pages to free.
Design details
One set of pageout selection lists (LRU, CLOCK-Pro, ...) for anonymous pages and another one for file backed pages.
We balance the size of the anonymous and file backed pools according to these criteria:
In order to not have IO mess up the LRU (or CLOCK-Pro, ...) order in each pool, we could have a separate out list for pages that are about to be evicted. This would also take care of potential "this disk is congested by file IO, so we never do swapout" bugs.
IO cost
The cost we want to measure is the number of disk seeks required for evicting and bringing back in each type of page (anonymous vs file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, ie. multiple useful pages per disk seek.
The cost of IO of each kind of pages is the sum of the "in" and the "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics.
Because most anonymous pages get instantiated "out of nowhere", the initial IO cost for anonymous pages will have to be initialized to something reasonable. Probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max.
Pool sizing
When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:
The fraction of pages that were referenced when they reached the end of the pageout list(s) in each pool.
The number and distance of refaults (with /proc/refaults infrastructure) incurred for each pool.
If there is no more free swap space left, don't bother trying to shrink the anonymous memory pool.
The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools.
The per page activity in each pool can be calculated: (fraction referenced * scan rate) / pool size
Pageout selection
Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later, leaving the current Linux pageout algorithm in place at first and only moving to a more scan resistant algorithm later in the future - if at all.
Incremental implementation
To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself.
Things can be implemented in this order:
Measure the page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete for /proc/vmstat. Maybe a decaying average or two will need to be added?
Split anonymous vs. file backed pageout queues. (assume swap IO is 4x as expensive as file IO?)
Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads?
Experiment with other replacement algorithms - do any of them give us a significant benefit on some workloads?
CategoryAdvancedPageReplacement

