This page describes a new page replacement design by Rik van Riel. This design should meet the most important [:PageReplacementRequirements:page replacement requirements] as well as fix the VM behaviour in certain ProblemWorkloads.

== Design tenets ==

 * File IO is fundamentally more efficient than swap IO, for a number of reasons:
  * Pages are swapped out in an LRU-like fashion. File content usually is already on disk, so we can often drop the page without doing any IO.
  * Multiple rounds of malloc and free can mix up application memory. File contents are usually related, so we can do efficient readahead.
  * Swap administration in Linux is very simple (which also keeps the overhead low).
 * We have to deal with systems where swap is tiny compared to RAM, e.g. a database server with 128GB RAM, 2GB of swap and an 80GB shared memory segment.
  * We cannot waste our time scanning 100MB of anonymous memory to get at the 8GB of freeable page cache!
 * We need separate pageout selection lists for anonymous and file backed pages.
 * Belady's MIN "algorithm" needs to be modified. The primary goal of a page replacement algorithm is not to minimize the number of page cache and anonymous memory misses, but to minimize the number of IO operations required.
  * If we keep some statistics, we can measure exactly how much more efficient file IO is than swap IO for the workload the system is currently running.
  * Using those statistics, in combination with other information, we can efficiently size the "LRU" pools for anonymous and file backed memory.
  * If there is no swap space, we do not try to shrink the anonymous memory pool.
 * Since the basic split is on "IO cost", memory mapped pages (except shared memory segments) go into the file backed pool.
 * We need a scan resistant algorithm (see AdvancedPageReplacement) to select which pages to free.

== Design details ==

 * One set of pageout selection lists (LRU, CLOCK-Pro, ...) for anonymous pages and another one for file backed pages.
 * We balance the size of the anonymous and file backed pools according to these criteria:
  * The IO cost to replace and refill pages from each pool.
  * How actively used the pages in each pool are.
  * I.e. if we find that page cache IO is twice as expensive as swap IO under the current workload, the pages in the file backed pool would have to be used twice as heavily as those in the anonymous pool before anonymous pages are evicted.
 * In order not to let IO mess up the LRU (or CLOCK-Pro, ...) order in each pool, we could have a separate ''out'' list for pages that are about to be evicted. This would also take care of potential "this disk is congested by file IO, so we never do swapout" bugs.

== Pool sizing ==

When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:

 * The size of each pool.
 * The rate at which we scan the pages in each pool.
 * The fraction of pages that were referenced by the time they reached the end of the pageout list(s) in each pool.
  * If file backed pages do not get referenced again after their first use, the pool is not grown!
 * The number and distance of refaults (with the /proc/refaults infrastructure) incurred by each pool.
 * If there is no free swap space left, don't bother trying to shrink the anonymous memory pool.

The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools.
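As a rough illustration of that balancing rule, here is a minimal userspace C sketch (not kernel code). The pool_stats structure, the example numbers and the exact pressure estimate (tail scan rate times referenced ratio) are assumptions made up for this page; a real implementation would feed in the scanning and refault statistics described above.

{{{
/*
 * Illustrative sketch only: balance two pageout pools by comparing
 * (pressure * IO cost).  The structure, field names and the pressure
 * formula are assumptions, not the actual kernel implementation.
 */
#include <stdio.h>

struct pool_stats {
	const char *name;
	double size_pages;	/* pages currently in this pool */
	double scanned;		/* pages recently scanned at the tail of the list */
	double referenced;	/* how many of those had been referenced */
	double io_cost;		/* measured relative cost of refilling a page */
};

/*
 * Per-page pressure: how hard we are pushing on this pool, weighted by
 * how actively its tail pages are still being used.
 */
static double pool_pressure(const struct pool_stats *p)
{
	double scan_rate = p->scanned / p->size_pages;
	double referenced_ratio = p->referenced / p->scanned;

	return scan_rate * referenced_ratio;
}

int main(void)
{
	/* hypothetical numbers for a file-cache-heavy workload */
	struct pool_stats file = { "file", 2000000, 40000, 8000, 1.0 };
	struct pool_stats anon = { "anon",  500000,  2000, 1600, 4.0 };

	double file_weight = pool_pressure(&file) * file.io_cost;
	double anon_weight = pool_pressure(&anon) * anon.io_cost;

	printf("%s: pressure * IO cost = %.4f\n", file.name, file_weight);
	printf("%s: pressure * IO cost = %.4f\n", anon.name, anon_weight);

	/* put more scanning pressure on the pool with the lower weighted value */
	if (file_weight < anon_weight)
		printf("scan the file backed pool harder (let the anonymous pool grow)\n");
	else
		printf("scan the anonymous pool harder (let the file backed pool grow)\n");

	return 0;
}
}}}

With the made-up numbers above, the pages at the tail of the anonymous list are referenced far more often (80% vs 20%) and cost four times as much to bring back in, so (pressure * IO cost) comes out higher for the anonymous pool and the sketch shifts scanning towards the file backed pool.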