This page describes a new page replacement design by Rik van Riel. This design should meet the most important [:PageReplacementRequirements:page replacement requirements] as well as fix the VM behaviour in certain ProblemWorkloads. If you spot any conceptual problems or oversights in this design, or have questions on the details, please email or IRC RikvanRiel. Once nobody can find holes, the concepts will probably work :) == Design tenets == * File IO is fundamentally more efficient than swap IO. This has a number of reasons: * Pages are swapped out in an LRU-like fashion. File content usually is already on disk; we can drop the page without IO. * Multiple rounds of malloc and free can mix up application memory. File contents are usually related, so we can do efficient readahead. * Swap administration in Linux is very simple (also low overhead). * We have to deal with systems where swap is insignificantly small, eg. a database server with 128GB RAM, 2GB swap and an 80GB shared memory segment. * We cannot waste our time scanning 100GB of anonymous memory, to get at the 8GB freeable page cache! * We need separate pageout selection lists for anonymous and file backed pages. * Belady's MIN "algorithm" needs to be modified. A page replacement algorithm does not have as its primary goal to minimize the number of page cache and anonymous memory misses. Instead, the goal is to minimize the number of IO operations required. * If we keep some statistics, we can measure exactly how much more efficient file IO and swap IO are, for the workload that the system is currently running. * Using those statistics, in combination with other information, we can efficiently size the "LRU" pools for anonymous and file backed memory. * If there is no swap space we do not try to shrink the anonymous memory pool. * Since the basic split is on "IO cost", memory mapped pages (except shared memory segments) go into the file backed pool. * We need a scan resistant algorithm (see AdvancedPageReplacement) to select which pages to free. == Design details == * One set of pageout selection lists (LRU, CLOCK-Pro, ...) for anonymous pages and another one for file backed pages. * We balance the size of the anonymous and file backed pools according to these criteria: * IO cost to replace and refill pages from each pool. * How actively used are the pages in each pool? * Ie. if we find that page cache IO is twice as expensive as swap IO under the current workload, the use of the pages in the file backed pool would have to be twice as high as that in the anonymous pool, in order for anonymous pages to be evicted. * In order to not have IO mess up the LRU (or CLOCK-Pro, ...) order in each pool, we could have a separate ''out'' list for pages that are about to be evicted. This would also take care of potential "this disk is congested by file IO, so we never do swapout" bugs. == IO cost == The cost we want to measure is the number of disk seeks required for evicting and bringing back in each type of page (anonymous vs file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, ie. multiple useful pages per disk seek. * Cost of swapping out anonymous pages: * Average number of pages clustered in a single swapout IO request, or at least the average number of contiguous allocations. * Cost of swapping in anonymous pages: * Average number of pages read in by swapin clustering. * Fraction of the swapped in pages used, vs. the ones that get evicted unused. * Cost of evicting file backed pages: * Fraction of file backed pages that need to be written out. Chances are most will be clean already. * No IO clustering is done yet for file writes through the pageout path. That will need to be fixed. * Cost of rereading file backed pages: * Average number of pages read in at once. In many workloads the readahead code will make this number high. * Fraction of the read pages that get actually used, vs. the ones that get evicted unused. The cost of IO of each kind of pages is the sum of the "in" and the "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics. Because most anonymous pages get instantiated "out of nowhere", the initial IO cost for anonymous pages will have to be initialized to something reasonable. Probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max. == Pool sizing == When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account: * The size of each pool. * The rate at which we scan the pages in each pool. * The fraction of pages that were referenced when they reached the end of the pageout list(s) in each pool. * If file backed pages do not get referenced after their first time, the pool is not grown! * The number and distance of refaults (with /proc/refaults infrastructure) incurred for each pool. * If there is no more free swap space left, don't bother trying to shrink the anonymous memory pool. The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools. The per page activity in each pool can be calculated: ''(fraction referenced * scan rate) / pool size'' == Pageout selection == Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later, leaving the current Linux pageout algorithm in place at first and only moving to a more scan resistant algorithm later in the future - if at all. = Incremental implementation = To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself. Things can be implemented in this order: 1. Measure the page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete for /proc/vmstat. Maybe a decaying average or two will need to be added? 1. Split anonymous vs. file backed pageout queues. (assume swap IO is 4x as expensive as file IO?) 1. Actually measure the IO cost of each, and use that. 1. Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads? 1. Experiment with other replacement algorithms - do any of them give us a significant benefit on some workloads? ---- CategoryAdvancedPageReplacement