Unfortunately the requirements for what a page replacement system should do are not clear. The best we can do is define the desired behavior by what it should NOT do, and try to go from there. If you have a workload and system configuration that totally break the Linux 2.6 memory management, please add it to the list.
Heavy anonymous memory workload
Synopsis: When memory is dominated by anonymous pages (JVM, big database) on large systems (16GB or larger), the Linux VM can end up spending many minutes scanning pages before swapping out the first page. The VM can take hours to recover from a memory crunch.
Details: Anonymous pages start their life cycle on the active list with the referenced bit set. As a consequence, the system ends up with a very large active list (millions of pages) whose pages all have the recently-accessed bit set, while the inactive list is tiny, typically around 1000 pages. The first few processes to dive into shrink_zone() will scan thousands of active pages, but will not even try to free a single inactive page:
	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
	nr_active = zone->nr_scan_active;
	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;
	else
		nr_active = 0;

	zone->nr_scan_inactive +=
		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
	nr_inactive = zone->nr_scan_inactive;
	if (nr_inactive >= sc->swap_cluster_max)
		zone->nr_scan_inactive = 0;
	else
		nr_inactive = 0;
Because the first processes to go into try_to_free_pages() make no progress in shrink_zone(), they usually get joined by pretty much every other process in the system. Now you have hundreds of processes all scanning active pages, but not deactivating them yet because there are millions and they all have the accessed bit set. Lock contention on the zone->lru_lock and various other locks can slow things to a crawl.
Cases have been seen where it took over 10 minutes for the first process to call shrink_inactive_list()! By that time, the scanning priority of all processes has been reduced to a much smaller value, and every process in the system will try to reclaim a fairly large number of pages.
Pageout won't stop
Details: Sometimes pageout activity continues even when half of memory (or more) has already been freed, simply because of the large number of processes that are still active in the pageout code! In some workloads, it can take a few hours for the system to run normally again...
We will need to find a way to abort pageout activities when enough memory is free. On the other hand, higher-order allocations may want to continue to free pages for a while more. Maybe we need to pass the allocation order into try_to_free_pages() and call zone_watermark_ok() to test. Maybe Andrea Arcangeli's patches from June 8th 2007 are enough to fix this problem, especially "[PATCH 15 of 16] limit reclaim if enough pages have been freed".
Kswapd cannot keep up
During medium loads on the VM, kswapd will go to sleep in congestion_wait() even if many of the pages it ran into were freeable and no IO was started. This can easily happen if the inactive list is small and kswapd simply does not scan enough (or any!) inactive pages in shrink_zone(), focusing its efforts exclusively on moving pages from the active list to the inactive list.
	/*
	 * OK, kswapd is getting into trouble.  Take a nap, then take
	 * another pass across the zones.
	 */
	if (total_scanned && priority < DEF_PRIORITY - 2)
		congestion_wait(WRITE, HZ/10);
The above code is buggy because kswapd goes to sleep even when it did not get into any trouble. Going to sleep here if a lot of IO has been queued is a good idea, but going to sleep just because kswapd has been working the active list instead of the inactive list causes trouble.
Lock contention
The direct reclaim path can cause lock contention in the VM, when multiple processes dive into the pageout code simultaneously. Every time one locking issue has been fixed, contention happens on the next lock in the series.
The big question is whether we want to fix these lock contention problems by changing the data structures, or if we want to avoid the problem by having the direct reclaim path do less work. For example, having kswapd be the only process that dives into shrink_slab() might get any lock contention in that area out of the way.
The wrong pages get evicted
Swapping while the page cache is still big
Page cache IO (with the exception of mmaped executables and libraries) tends to have a lot of spatial locality in its access patterns, which means that disk IO can be done in fairly large chunks. Anonymous memory and shared memory segments, on the other hand, tend to be fairly fragmented internally.
This makes the cost per page of swap disk IO a lot higher than what would be the case for file IO. Because of this discrepancy in IO cost, page cache should be evicted more aggressively than anonymous and shmfs data.
Streaming data pushes often-accessed data out of the page cache
Currently the used-once algorithm in the page cache is not especially effective, because often-accessed pages can be pushed out by streaming IO. First, the kernel ignores the referenced bit when moving page cache pages from the active to the inactive list. Second, the scan pressure on the active list relative to the inactive list depends only on the relative sizes of the two lists. We can probably do a better job of keeping the page cache working set cached.
(PeterZ says: http://programming.kicks-ass.net/kernel-patches/useonce-cleanup/ )
Page cache vs. anonymous memory
There may be grounds for treating anonymous memory and page cache differently. For the page cache a used-once scheme is probably good, while for anonymous memory (where every page starts out referenced, so a used-once scheme does not apply) something like SEQ replacement may be more appropriate.