Unfortunately the requirements for what a page replacement system should do are not clear. The best we can do is define the desired behavior by what it should NOT do, and try to go from there. If you have a workload and system configuration that totally break the Linux 2.6 memory management, please add it to the list.

= Heavy anonymous memory workload =

Synopsis: when memory is dominated by anonymous pages (JVM, big database) on large systems (16GB or larger), the Linux VM can end up spending many minutes scanning pages before swapping out the first page. The VM can take hours to recover from a memory crunch.

Kernel versions affected: 2.6.0 - current (no fix yet).

Details: anonymous pages start their life cycle on the active list, with the referenced bit set. As a consequence, the system will have a very large active list (millions of pages) whose pages all have the recently-accessed bit set. The inactive list is tiny, typically around 1000 pages. The first few processes to dive into shrink_zone() will scan thousands of active pages, but will not even try to free a single inactive page:

{{{
	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
	nr_active = zone->nr_scan_active;
	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;
	else
		nr_active = 0;

	zone->nr_scan_inactive +=
		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
	nr_inactive = zone->nr_scan_inactive;
	if (nr_inactive >= sc->swap_cluster_max)
		zone->nr_scan_inactive = 0;
	else
		nr_inactive = 0;
}}}

Because the first processes to go into try_to_free_pages() make no progress in shrink_zone(), they are usually joined by pretty much every other process in the system. Now you have hundreds of processes all scanning active pages, but not deactivating them yet, because there are millions of them and they all have the accessed bit set. Lock contention on zone->lru_lock and various other locks can slow things to a crawl. Cases have been seen where it took over 10 minutes for the first process to call shrink_inactive_list()! By that time, the scanning priority of all processes has been reduced to a much smaller value, and every process in the system will try to reclaim a fairly large number of pages.

== Pageout won't stop ==

Synopsis: pageout activity continues even when lots of memory is already free.

Kernel versions affected: 2.6.0 - current

Details: sometimes pageout activity continues even when half of memory (or more) has already been freed, simply because of the large number of processes that are still active in the pageout code! In some workloads, it can take a few hours for the system to run normally again...

We will need to find a way to abort pageout activity when enough memory is free. On the other hand, higher-order allocations may want to continue to free pages for a while longer. Maybe we need to pass the allocation order into try_to_free_pages() and call zone_watermark_ok() to test; a rough sketch of that idea follows below. Maybe Andrea Arcangeli's patches from June 8th 2007 are enough to fix this problem, especially "[PATCH 15 of 16] limit reclaim if enough pages have been freed".
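A minimal sketch of what that bail-out test could look like is below. It is not from any kernel tree: the reclaim_goal_met() name is made up, and sc->order is assumed to have been passed down into struct scan_control by the callers of try_to_free_pages().

{{{
/*
 * Illustrative sketch only: return true once every zone on the
 * zonelist could again satisfy an allocation of the requested
 * order, so that late arrivals in direct reclaim can bail out
 * instead of freeing even more memory.
 *
 * Assumes sc->order is filled in by the callers of
 * try_to_free_pages(); reclaim_goal_met() is a made-up name.
 */
static int reclaim_goal_met(struct zone **zones, struct scan_control *sc)
{
	int i;

	for (i = 0; zones[i] != NULL; i++) {
		struct zone *zone = zones[i];

		if (!populated_zone(zone))
			continue;

		/* Still below the high watermark for this order? Keep reclaiming. */
		if (!zone_watermark_ok(zone, sc->order, zone->pages_high, 0, 0))
			return 0;
	}

	return 1;
}
}}}

A test like this could be run at the top of each pass of the priority loop in try_to_free_pages(), so that processes which piled into direct reclaim late notice that earlier reclaimers (or kswapd) already freed enough memory, and return without scanning anything.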
= Kswapd cannot keep up =

During medium loads on the VM, kswapd will go to sleep in congestion_wait() even if many of the pages it ran into were freeable and no IO was started. This can easily happen if the inactive list is small and kswapd simply does not scan enough (or any!) inactive pages in shrink_zone(), focusing its efforts exclusively on moving pages from the active list to the inactive list.

This has the effect of slowing down userspace, with processes having to do direct reclaim simply because kswapd went to sleep here:

{{{
		/*
		 * OK, kswapd is getting into trouble.  Take a nap, then take
		 * another pass across the zones.
		 */
		if (total_scanned && priority < DEF_PRIORITY - 2)
			congestion_wait(WRITE, HZ/10);
}}}

The above code is buggy because kswapd goes to sleep even when it did not get into any trouble. Going to sleep here if a lot of IO has been queued is a good idea, but going to sleep just because kswapd has been working the active list instead of the inactive list causes trouble.
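One possible direction is to throttle kswapd only when it actually submitted IO during the pass. The fragment below is just a sketch: nr_io_queued is a hypothetical scan_control field that the pageout path would increment whenever it queues writeback; it does not exist in mainline.

{{{
		/*
		 * Sketch only: nr_io_queued is a hypothetical counter that
		 * the pageout path would increment for every page that had
		 * writeback IO submitted against it.  Only throttle kswapd
		 * when it really queued a pile of IO, not merely because it
		 * spent its time moving pages from the active list to the
		 * inactive list.
		 */
		if (sc.nr_io_queued > sc.swap_cluster_max &&
		    priority < DEF_PRIORITY - 2)
			congestion_wait(WRITE, HZ/10);
}}}

Whether the right threshold is swap_cluster_max or something tied to the request queue depth is an open question; the point is only that the nap should be driven by IO actually queued, not by how many pages kswapd happened to scan.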
= Lock contention =

The direct reclaim path can cause lock contention in the VM when multiple processes dive into the pageout code simultaneously. Every time one locking issue has been fixed, contention has moved to the next lock in the series.

|| call path || fixed || problem details || resolution details ||
||<-4> ||
|| shrink_zone() -> zone->lru_lock || 2.6.0 || multiple CPUs taking and releasing the zone->lru_lock very rapidly || taking pages off the LRU and operating on batches ||
|| page_referenced() -> ... -> page_referenced_one() -> mm->page_table_lock || 2.6.12 ? || this lock was more of a problem for the page fault path, but the problem is triggerable on pageout, too || split up the page table lock into one per page table page (i.e. one lock per 2MB of process memory) instead of one per process ||
|| page_referenced() -> page_referenced_anon() -> page_lock_anon_vma() || ? || in large, heavily threaded processes (usually a JVM) many threads can try to swap out pages from the same process simultaneously || turning the anon_vma lock into a read/write lock and taking it for read-only in page_referenced_anon() mitigates this issue ||
|| try_to_free_pages() -> shrink_slab() -> prune_icache() -> iprune_mutex || || after fixing some small bugs in the pageout of normal pages, about 1000 to 1500 (out of 6500) processes during an AIM7 run got stuck waiting for this lock || ||
|| try_to_free_pages() -> shrink_slab() -> prune_dcache() -> dcache_lock || || next on the list after fixing the inode_lock contention? || ||

The big question is whether we want to fix these lock contention problems by changing the data structures, or whether we want to avoid the problem by having the direct reclaim path do less work. For example, making kswapd the only process that dives into shrink_slab() might get any lock contention in that area out of the way.

= The wrong pages get evicted =

There are several variations of this problem.

== Swapping while the page cache is still big ==

Page cache IO (with the exception of mmaped executables and libraries) tends to have a lot of spatial locality in its access patterns, which means that disk IO can be done in fairly large chunks. Anonymous memory and shared memory segments, on the other hand, tend to be fairly fragmented internally. This makes the cost per page of swap IO a lot higher than the cost per page of file IO. Because of this discrepancy in IO cost, page cache should be evicted more aggressively than anonymous and shmfs data.

== Streaming data pushes often-accessed data out of the page cache ==

Currently the used-once algorithm in the page cache is not especially effective, because often-accessed pages can be pushed out by streaming IO. For one, the kernel ignores the referenced bit when moving page cache pages from the active to the inactive list. Secondly, the amount of pressure on the active list in relation to the inactive list depends only on the relative sizes of the two lists. We can probably do a better job of keeping the page cache working set cached.

== Page cache vs. anonymous memory ==

|| || data size || locality of reference ||
||<-3> ||
|| page cache || very large, sometimes the whole filesystem || spatial, pages near each other get referenced together (few pages get accessed over and over again) ||
|| anonymous memory || smaller than or similar to the size of RAM || temporal, many pages get accessed over and over again, some more often than others ||

There may be grounds for treating anonymous memory and page cache differently. For page cache a used-once scheme is probably good, while for anonymous memory (where everything starts out referenced, so the referenced bit cannot simply be ignored) something like SEQ replacement may be more appropriate.

----
CategoryAdvancedPageReplacement