Unfortunately, the requirements for what a page replacement system should do are not clear. The best we can do is define the desired behavior by what it should NOT do, and go from there. If you have a workload and system configuration that totally breaks the Linux 2.6 memory management, please add it to the list.

Heavy anonymous memory workload

Synopsis: When memory is dominated by anonymous pages (JVM, big database) on large systems (16GB or larger), the Linux VM can end up spending many minutes scanning pages before swapping out the first page. The VM can take hours to recover from a memory crunch.

Kernel versions affected: 2.6.0 - current (no fix yet).

Details: Anonymous pages start their life cycle on the active list with the referenced bit set. As a consequence, the system will have a very large active list (millions of pages), and every page on it has the recently-accessed bit set. The inactive list is tiny, typically around 1000 pages. The first few processes to dive into shrink_zone() will scan thousands of active pages, but will not even try to free a single inactive page:

        zone->nr_scan_active +=
                (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
        nr_active = zone->nr_scan_active;
        if (nr_active >= sc->swap_cluster_max)
                zone->nr_scan_active = 0;
        else
                nr_active = 0;

        zone->nr_scan_inactive +=
                (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
        nr_inactive = zone->nr_scan_inactive;
        if (nr_inactive >= sc->swap_cluster_max)
                zone->nr_scan_inactive = 0;
        else
                nr_inactive = 0;
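
The effect of the accumulate-and-threshold logic above can be sketched with a small userspace model. This is only an illustration, not kernel code: struct scan_counter, pages_to_scan() and calls_until_first_scan() are hypothetical names, and the model assumes a single caller, a swap_cluster_max of 32 and a starting priority of 12.

```c
#define SWAP_CLUSTER_MAX 32UL   /* assumed batch size (sc->swap_cluster_max) */
#define DEF_PRIORITY     12     /* assumed starting scan priority */

/* Hypothetical stand-in for zone->nr_scan_active / zone->nr_scan_inactive. */
struct scan_counter {
        unsigned long nr_scan;
};

/*
 * Same accumulate-and-threshold step as in the excerpt: returns how many
 * pages this call would scan, or 0 if the batch is not yet full.
 */
static unsigned long pages_to_scan(struct scan_counter *c,
                                   unsigned long list_size, int priority)
{
        unsigned long nr;

        c->nr_scan += (list_size >> priority) + 1;
        nr = c->nr_scan;
        if (nr >= SWAP_CLUSTER_MAX)
                c->nr_scan = 0;
        else
                nr = 0;
        return nr;
}

/* Number of calls needed before any page on the list gets scanned. */
static unsigned int calls_until_first_scan(unsigned long list_size,
                                           int priority)
{
        struct scan_counter c = { 0 };
        unsigned int calls = 1;

        while (pages_to_scan(&c, list_size, priority) == 0)
                calls++;
        return calls;
}
```

In this model, an active list of 4 million pages is scanned on the very first call at DEF_PRIORITY, while an inactive list of 1000 pages only contributes 1 to the counter per call, so calls_until_first_scan(1000, DEF_PRIORITY) comes out at 32.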

Because the first processes to go into try_to_free_pages() make no progress in shrink_zone(), they usually get joined by pretty much every other process in the system. Now you have hundreds of processes all scanning active pages, but not deactivating them yet because there are millions and they all have the accessed bit set. Lock contention on the zone->lru_lock and various other locks can slow things to a crawl.

Cases have been seen where it took over 10 minutes for the first process to call shrink_inactive_list()! By that time, the scanning priority of all processes has been reduced to a much smaller value (a smaller priority value means a larger fraction of each list gets scanned per pass), and every process in the system will try to reclaim a fairly large number of pages.

Pageout won't stop

Synopsis: pageout activity continues even when lots of memory is already free.

Kernel versions affected: 2.6.0 - current

Details: Sometimes pageout activity continues even when half of memory (or more) has already been freed, simply because of the large number of processes that are still active in the pageout code! In some workloads, it can take a few hours for the system to run normally again...

We will need to find a way to abort pageout activities when enough memory is free. On the other hand, higher-order allocations may want to continue freeing pages for a while longer. Maybe we need to pass the allocation order into try_to_free_pages() and call zone_watermark_ok() to test. Maybe Andrea Arcangeli's patches from July 8th 2007 are enough to fix this problem, especially "[PATCH 15 of 16] limit reclaim if enough pages have been freed".
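
A rough sketch of what an order-aware abort test could look like, written as a userspace toy rather than a patch. Everything here is an assumption made for illustration: struct toy_zone, toy_watermark_ok() and should_abort_reclaim() are hypothetical, and the watermark test is a simplified model of the idea (enough free pages overall, and enough of them in blocks of at least the requested order), not the kernel's zone_watermark_ok().

```c
#include <stdbool.h>

#define MAX_ORDER 11   /* assumed number of buddy free lists */

/* Hypothetical, stripped-down view of a zone. */
struct toy_zone {
        unsigned long nr_free[MAX_ORDER];  /* free blocks of 2^order pages */
        unsigned long pages_high;          /* high watermark, in pages */
};

/* Total free pages across all orders. */
static unsigned long toy_free_pages(const struct toy_zone *z)
{
        unsigned long total = 0;

        for (int o = 0; o < MAX_ORDER; o++)
                total += z->nr_free[o] << o;
        return total;
}

/*
 * Simplified watermark test: pages sitting in blocks smaller than the
 * requested order do not count, and the required amount is halved for
 * each order step.
 */
static bool toy_watermark_ok(const struct toy_zone *z, int order)
{
        unsigned long free = toy_free_pages(z);
        unsigned long min = z->pages_high;

        if (free <= min)
                return false;
        for (int o = 0; o < order; o++) {
                free -= z->nr_free[o] << o;  /* too small for this request */
                min >>= 1;
                if (free <= min)
                        return false;
        }
        return true;
}

/* Direct reclaim could stop once the test passes for the caller's order. */
static bool should_abort_reclaim(const struct toy_zone *z, int order)
{
        return toy_watermark_ok(z, order);
}
```

With 10000 free order-0 pages and a watermark of 1000, this model lets an order-0 allocator stop reclaiming immediately, while an order-3 allocator keeps going until larger blocks show up.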

Kswapd cannot keep up

During medium loads on the VM, kswapd will go to sleep in congestion_wait() even if many of the pages it ran into were freeable and no IO was started. This can easily happen if the inactive list is small and kswapd simply does not scan enough (or any!) inactive pages in shrink_zone(), focussing its efforts exclusively on moving pages from the active list to the inactive list.

This has the effect of slowing down userspace, with processes having to do direct reclaim simply because kswapd went to sleep here:

                /*
                 * OK, kswapd is getting into trouble.  Take a nap, then take
                 * another pass across the zones.
                 */
                if (total_scanned && priority < DEF_PRIORITY - 2)
                        congestion_wait(WRITE, HZ/10);

The above code is buggy because kswapd goes to sleep even when it did not get into any trouble. Going to sleep here if a lot of IO has been queued is a good idea, but going to sleep just because kswapd has been working the active list instead of the inactive list causes trouble.
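
One possible shape for a fix is to make the nap conditional on IO actually having been queued. The sketch below is a userspace model only: nap_current() restates the condition from the excerpt above, while nap_if_io_queued() and its nr_io_queued parameter are hypothetical and assume kswapd can count the writeback it started during the pass.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12   /* assumed starting scan priority */

/* Condition as it appears in the excerpt above. */
static bool nap_current(unsigned long total_scanned, int priority)
{
        return total_scanned && priority < DEF_PRIORITY - 2;
}

/*
 * Hypothetical variant: nap only if this pass really queued IO, so that
 * time spent working the active list does not put kswapd to sleep.
 */
static bool nap_if_io_queued(unsigned long total_scanned, int priority,
                             unsigned long nr_io_queued)
{
        return nap_current(total_scanned, priority) && nr_io_queued > 0;
}
```

With 5000 pages scanned at priority 8 and no IO queued, nap_current() says sleep and nap_if_io_queued() says keep going.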

Lock contention

The direct reclaim path can cause lock contention in the VM when multiple processes dive into the pageout code simultaneously. Every time one locking issue is fixed, contention shows up on the next lock in the series.

call path	fixed	problem details	resolution details

shrink_zone() -> zone->lru_lock	2.6.0	multiple CPUs taking and releasing the zone->lru_lock very rapidly	taking pages off the LRU and operating on batches
page_referenced() -> ... -> page_referenced_one() -> mm->page_table_lock	2.6.12 ?	this lock was more of a problem for the page fault path, but the problem is triggerable on pageout, too	split up the page table lock into one per page table page (i.e. one lock per 2MB of process memory) instead of one per process
page_referenced() -> page_referenced_anon() -> page_lock_anon_vma()	?	in large, heavily threaded processes (usually a JVM) many threads can try to swap out pages from the same process simultaneously	turning the anon_vma lock into a read/write lock and taking it for read-only in page_referenced_anon() mitigates this issue
try_to_free_pages() -> shrink_slab() -> prune_icache() -> iprune_mutex		after fixing some small bugs in the pageout of normal pages, about 1000 to 1500 (out of 6500) processes during an AIM7 run got stuck waiting for this lock
try_to_free_pages() -> shrink_slab() -> prune_dcache() -> dcache_lock		next on the list after fixing the inode_lock contention?

The big question is whether we want to fix these lock contention problems by changing the data structures, or if we want to avoid the problem by having the direct reclaim path do less work. For example, having kswapd be the only process that dives into shrink_slab() might get any lock contention in that area out of the way.

The wrong pages get evicted

There are several variations of this problem.

Swapping while the page cache is still big

Page cache IO (with the exception of mmapped executables and libraries) tends to have a lot of spatial locality in its access patterns, which means that disk IO can be done in fairly large chunks. Anonymous memory and shared memory segments, on the other hand, tend to be fairly fragmented internally.

This makes the cost per page of swap disk IO a lot higher than what would be the case for file IO. Because of this discrepancy in IO cost, page cache should be evicted more aggressively than anonymous and shmfs data.
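
A back-of-the-envelope model makes the discrepancy concrete. The disk parameters below (8 ms per seek, 50 MB/s sequential transfer, 4 KB pages) are assumptions picked for illustration, and ms_per_page() is a hypothetical helper, not something taken from the kernel.

```c
/* Assumed disk parameters, for illustration only. */
#define SEEK_MS        8.0    /* average seek plus rotational delay */
#define TRANSFER_MB_S  50.0   /* sequential transfer rate */
#define PAGE_KB        4.0    /* page size */

/*
 * Cost in milliseconds per page when every IO pays one seek and then
 * transfers pages_per_io contiguous pages.
 */
static double ms_per_page(double pages_per_io)
{
        double transfer_ms =
                pages_per_io * PAGE_KB / 1024.0 / TRANSFER_MB_S * 1000.0;

        return (SEEK_MS + transfer_ms) / pages_per_io;
}
```

Under these assumptions, fragmented swap IO done one page at a time costs a little over 8 ms per page, while file IO done in 32-page chunks costs about 0.33 ms per page, a difference of more than a factor of 20.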

Streaming data pushes often-accessed data out of the page cache

Currently the used-once algorithm in the page cache is not especially effective, because often-accessed pages can be pushed out by streaming IO. First, the kernel ignores the referenced bit when moving page cache pages from the active to the inactive list. Second, the amount of pressure on the active list relative to the inactive list depends only on the relative sizes of the two lists. We can probably do a better job of keeping the page cache working set cached.

Page cache vs. anonymous memory

	data size	locality of reference

page cache	very large, sometimes the whole filesystem	spatial, pages near each other get referenced together (few pages get accessed over and over again)
anonymous memory	smaller than or similar to the size of RAM	temporal, many pages get accessed over and over again, some more often than others

There may be grounds for treating anonymous memory and page cache differently. For page cache a used-once scheme is probably good, while for anonymous memory (everything starts out referenced, no referenced bit is ignored) something like SEQ replacement may be more appropriate.
