Unfortunately the requirements for what a page replacement system should do are not clear. The best we can do is define the desired behavior by what it should NOT do, and try to go from there. If you have a workload and system configuration that totally break the Linux 2.6 memory management, please add it to the list.
Heavy anonymous memory
Synopsis: When memory is dominated by anonymous pages (JVM, big database) on large systems (16GB or larger), the Linux VM can end up spending many minutes scanning pages before swapping out the first page. The VM can take hours to recover from a memory crunch.
Kernel versions affected: 2.6.0 - current (no fix yet).
Details: Anonymous pages start their life cycle on the active list, with the referenced bit set. As a consequence, the system ends up with a very large active list (millions of pages) that all have the recently-accessed bit set, while the inactive list is tiny, typically around 1000 pages. The first few processes to dive into shrink_zone() will scan thousands of active pages, but will not even try to free a single inactive page:
    zone->nr_scan_active += (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
    nr_active = zone->nr_scan_active;
    if (nr_active >= sc->swap_cluster_max)
        zone->nr_scan_active = 0;
    else
        nr_active = 0;

    zone->nr_scan_inactive += (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
    nr_inactive = zone->nr_scan_inactive;
    if (nr_inactive >= sc->swap_cluster_max)
        zone->nr_scan_inactive = 0;
    else
        nr_inactive = 0;
Because the first processes to go into try_to_free_pages() make no progress in shrink_zone(), they usually get joined by pretty much every other process in the system. Now you have hundreds of processes all scanning active pages, but not deactivating any of them yet, because there are millions of active pages and they all have the accessed bit set. Lock contention on the zone->lru_lock and various other locks can slow things to a crawl.
Cases have been seen where it took over 10 minutes for the first process to call shrink_inactive_list()! By that time, the scanning priority of all processes has been reduced to a much smaller value, and every process in the system will try to reclaim a fairly large number of pages.
Sometimes pageout activity continues even after half of memory (or more) has already been freed, simply because of the large number of processes that are still active in the pageout code! In some workloads, it can take a few hours for the system to run normally again...