Unfortunately the requirements for what a page replacement system should do are not clear. The best we can do is define the desired behavior by what it should NOT do, and try to go from there. If you have a workload and system configuration that totally break the Linux 2.6 memory management, please add it to the list.

Heavy anonymous memory workload

Synopsis: When memory is dominated by anonymous pages (JVM, big database) on large systems (16GB or larger), the Linux VM can end up spending many minutes scanning pages before swapping out the first page. The VM can take hours to recover from a memory crunch.

Kernel versions affected: 2.6.0 - current (no fix yet).

Details: Anonymous pages start their life cycle on the active list and referenced. As a consequence, the system will have a very large active list (millions of pages) that all have the recently-accessed bit set. The inactive list is tiny, typically around 1000 pages. The first few processes to dive into shrink_zone() will scan thousands of active pages, but will not even try to free a single inactive page:

        zone->nr_scan_active +=
                (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
        nr_active = zone->nr_scan_active;
        if (nr_active >= sc->swap_cluster_max)
                zone->nr_scan_active = 0;
        else
                nr_active = 0;

        zone->nr_scan_inactive +=
                (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
        nr_inactive = zone->nr_scan_inactive;
        if (nr_inactive >= sc->swap_cluster_max)
                zone->nr_scan_inactive = 0;
        else
                nr_inactive = 0;

Because the first processes to go into try_to_free_pages() make no progress in shrink_zone(), they usually get joined by pretty much every other process in the system. Now you have hundreds of processes all scanning active pages, but not deactivating them yet because there are millions and they all have the accessed bit set. Lock contention on the zone->lru_lock and various other locks can slow things to a crawl.

Cases have been seen where it took over 10 minutes for the first process to call shrink_inactive_list()! By that time, the scanning priority of all processes has been reduced to a much smaller value, and every process in the system will try to reclaim a fairly large number of pages.

Sometimes pageout activity continues even when half of memory (or more) has already been freed, simply because of the large number of processes that are still active in the pageout code! In some workloads, it can take a few hours for the system to run normally again...

Kswapd cannot keep up

During medium loads on the VM, kswapd will go to sleep in congestion_wait() even if many of the pages it ran into were freeable and no IO was started. This can easily happen if the inactive list is small and kswapd simply does not scan enough (or any!) inactive pages in shrink_zone(), focussing its efforts exclusively on moving pages from the active list to the inactive list.

This has the effect of slowing down userspace, with processes having to do direct reclaim simply because kswapd went to sleep here:

                /*
                 * OK, kswapd is getting into trouble.  Take a nap, then take
                 * another pass across the zones.
                 */
                if (total_scanned && priority < DEF_PRIORITY - 2)
                        congestion_wait(WRITE, HZ/10);

The above code is buggy because kswapd goes to sleep even when it did not get into any trouble. Going to sleep here if a lot of IO has been queued is a good idea, but going to sleep just because kswapd has been working the active list instead of the inactive list causes trouble.

CategoryAdvancedPageReplacement