Pageout and kswapd
By - Ameet Patil (comments to donamya 'AT' yahoo.com).
This document explains how page out operations work, focusing on those carried out by the kswapd daemon thread. I will also touch upon the page replacement policy used in Linux.
NOTE: the description and code snippets presented here correspond to the Linux 2.6.x kernel only.
Pageout
"The operation performed by the kernel to evict a page resident in physical memory to the swap space on a secondary memory device (typically a disk)."
Swap space
"The region on a secondary device, slower than RAM, used to store pages evicted from physical memory. Linux typically uses one or more separate partitions specifically for swap space. However, Linux can also use a file as its swap space. On other OSs, Windows for example, the swap space is typically one or more large files on the existing Windows partition(s)."
Swap Cache
"In order to improve performance by reducing the number of disk accesses (both reads and writes), the Linux kernel implements the swap cache. This is essentially a cache of evicted pages that are either waiting to be written to the secondary swap space or were recently read back into memory. This allows the Linux kernel to perform simple physical readahead on the swap area, without needing to figure out which process each swap page belongs to. The swap cache is also useful for forking daemons, like sendmail. It is likely that the child processes will use different routines than the parent process, which means it is quite possible that, under very heavy system load, pages from the parent process get swapped out that will be needed by every child process. With the swap cache, such a page only has to be brought back in from disk once. After that, the other child processes can get it from the swap cache."
Efficient use of Memory
Linux, like most Unix operating systems, tries to use memory as efficiently as possible. That is, all memory that is not in use by the kernel or processes may be used as file cache, to reduce the number of disk accesses the system has to do. One consequence of this is that a busy Linux system will constantly run with most of its memory in use, and most memory allocations mean that another page has to be evicted from memory by the pageout code.
Under typical workloads, most of the memory will be in use by the page cache and by processes, which get their memory on demand after a page fault (see PageFaultHandling). The memory allocator (PageAllocation) will allocate a free page, and activate the pageout code if the number of free pages has fallen too low.
Asynchronous and Synchronous Pageout
Most of the time, the rate of page allocations is relatively low. In this case, the kswapd kernel thread can free memory as fast as it is allocated and the allocating processes can immediately get a memory page when needed, without having to wait for the pageout code. Having kswapd free memory in the background helps the applications run at maximum speed.
However, sometimes the rate of page allocation is so high that kswapd cannot keep up. When that happens, the applications that are allocating pages will help free pages themselves, by calling the function try_to_free_pages. This has the effect of throttling the heavy memory allocators and (on NUMA systems) focusing the pageout code on those memory zones which the heavily allocating processes allocate from.
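The split between background and direct reclaim can be pictured with a small model. This is only a sketch, not kernel code: toy_zone, toy_alloc_decision() and the enum values are hypothetical names, and using pages_min as the point where the allocating task reclaims pages itself is a simplifying assumption. The pages_low wake-up threshold is the one described in the kswapd section below.

```c
/* Toy model of the allocator's choice between the fast path,
 * waking kswapd, and direct reclaim. All names are hypothetical. */
enum reclaim_action {
    ALLOC_FAST,         /* enough free pages: just allocate */
    ALLOC_WAKE_KSWAPD,  /* below pages_low: allocate and wake kswapd */
    ALLOC_DIRECT        /* below pages_min: caller frees pages itself */
};

struct toy_zone {
    unsigned long free_pages;
    unsigned long pages_min;   /* ASSUMPTION: direct-reclaim threshold */
    unsigned long pages_low;   /* kswapd wake-up threshold */
};

enum reclaim_action toy_alloc_decision(const struct toy_zone *z)
{
    if (z->free_pages < z->pages_min)
        return ALLOC_DIRECT;       /* stands in for try_to_free_pages() */
    if (z->free_pages < z->pages_low)
        return ALLOC_WAKE_KSWAPD;  /* asynchronous pageout */
    return ALLOC_FAST;
}
```

With pages_min = 1024 and pages_low = 1280, a zone with 2000 free pages takes the fast path, one with 1200 free pages wakes kswapd, and one with 1000 free pages falls into direct reclaim.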
The kswapd() Daemon Thread
This is a kernel daemon thread, started at boot-up in kswapd_init(), which is responsible for keeping the number of available free pages in physical memory balanced at all times. The thread sleeps most of the time and is woken only when there is a need to evict pages from memory. The main body of this thread is the function kswapd() in the file mm/vmscan.c. The kswapd thread is woken up by the physical page allocator only when the number of available free pages is less than pages_low (a variable declared as unsigned long in the file include/linux/mmzone.h). The value of pages_low depends on the number of pages in a particular zone. It is calculated as:
zone->pages_low = (zone->pages_min * 5) / 4; /* in file mm/page_alloc.c */
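As a worked example of the arithmetic, the sketch below derives the watermarks from a given pages_min. The struct and function names are hypothetical, and the pages_high formula is an assumption made for illustration (this document only quotes the pages_low formula); check mm/page_alloc.c in your kernel version for the real one.

```c
/* Toy watermark calculation. Integer division truncates, as in the kernel. */
struct toy_watermarks {
    unsigned long min;
    unsigned long low;
    unsigned long high;
};

struct toy_watermarks toy_watermarks_from_min(unsigned long pages_min)
{
    struct toy_watermarks w;

    w.min  = pages_min;
    w.low  = (pages_min * 5) / 4;  /* formula quoted above */
    w.high = (pages_min * 6) / 4;  /* ASSUMPTION: not given in this document */
    return w;
}
```

For pages_min = 1024 this gives pages_low = 1280 and, under the assumed formula, pages_high = 1536. For pages_min = 255 the division truncates, giving pages_low = 318.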
When woken, the daemon thread frees pages until the pages_high mark is reached. The code snippet below shows the main loop of this thread (extracted from mm/vmscan.c).
for ( ; ; ) {
        unsigned long new_order;

        try_to_freeze();
        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        new_order = pgdat->kswapd_max_order;
        pgdat->kswapd_max_order = 0;
        if (order < new_order) {
                /*
                 * Don't sleep if someone wants a larger 'order'
                 * allocation
                 */
                order = new_order;
        } else {
                schedule();
                order = pgdat->kswapd_max_order;
        }
        finish_wait(&pgdat->kswapd_wait, &wait);

        balance_pgdat(pgdat, 0, order);
}
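At the top of each iteration the thread checks whether it can sleep. It first calls prepare_to_wait() to add itself to the pgdat->kswapd_wait wait queue. If a larger 'order' allocation has been requested in the meantime (order < new_order), it does not sleep; otherwise it calls schedule() and sleeps until it is woken because pages need to be freed/evicted. In both cases it then calls finish_wait(&pgdat->kswapd_wait, &wait); to remove itself from the wait queue. Next it calls the function balance_pgdat(), explained below.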
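The sleep/no-sleep decision in the loop can be traced with a user-space model. This is a sketch only: toy_pgdat and toy_kswapd_iteration() are hypothetical names, and the third parameter stands in for whatever order a waker stores in kswapd_max_order while the thread is asleep in schedule().

```c
/* Toy model of one iteration of the kswapd main loop. */
struct toy_pgdat {
    unsigned long kswapd_max_order;
};

/* Returns 1 if this iteration would have slept in schedule(), else 0.
 * *order is updated exactly as in the loop shown above. */
int toy_kswapd_iteration(struct toy_pgdat *pgdat, unsigned long *order,
                         unsigned long order_set_by_waker)
{
    unsigned long new_order = pgdat->kswapd_max_order;

    pgdat->kswapd_max_order = 0;
    if (*order < new_order) {
        /* A larger-order request arrived: do not sleep. */
        *order = new_order;
        return 0;
    }
    /* Stand-in for schedule(): a waker records its order, then we resume. */
    pgdat->kswapd_max_order = order_set_by_waker;
    *order = pgdat->kswapd_max_order;
    return 1;
}
```

Starting with order = 0 and kswapd_max_order = 3, the first iteration does not sleep and continues with order 3. The next iteration finds kswapd_max_order reset to 0, so it sleeps and picks up the order left by the waker.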
balance_pgdat()
This function works in one of two modes:
(1) If the parameter nr_pages is zero, it has to free/reclaim pages until the number of available free pages goes above pages_high.
(2) If the parameter nr_pages is greater than zero, it has to free/reclaim nr_pages pages from memory.
In the first loop, the zone(s) in which pages need to be reclaimed are determined. Later, shrink_zone() is called to reclaim either nr_pages or SWAP_CLUSTER_MAX pages from a particular zone. After this, the function shrink_slab() is called, which reclaims unused pages from kernel space (usually inodes and some other data structures used by the kernel).
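The two modes can be sketched for a single zone as follows. This is a simplified model under stated assumptions, not the kernel function: toy_reclaim_zone, toy_shrink_zone() and toy_balance() are hypothetical names, the batch size of 32 is a toy value, and the stand-in for shrink_zone() pretends that every requested page is reclaimed successfully.

```c
#define TOY_SWAP_CLUSTER_MAX 32UL   /* toy batch size (assumption) */

struct toy_reclaim_zone {
    unsigned long free_pages;
    unsigned long pages_high;
};

/* Stand-in for shrink_zone(): assume every requested page is freed. */
unsigned long toy_shrink_zone(struct toy_reclaim_zone *z, unsigned long want)
{
    z->free_pages += want;
    return want;
}

/* nr_pages == 0: reclaim until free_pages goes above pages_high.
 * nr_pages  > 0: reclaim exactly nr_pages pages.
 * Returns the number of pages reclaimed. */
unsigned long toy_balance(struct toy_reclaim_zone *z, unsigned long nr_pages)
{
    unsigned long reclaimed = 0;

    for (;;) {
        unsigned long batch = TOY_SWAP_CLUSTER_MAX;

        if (nr_pages == 0) {
            if (z->free_pages > z->pages_high)
                break;
        } else {
            if (reclaimed >= nr_pages)
                break;
            if (nr_pages - reclaimed < batch)
                batch = nr_pages - reclaimed;
        }
        reclaimed += toy_shrink_zone(z, batch);
    }
    return reclaimed;
}
```

With free_pages = 100 and pages_high = 200, calling toy_balance() with nr_pages = 0 reclaims four batches (128 pages) and stops at 228 free pages. Calling it with nr_pages = 50 reclaims exactly 50 pages regardless of the watermark.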
try_to_free_pages()
The try_to_free_pages() function is called directly by the allocating process/task when free memory pages are seriously short. This happens under heavy system load, when the kswapd thread cannot keep pace with the processes that are hungry for pages. These processes keep requesting more and more pages before the kswapd daemon is able to free some for allocation. In this situation the allocator itself calls this function to rapidly free up some pages if possible. This function bypasses the kswapd thread and tries to free pages in parallel with it. The operation is very similar to balance_pgdat(), but instead of calling shrink_zone() directly, it calls shrink_caches(), which in turn calls shrink_zone(). Also, the pdflush thread (responsible for writing dirty pages back to disk) is woken up here if it is sleeping.
shrink_zone
This function is used both by the kswapd thread and by the direct free operation via the try_to_free_pages() call. It is mainly responsible for freeing pages in the particular zone passed to it as a parameter. The main code is:
while (nr_active || nr_inactive) {
        if (nr_active) {
                sc->nr_to_scan = min(nr_active,
                                (unsigned long)sc->swap_cluster_max);
                nr_active -= sc->nr_to_scan;
                refill_inactive_zone(zone, sc);
        }

        if (nr_inactive) {
                sc->nr_to_scan = min(nr_inactive,
                                (unsigned long)sc->swap_cluster_max);
                nr_inactive -= sc->nr_to_scan;
                shrink_cache(zone, sc);
                if (sc->nr_to_reclaim <= 0)
                        break;
        }
}
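The while loop scans the active and inactive pages in batches of at most swap_cluster_max pages. For active pages, the function refill_inactive_zone() is called to decide whether they need to be moved to the inactive list. For inactive pages, the function shrink_cache() is called, which eventually leads to pageout() being called to page out some of the inactive pages existing in the swap cache. The loop stops early once sc->nr_to_reclaim has dropped to zero or below.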
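To see how the batching plays out, the loop can be replayed with plain counters. This is a sketch under assumptions: toy_scan_stats and toy_scan_batches() are hypothetical names, the calls to refill_inactive_zone() and shrink_cache() are replaced by counters, and freed_per_batch is a made-up parameter modelling how many pages each shrink_cache() call frees.

```c
struct toy_scan_stats {
    unsigned long refill_calls;  /* stand-in for refill_inactive_zone() */
    unsigned long shrink_calls;  /* stand-in for shrink_cache() */
};

unsigned long toy_min(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Replays the shrink_zone() loop. swap_cluster_max must be non-zero. */
struct toy_scan_stats toy_scan_batches(unsigned long nr_active,
                                       unsigned long nr_inactive,
                                       unsigned long swap_cluster_max,
                                       long nr_to_reclaim,
                                       long freed_per_batch)
{
    struct toy_scan_stats st = { 0, 0 };

    while (nr_active || nr_inactive) {
        if (nr_active) {
            unsigned long nr_to_scan = toy_min(nr_active, swap_cluster_max);

            nr_active -= nr_to_scan;
            st.refill_calls++;
        }
        if (nr_inactive) {
            unsigned long nr_to_scan = toy_min(nr_inactive, swap_cluster_max);

            nr_inactive -= nr_to_scan;
            st.shrink_calls++;
            nr_to_reclaim -= freed_per_batch;  /* hypothetical reclaim result */
            if (nr_to_reclaim <= 0)
                break;
        }
    }
    return st;
}
```

With 70 active pages, 40 inactive pages and a batch size of 32, and nothing being freed, the active list is handled in three batches (32, 32, 6) and the inactive list in two (32, 8). If the first inactive batch already satisfies nr_to_reclaim, the loop exits after one call of each kind.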
'''Question:''' what does throttle_vm_writeout() at the end of shrink_zone() do?
'''Answers on #mm IRC:'''
'''marcelo:''' throttle's the vm write out :) to avoid many pages in-flight (which can result in OOM), take a look at the function.
'''riel:''' makes kswapd wait instead of giving up after scanning too much, or simply wasting too much CPU
shrink_caches
shrink_cache
shrink_list
pageout
swap_page
refill_inactive_zone
Shrink Slab
Most kernel allocations are done through the slab allocator. Some kernel allocations are caches (e.g. inodes and dentries), parts of which can be freed by the pageout code. However, since each slab cache has its own data structures, each needs its own replacement algorithm, separate from the replacement algorithm used for page cache and process memory. Usually the slab occupies a fairly small part of memory, and replacement of kernel data structures is beyond the scope of this document.