Over the last few years, Linux memory management bugs have been resolved as they came in, one by one. However, it has been quite common for some classes of bugs to get fixed and reintroduced repeatedly.
The speed gap between memory and hard disks is increasing, with disk latencies reaching tens of millions of CPU cycles. Additionally, large memory systems (>64GB) are becoming more and more common, and present their own set of scalability challenges.
Maybe it is time for us to understand all the constraints a page replacement mechanism has to satisfy, instead of fixing the bugs one by one? At the very least, this page could turn into a list of "do"s and "don't"s for the VM that can be amended as we go.
Must not submit too much I/O at once: doing so can kill latency and even lead to deadlocks when bounce buffers (highmem) are involved. Note that submitting sequential I/O is a good thing.
For more problems that need fixing, see the list of problem workloads.
- Requirements shortlist
- Pageout selection
- Limited pageout I/O
- Multiple Zones
- Background aging
- Batch processing
- Low overhead of execution
- Tuning Knobs
- Other considerations
- User requests
Effective as second level cache
The only hits in a second level cache are the cache misses from the primary cache. This means that the inter-reference distances on, for example, a file server may be very large. A page replacement algorithm should be able to detect and cache the most frequently accessed pages even in this case.
Recency vs. Frequency
The use-once algorithm in the current 2.6 kernel does the wrong thing in some use cases. For example, rsync can briefly touch the same pages twice and then never again; in this case, the pages should not get promoted to the active list.
For page replacement purposes, "referenced twice" should mean that the page was referenced in two distinct time slots during which the VM scanned the page's referenced bit; that way, "referenced twice" is counted the same for page tables as it is for page structs.
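The distinction can be sketched in a toy model (all names here are invented for illustration, not kernel API): a page is promoted only when the VM finds its referenced bit set in two separate scan rounds, so two rapid touches within one round, as in the rsync case, do not promote it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the use-once heuristic: promote a page to the active
 * list only when its referenced bit was found set in two separate
 * scan rounds, not merely because it was touched twice in one round. */
struct page_model {
    bool referenced;      /* hardware referenced bit, set on access      */
    bool seen_referenced; /* VM saw the bit set in an earlier scan round */
    bool active;          /* promoted to the active list                 */
};

static void touch(struct page_model *p)
{
    p->referenced = true;
}

/* One VM scan of this page: test-and-clear the referenced bit and
 * promote only in the second round in which it was found set. */
static void scan(struct page_model *p)
{
    if (p->referenced) {
        if (p->seen_referenced)
            p->active = true;   /* referenced in two rounds: promote */
        else
            p->seen_referenced = true;
        p->referenced = false;  /* clear, like a test-and-clear scan */
    }
}
```

In this model an rsync-style double touch before a single scan leaves the page inactive, while a page referenced in two consecutive scan rounds ends up active.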
Limited pageout I/O
Pageout I/O is submitted as pages hit the end of the LRU list. Dirty pages are then rotated back onto the start of the inactive list. Not only does this disturb LRU order, but it can result in hundreds of megabytes worth of small I/Os being submitted at once. This kills I/O latency and can lead to deadlocks on 32-bit systems with highmem, where the kernel needs to allocate bounce buffers and/or buffer heads from low memory.
Reclaim after I/O
The rotate_reclaimable_page() mechanism in current 2.6 kernels fixes part of the problem by moving pages back to the end of the inactive list when I/O finishes, but there is no effective mechanism to limit how much I/O is submitted at once.
The importance of sequential I/O
Since most disk writes are dominated by seek time, the VM should aim for sequential/clustered writeout, as well as refrain from submitting too much pageout I/O at once. If the VM wants to free 10MB of memory, it should not submit 500MB worth of I/O just because that many pages happen to be on the inactive list.
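A cap on in-flight pageout could look something like the sketch below (the function name and the fixed per-pass budget are invented for illustration; a real implementation would also cluster the writes): the reclaim pass stops queueing writeback once it has submitted enough I/O to satisfy the current shortage, rather than writing every dirty page it walks past.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: bound how much pageout I/O one reclaim pass may
 * queue.  We want to free nr_to_free pages, so stop submitting
 * writeback once that many dirty pages are in flight, instead of
 * starting I/O against the whole inactive list. */
static size_t submit_pageout(size_t nr_dirty_on_list, size_t nr_to_free)
{
    size_t submitted = 0;

    for (size_t i = 0; i < nr_dirty_on_list; i++) {
        if (submitted >= nr_to_free)
            break;      /* enough I/O in flight; stop early      */
        submitted++;    /* queue writeback for this dirty page   */
    }
    return submitted;
}
```

With 4K pages, wanting 10MB freed means roughly 2560 writeouts, even if 500MB (128000 pages) of dirty data sits on the inactive list.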
The page-out operation is not synchronous. Dirty pages selected for reclaim are not freed directly; writeback is started against them (PG_writeback is set) and they are fed back onto the resident list. If, on completion of the write to their backing store, the referenced bit is still unset, a callback (rotate_reclaimable_page) places them where they are immediate candidates for reclaim again.
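The pageout-then-reclaim path just described can be modelled as a small state machine (the struct, enum, and function names are illustrative, not the kernel's): start writeback, keep the page resident, and on completion let the referenced bit decide whether the page becomes the next reclaim candidate or earns another trip around the list.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the page lands on I/O completion. */
enum list_pos { INACTIVE_HEAD, INACTIVE_TAIL };

struct page_state {
    bool writeback;   /* models PG_writeback                */
    bool referenced;  /* touched again while under I/O      */
};

/* Select a dirty page for reclaim: start writeback, keep it resident. */
static void start_pageout(struct page_state *p)
{
    p->writeback = true;
}

/* I/O completion callback, in the spirit of rotate_reclaimable_page():
 * a still-unreferenced page goes to the tail of the inactive list,
 * making it an immediate reclaim candidate; a re-referenced page is
 * put back at the head instead. */
static enum list_pos end_pageout(struct page_state *p)
{
    p->writeback = false;
    return p->referenced ? INACTIVE_HEAD : INACTIVE_TAIL;
}
```

The point of the model is that reclaim never blocks on the write itself; the placement decision is deferred to the completion callback.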
Multiple Zones
Unlike most operating systems, Linux has multiple memory zones; that is, memory is not viewed as one big contiguous region. There are specific regions of memory in which it is desirable to be able to free pages: think of NUMA topologies, or DMA engines that cannot address the full address space. Hence memory is viewed as multiple zones.
For traditional page replacement algorithms this is not a big issue, since per-zone page replacement can simply be implemented, e.g. one CLOCK per zone. However, with the introduction of non-resident page state tracking in recent algorithms this does become a problem: since a page can fault back into a different zone than the one it was evicted from, non-resident page state needs to be tracked across all memory, not per zone.
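One way to picture this is a single machine-wide table of recently evicted pages, keyed only by the page's identity and never by zone (the table, hash, and function names below are invented for illustration; real non-resident tracking stores more state, such as eviction distance):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of machine-wide non-resident page tracking.  Because a
 * page can fault back into a different zone than it was evicted from,
 * the eviction history is keyed by what identifies the page (here one
 * id standing in for mapping + offset), with no per-zone state. */
#define NONRES_SLOTS 1024

static unsigned long nonres_table[NONRES_SLOTS]; /* 0 = empty slot */

static size_t slot_of(unsigned long page_id)
{
    return page_id % NONRES_SLOTS;  /* toy hash; real code does better */
}

/* Called at eviction time, from whichever zone the page lived in. */
static void nonres_remember(unsigned long page_id)
{
    nonres_table[slot_of(page_id)] = page_id;
}

/* Called at fault time; any zone may ask about any page. */
static int nonres_was_evicted(unsigned long page_id)
{
    return nonres_table[slot_of(page_id)] == page_id;
}
```

A per-zone version of this table would miss refaults that land in a different zone, which is exactly the problem described above.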
Background aging
To avoid being caught without reclaimable memory when demand spikes, the system should always have some pages on hand that are good candidates for eviction. Light background aging of pages may be one way to get the desired result. There may be others.
Unlike many other subsystems, which are optimized for the common case, the VM also needs to be optimized for the worst case. This is because the latency difference between RAM and disk can be tens of millions of CPU cycles.
All heuristics will do the wrong thing occasionally, and the VM is no exception. However, there should be mechanisms (probably feedback loops) to stop the VM from blindly continuing down the wrong path and turning a single mistake into a worst case scenario.
Low overhead of execution
Evicting the wrong pages can be extremely costly, reducing system performance by orders of magnitude. However, the VM also cannot go overboard in trying to analyze what is going on when selecting pages to evict. The algorithms used for pageout selection cannot scan the page structs and page tables too often, or they will end up wasting too much CPU. This is especially true for large memory systems: 128GB of RAM is not that strange any more in 2007, and 1TB systems will probably be common within a few years.
Expensive Referenced Check
Because multiple page table entries can refer to the same physical page, checking the referenced bit is not as cheap as most algorithms assume: it requires a reverse-mapping (rmap) walk. Hence we need to do the check without holding most locks. This suggests a batched approach to minimize the lock/unlock frequency. Modifying algorithms to work this way is usually not very hard.
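The batching idea amounts to amortizing the lock round-trip over many pages. In the sketch below the lock and the rmap-style check are stubbed out, and the batch size and function names are invented; the counter only exists to show the reduced lock traffic:

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 32

static unsigned long lock_acquisitions;

static void lock(void)   { lock_acquisitions++; }  /* stub lock      */
static void unlock(void) { }                       /* stub unlock    */

/* Stub for the expensive referenced check: a real version walks all
 * page table entries mapping this page via rmap. */
static int page_referenced(size_t page)
{
    (void)page;
    return 0;
}

/* Scan pages in batches: one lock round-trip per BATCH pages instead
 * of one per page. */
static void scan_pages(size_t nr_pages)
{
    for (size_t i = 0; i < nr_pages; i += BATCH) {
        size_t end = (i + BATCH < nr_pages) ? i + BATCH : nr_pages;

        lock();                   /* one acquisition per batch */
        for (size_t j = i; j < end; j++)
            page_referenced(j);   /* check all mappers of page j */
        unlock();
    }
}
```

Scanning 1000 pages here takes 32 lock acquisitions rather than 1000; the per-page version would take the lock once per page.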
Tuning Knobs
Due to the increasing speed gap between memory and disk, increasing memory sizes, and the increasing complexity of large systems, VM developers cannot pawn off responsibility for a working system onto the system administrator by providing dozens of tuning knobs.
Other considerations
Since we fault pages in, by definition a page is going to be used (readahead aside) right after we switch back to userspace. Hence we effectively insert pages with their referenced bit set. Since most algorithms assume pages are inserted with the referenced bit unset, they need to be modified so that pages are not promoted on their first reference (use-once).
User requests
Some of these requests should be taken with a grain of salt: not because the users have no genuine need for a fix to their problem, but because alternative solutions may be possible (and sometimes better).
Page cache size limits
Containers / selective reclaim
Container technologies, like CKRM and userbeans, want to be able to limit the amount of memory a group of processes can use. This would require the VM to evict pages belonging only to a certain group of processes.
This page is part of CategoryAdvancedPageReplacement