= Linux Page Replacement Requirements =

Like any other OS, Linux has the usual 'cheap' requirements for its page replacement. However, several additional requirements make implementing the page replacement algorithms found in papers and textbooks non-trivial. To make matters worse, most page replacement algorithms in the literature only deal with identifying which pages to replace, not with what to do once such pages have been identified.

[[TableOfContents]]

= Requirements shortlist =

 1. Must select good pages for eviction.
 1. Must not submit too much I/O at once. Submitting too much I/O at once can kill latency and even lead to deadlocks when bounce buffers (highmem) are involved.
 1. Must be able to efficiently evict the pages on which pageout I/O has completed.
 1. Must be able to deal with multiple memory zones efficiently.
 1. Must always have some pages ready to evict. Scanning 32GB of "recently referenced" memory is not an option when memory gets tight.

= Pageout selection =

== Scan Resistant ==

A large sequential scan (e.g. a daily backup) should not be able to push the working set out of memory.

== Effective as second level cache ==

The only hits in a second level cache are the cache misses from the primary cache. This means that the inter-reference distances on, say, a file server may be very large. A page replacement algorithm should be able to detect and cache the most frequently accessed pages even in this case.

== Recency vs. Frequency ==

Which of the two is more important depends entirely on the workload. It would be nice if the pageout selection algorithm adapted automatically.

= Environmental Considerations =

== Multiple Zones ==

Unlike most operating systems, Linux uses multiple memory zones; that is, memory is not viewed as one big contiguous region. There are specific sections of memory in which it must be possible to free pages; think of NUMA topologies, or DMA engines that cannot address the full address space. Hence memory is divided into multiple zones.

For traditional page replacement algorithms this is not a big issue: we simply implement per-zone page replacement, e.g. a CLOCK per zone. With the non-resident page state tracking introduced by recent algorithms, however, it does become a problem. Since a page can fault into a different zone than the one it was evicted from, non-resident page state must be tracked across all of memory, not just a single zone. The result is per-zone resident page tracking combined with global non-resident page tracking; this separation is not present in several proposed algorithms, which makes implementing them a challenge. (A toy sketch of this split appears below, after the Insert Referenced section.)

== Asynchronous Page-Out ==

The page-out operation is not synchronous. Dirty pages that are selected for reclaim are not freed directly; instead, writeback is started against them (PG_writeback is set) and they are fed back to the resident list. When, on completion of the write to the backing store, the reference bit is still unset, a callback (rotate_reclaimable_page) is invoked to place them where they are immediate candidates for reclaim again. When scanning for reclaimable pages, make sure you do not get stuck on a writeback-saturated list. (The second sketch below illustrates this rotation.)

== Insert Referenced ==

Since pages enter memory through a fault, a page is by definition going to be used right after we switch back to userspace (readahead pages being the possible exception). Hence we effectively insert pages with their reference bit set. Since most algorithms assume pages are inserted with their reference bit unset, the algorithms need to be modified so that pages are not promoted on their first reference (use-once); see the third sketch below.
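To make the zone split concrete, here is a toy userspace sketch of per-zone resident tracking next to a single global table of non-resident page cookies. Every name in it (zone, nonres_table, evict_page, was_evicted) is invented for illustration; a real implementation also needs eviction ordering and hash collision handling, which this omits.

{{{
#include <stddef.h>

/*
 * Toy sketch: resident pages are tracked per zone, non-resident
 * (evicted) page state lives in one global table.
 */
#define MAX_ZONES	3
#define NONRES_SLOTS	(1 << 20)

struct page;                         /* resident page, details elided */

struct zone {
    struct page *clock_hand;         /* one CLOCK of resident pages per zone */
    size_t nr_pages;
};

static struct zone zones[MAX_ZONES]; /* per-zone resident tracking */

/*
 * Global non-resident state: one cookie per recently evicted page,
 * indexed by a hash of (mapping, offset).  Because a page may fault
 * back into a different zone than it was evicted from, this table
 * has to span all zones.
 */
static unsigned long nonres_table[NONRES_SLOTS];

static size_t nonres_slot(unsigned long mapping, unsigned long offset)
{
    return (mapping * 31 + offset) % NONRES_SLOTS;  /* toy hash */
}

/* Evict a page from its zone, remembering it globally. */
static void evict_page(struct zone *z, unsigned long mapping,
                       unsigned long offset)
{
    z->nr_pages--;                   /* resident accounting stays per zone */
    nonres_table[nonres_slot(mapping, offset)] = mapping ^ offset;
}

/* On a fault, any zone can ask: was this page resident until recently? */
static int was_evicted(unsigned long mapping, unsigned long offset)
{
    return nonres_table[nonres_slot(mapping, offset)] == (mapping ^ offset);
}

int main(void)
{
    zones[0].nr_pages = 1;
    evict_page(&zones[0], 0xabcd, 42);   /* evicted from zone 0 */
    /* The refault may land in any other zone; the lookup stays global. */
    return was_evicted(0xabcd, 42) ? 0 : 1;
}
}}}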
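Next, a sketch of the rotation described under Asynchronous Page-Out above. This is plain userspace C, not kernel code: the list handling and shrink_one() are simplified stand-ins, and end_page_writeback() only mimics the role its kernel namesake plays together with rotate_reclaimable_page().

{{{
#include <stdbool.h>
#include <stdio.h>

struct page {
    int id;
    bool dirty, writeback, referenced;
    struct page *prev, *next;
};

/* The inactive list: new pages enter at the head, reclaim scans the tail. */
static struct page inactive = { .prev = &inactive, .next = &inactive };

static void list_del(struct page *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
}

static void add_head(struct page *p)     /* youngest end */
{
    p->prev = &inactive;
    p->next = inactive.next;
    inactive.next->prev = p;
    inactive.next = p;
}

static void add_tail(struct page *p)     /* oldest end, scanned first */
{
    p->next = &inactive;
    p->prev = inactive.prev;
    inactive.prev->next = p;
    inactive.prev = p;
}

/*
 * One reclaim step.  A clean page can be evicted immediately; a dirty
 * page is not freed, but has write-out started against it and goes
 * back on the list while the I/O is in flight.
 */
static void shrink_one(void)
{
    struct page *p = inactive.prev;

    if (p == &inactive)
        return;                          /* list is empty */
    list_del(p);
    if (p->dirty) {
        p->dirty = false;
        p->writeback = true;             /* the PG_writeback analogue */
        add_head(p);                     /* fed back to the resident list */
        printf("page %d: writeback started\n", p->id);
    } else {
        printf("page %d: evicted\n", p->id);
    }
}

/*
 * I/O completion callback: if nobody referenced the page while the
 * write was in flight, rotate it to the tail so it is reclaimed next,
 * instead of waiting for the scan to come around again.
 */
static void end_page_writeback(struct page *p)
{
    p->writeback = false;
    if (!p->referenced) {
        list_del(p);
        add_tail(p);
    }
}

int main(void)
{
    struct page a = { .id = 1, .dirty = true };
    struct page b = { .id = 2 };

    add_head(&b);
    add_head(&a);

    shrink_one();               /* evicts the clean page 2 */
    shrink_one();               /* starts write-out on dirty page 1 */
    end_page_writeback(&a);     /* unreferenced: rotated to the tail */
    shrink_one();               /* page 1 evicted, now clean */
    return 0;
}
}}}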
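Finally, a minimal sketch of a use-once modification for the insert-referenced problem, assuming a simple inactive/active scheme; the struct layout, the fresh flag and the function names are all invented for illustration.

{{{
#include <stdbool.h>
#include <stdio.h>

struct page {
    bool referenced;    /* (software copy of) the reference bit */
    bool fresh;         /* set at fault time, cleared on the first scan */
    bool active;        /* promoted to the active list? */
};

/* Pages fault in with the reference bit already set: the faulting access. */
static void insert_page(struct page *p)
{
    p->referenced = true;
    p->fresh = true;
    p->active = false;
}

/*
 * One scan pass over an inactive page; returns true if it may be evicted.
 * The insertion reference is consumed without promoting the page, so a
 * page has to be referenced a second time to make the active list.
 */
static bool scan_inactive_page(struct page *p)
{
    if (!p->referenced)
        return true;            /* idle since the last pass: evict */
    p->referenced = false;
    if (p->fresh)
        p->fresh = false;       /* first (fault-time) reference: ignore */
    else
        p->active = true;       /* referenced again: genuinely hot */
    return false;
}

int main(void)
{
    struct page streaming, reused;
    bool evict;

    insert_page(&streaming);    /* only ever touched by the initial fault */
    insert_page(&reused);

    scan_inactive_page(&streaming);  /* first pass eats the fault reference */
    scan_inactive_page(&reused);

    reused.referenced = true;        /* the application touches it again */

    printf("streaming page evictable: %d\n", scan_inactive_page(&streaming));
    evict = scan_inactive_page(&reused);
    printf("reused page evictable: %d, promoted: %d\n", evict, reused.active);
    return 0;
}
}}}

The streaming page is reclaimed after a single round trip despite being inserted referenced, while the reused page is promoted; that is exactly the behaviour a use-once modification is meant to restore.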
== Expensive Referenced Check ==

Because multiple page table entries can refer to the same physical page, checking the referenced bit is not as cheap as most algorithms assume it is (rmap). Hence we need to do the check without holding most locks, which suggests a batched approach to minimize the lock/unlock frequency. Modifying algorithms to work this way is usually not very hard; the sketch below shows the idea.

This page is part of CategoryAdvancedPageReplacement
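As an illustration of such batching, here is a userspace sketch: a plain pthread mutex stands in for the LRU list lock, and page_referenced() merely mirrors the name of the kernel's rmap-based check; the array-backed list and the batch size are invented for this example. Compile with -pthread.

{{{
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BATCH	32
#define NPAGES	128

struct page {
    bool referenced;
};

static struct page pages[NPAGES];
static struct page *lru[NPAGES];     /* treated as a simple stack */
static size_t lru_len;
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for the expensive part: in the kernel this walks the reverse
 * mappings to test-and-clear the referenced bit in every PTE mapping
 * the page, which is why we want it to run without the list lock held.
 */
static bool page_referenced(struct page *p)
{
    bool ref = p->referenced;

    p->referenced = false;
    return ref;
}

static void scan_batch(void)
{
    struct page *batch[BATCH];
    size_t n = 0, i;

    /* One lock round-trip isolates a whole batch of pages. */
    pthread_mutex_lock(&lru_lock);
    while (lru_len > 0 && n < BATCH)
        batch[n++] = lru[--lru_len];
    pthread_mutex_unlock(&lru_lock);

    /* The costly referenced checks run with the lock dropped. */
    for (i = 0; i < n; i++)
        if (!page_referenced(batch[i]))
            batch[i] = NULL;         /* unreferenced: reclaim it */

    /* A second round-trip puts the survivors back. */
    pthread_mutex_lock(&lru_lock);
    for (i = 0; i < n; i++)
        if (batch[i])
            lru[lru_len++] = batch[i];
    pthread_mutex_unlock(&lru_lock);
}

int main(void)
{
    size_t i;

    for (i = 0; i < NPAGES; i++) {
        pages[i].referenced = (i % 2 == 0);  /* every other page "hot" */
        lru[lru_len++] = &pages[i];
    }
    scan_batch();
    printf("pages left after one batch: %zu\n", lru_len);  /* prints 112 */
    return 0;
}
}}}

The point is the two lock round-trips per batch of 32 pages, instead of two per page.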