This page describes a new page replacement design by Rik van Riel, Lee Schermerhorn and others. This design should meet the most important [:PageReplacementRequirements:page replacement requirements] as well as fix the VM behaviour in certain ProblemWorkloads.

If you spot any conceptual problems or oversights in this design, or have questions on the details, please email or IRC RikvanRiel. Once nobody can find holes, the concepts will probably work :)

High level overview


File cache

Anonymous memory

Non reclaimable

Recently evicted / nonresident

Merge plan

Old page contents

Design tenets

Design details

IO cost

The cost we want to measure is the number of disk seeks required for evicting and bringing back in each type of page (anonymous vs file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, ie. multiple useful pages per disk seek.

The cost of IO of each kind of pages is the sum of the "in" and the "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics.

Because most anonymous pages get instantiated "out of nowhere", the initial IO cost for anonymous pages will have to be initialized to something reasonable. Probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max.

Pool sizing

When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:

The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools.

The per page activity in each pool can be calculated: (fraction referenced * scan rate) / pool size

Pageout selection

Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later, leaving the current Linux pageout algorithm in place at first and only moving to a more scan resistant algorithm later in the future - if at all.

Incremental implementation

To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself.

Things can be implemented in this order:

  1. Measure the page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete for /proc/vmstat. Maybe a decaying average or two will need to be added?
  2. Split anonymous vs. file backed pageout queues. (assume swap IO is 4x as expensive as file IO?)
  3. Actually measure the IO cost of each, and use that.
  4. Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads?
  5. Experiment with other replacement algorithms - do any of them give us a significant benefit on some workloads?


LinuxMM: PageReplacementDesign (last edited 2007-10-25 14:05:26 by bree)