This page describes the Split LRU page replacement design by Rik van Riel, Lee Schermerhorn, Kosaki Motohiro and others. It was merged in the 2.6.28 kernel and continued to receive fixes and refinements until it worked well, around 2.6.32. The tracking of recently reclaimed pages has not been integrated yet; instead, a simpler way of sizing the various LRU lists was implemented.

The problem

The old page replacement mechanism in Linux (up to and including 2.6.27) has two major classes of problems:

More details on the problem space can be found on ProblemWorkloads and page replacement requirements.

High level overview


File cache

Anonymous memory

Non reclaimable

Recently evicted / nonresident

OOM killer

Merge plan

Old page contents

Design tenets

Design details

IO cost

The cost we want to measure is the number of disk seeks required to evict and bring back in each type of page (anonymous vs. file backed). In many workloads, IO clustering will be able to bring this down to less than one disk seek per page, i.e. multiple useful pages per disk seek.

The IO cost of each kind of page is the sum of the "in" and "out" IO costs. We will probably want to work with a floating average (over the last 25% of the pages in the pool?) for each of these statistics.

Because most anonymous pages get instantiated "out of nowhere", the initial IO cost for anonymous pages will have to be initialized to something reasonable. Probably something like 1/4 to 1/2 of the maximum efficiency we could reach based on swap_cluster_max.

Pool sizing

When deciding whether one pool needs to grow (at the expense of the other), a number of factors can be taken into account:

The basic thing we want to measure here is the per-page pressure in each pool. The VM tries to equalize (pressure * IO cost) between the two pools.

The per-page activity in each pool can be calculated as (fraction referenced * scan rate) / pool size.

Pageout selection

Within each pool, the VM needs to select pages to evict. Large IO should not put undue pressure on the more actively used pages in the pool. An algorithm like CLOCK-Pro may be ideal here, but basically any pageout selection algorithm can be integrated into this basic design. This is something we can do later, leaving the current Linux pageout algorithm in place at first and only moving to a more scan-resistant algorithm later, if at all.

Incremental implementation

To avoid destabilizing Linux, the changes suggested above will need to be implemented, tested and maybe integrated one by one. Each change will need to be relatively contained and make sense all by itself.

Things can be implemented in this order:

  1. Measure the page utilization (% of referenced pages, scan rate, rotate rate, etc). Chances are these statistics are already pretty complete in /proc/vmstat. Maybe a decaying average or two will need to be added?
  2. Split anonymous vs. file backed pageout queues. (assume swap IO is 4x as expensive as file IO?)
  3. Actually measure the IO cost of each, and use that.
  4. Experiment with /proc/refault - does the data from refaults help with pool sizing in some workloads?
  5. Experiment with other replacement algorithms - do any of them give us a significant benefit on some workloads?

LinuxMM: PageReplacementDesign (last edited 2017-12-30 01:05:11 by localhost)