Linux implements an LRU based page replacement algorithm. Although simple, it is suboptimal for a wide range of important workloads.
At the present moment there are a few experimental implementations of advanced replacement algorithms being developed and discussed.
This page attempts to detail known LRU deficiencies, along with benchmarks to highlight them.
In addition to benchmarks, UMass has a Trace Repository, with CPU/memory, network and I/O traces. This can be useful to unit-test replacement algorithms, free from interference by other parts of the OS.
The most significant issue with LRU is that it uses too little information to base the replacement decision: recency alone.
It does not take frequency of page accesses into account.
Here is one example from the LRU-K paper.
"Consider a multi-user database application, which references randomly chosen customer records through a clustered B-tree indexed key, CUST-ID, to retrieve desired information. Assume simplistically that 20,000 customers exist, that a customer record is 2000 bytes in length, and that space needed for the B-tree index at the leaf level, free space included, is 20 bytes for each key entry. Then if disk pages contain 4000 bytes of usable space and can be packed full, we require 100 pages to hold the leaf level nodes of the B-tree index (there is a single B-tree root node), and 10,000 pages to hold the records. The pattern of reference to these pages (ignoring the B-tree root node) is clearly: 11, R1, 12, R2, 13, R3, ... alternate references to random index leaf pages and record pages. If we can only afford to buffer 101 pages in memory for this application, the B-tree root node is automatic; we should buffer all the B-tree leaf pages, since each of them is referenced with a probability of .005 (once in each 200 general *page* references), while it is clearly wasteful to displace one of these leaf pages with a data *page*, since data pages have only .00005 probability of reference (once in each 20,000 general *page* references). Using the LRU *algorithm*, however, the pages held in memory buffers will be the hundred most recently referenced ones. To a first approximation, this means 50 B-tree leaf pages and 50 record pages. Given that a *page* gets no extra credit for being referenced twice in the recent past and that this is more likely to happen with B-tree leaf pages, there will even be slightly more data pages present in memory than leaf pages. This is clearly inappropriate behavior for a very common paradigm of disk accesses."
Its possible to reproduce the above scenario using "miniDB", a small and simple database program which indexes its records in a B-tree. Here you can find a modified mdb-2.1 tarball which contains a series of small scripts and some modifications to help benchmarking.
- unpack the tarball, run "make".
- create a directory 'data', run "mkdir data"
- Create and fill a large database. By default 1.2 million records are created, each record holds a bit more than 1024 bytes, resulting in an index of ~70Mb and a data file of ~700Mb.
- # sh create_largedb.sh ; sh fill_largedb.sh
- Run the test, performing 1.2 million queries in a randomic fashion:
- # ./mdb -x 1200000
Hardware: UP Athlon 1600+ booted with 96Mb RAM.
Locking the index file on memory exhibits 3% improvement on execution time.
Note that an "rsync" of a 250Mb kernel tree was running on the background during the mlock test, while plain 2.6.13 system was dedicated to the benchmark. So in practice the improvement of this test running on this system is likely more than 3%.
A frequency+recency based page replacement algorithm should retain the higher frequency pages (from the index file in our case) in memory, achieving similar improvements to what was achieved with mlock.
Large sequential scans
Another common scenario which LRU underperforms are large sequential use-once scans.
Since the pages being brought in memory have higher recency than the working set, LRU happily evicts the working set in favor of the large data set.
Linux contains heuristics to workaround that issue, which should not be necessary once frequency based replacement is in place.
A benchmark is available here: scan.c cfile. It simple, it does scans of anonymous mapped memory, both cyclic and use once.
# ./scan <map size> <loops>