
It's a common dilemma: I just got a brand new Linux machine, loaded it up with lots of expensive RAM, and left it for a day. Now, it's out of memory, or it is swapping! It definitely has enough RAM, so there must be a bug in Linux! I assure you, in almost all cases, your system has plenty of RAM. So where is all of that RAM going? What does Linux use it all for? And how can you actually tell when you are out of RAM? Unfortunately, Linux can make these very hard questions to answer. This article will explain in detail many of the ways that Linux uses RAM for things other than user data, and how you can tell when your system is _actually_ out of RAM.

---

Linux has this basic rule: a page of free RAM is wasted RAM. RAM is used for a lot more than just user application data. It also stores data for the kernel itself and, most importantly, can mirror data stored on the disk for super-fast access. These in-memory mirrors are very important because of how much faster RAM is to access than the disk. Ever notice how long it takes to start up a web browser the first time after your system boots? Have you ever loaded it a second time and had it pop up almost immediately? The greatly reduced start time is because of these in-memory mirrors of on-disk data. These mirrors take several forms in Linux, so let's examine each of them. These data are largely enumerated in the /proc/meminfo file, and we will refer to its contents regularly.

Here is the output from the author's 2GB laptop running kernel version 2.6.20:

MemTotal:      2073564 kB
MemFree:       1259628 kB
Buffers:         27924 kB
Cached:         176764 kB
SwapCached:     285188 kB
Active:         562120 kB
Inactive:       145592 kB
HighTotal:     1179008 kB
HighFree:       562948 kB
LowTotal:       894556 kB
LowFree:        696680 kB
SwapTotal:     1992052 kB
SwapFree:      1167632 kB
Dirty:            9052 kB
Writeback:           0 kB
AnonPages:      437520 kB
Mapped:          49800 kB
Slab:            91332 kB
SReclaimable:    64816 kB
SUnreclaim:      26516 kB
PageTables:       4872 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   3028832 kB
Committed_AS:  2402708 kB
VmallocTotal:   114680 kB
VmallocUsed:      6112 kB
VmallocChunk:   108044 kB

Your /proc/meminfo file may contain different entries than this. The kernel developers have been slowly adding to this file over the years and it has grown. Some distributions also add their own custom entries to this file. Do not worry if your file differs slightly from this one.

Page Cache

The Linux page cache ("Cached:" in meminfo) is the largest single consumer of RAM on most systems. Any time you do a read() from a file on disk, that data is read into memory and goes into the page cache (1). After this read() completes, the kernel has the option to simply throw the page away since it is not being used. However, if you do a second read of the same area of the file, the data will be read directly out of memory and no trip to the disk will be taken. This is an incredible speedup and is the reason why Linux uses its page cache so extensively: it is betting that after you access a page on disk a single time, you will soon access it again.
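As a rough illustration (the file path below is only a placeholder for any large file on your disk, and the exact numbers will vary), you can watch "Cached:" grow the first time a file is read:

{{{
grep ^Cached: /proc/meminfo
dd if=/path/to/some/large/file of=/dev/null bs=1M
grep ^Cached: /proc/meminfo
}}}

Running the dd a second time should finish noticeably faster, since the data is now served out of the page cache instead of the disk.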

The same is true for mmap()'d files ("Mapped:" in meminfo). The first time the mmap()'d area is accessed, the page will be brought in from the disk and mapped into memory. The kernel could choose to immediately discard that page after the instruction that touched the page has completed. However, the kernel makes the same bets that it did for simple read()s of the file. It keeps the page mapped into memory and bets that you will soon access it again.

This causes confusion for some people. They assume that mmap()'d memory is not "cached" because it is in active use; they assume that "cached" means "completely unused right now". However, Linux does not define it that way. The Linux definition is closer to "this is a copy of data from the disk that we have here to save you time". It implies nothing about how the page is actually being used. This is why we have both "Cached:" and "Mapped:" in meminfo: all "Mapped:" memory is "Cached:", but not all "Cached:" memory is "Mapped:".
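You can see this relationship on your own system by looking at the two entries together; per the definition above, "Mapped:" is a subset of "Cached:" and should normally be the smaller of the two:

{{{
grep -E '^(Cached|Mapped):' /proc/meminfo
}}}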

dentry/inode caches

Each time you do an 'ls' (or any other operation: open(), stat(), etc.) on a filesystem, the kernel needs data that lives on the disk. The kernel parses this on-disk data and puts it into filesystem-independent structures so that it can be handled in the same way across all of the different filesystems. In the same fashion as the page cache in the above examples, the kernel has the option of throwing these structures away once the 'ls' is completed. However, it makes the same bets as before: if you read it once, you're bound to read it again. The kernel stores this information in several "caches" called the dentry and inode caches. dentries are common across all filesystems, but each filesystem has its own cache for inodes. You can view the different caches and their sizes by executing this command:

head -2 /proc/slabinfo; cat /proc/slabinfo  | egrep dentry\|inode

(This RAM is a component of "Slab:" in meminfo.)
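As a rough, purely illustrative experiment, you can watch these caches grow by walking a large directory tree and then re-running the command above (on some systems /proc/slabinfo is only readable by root):

{{{
head -2 /proc/slabinfo; egrep 'dentry|inode' /proc/slabinfo
find /usr > /dev/null 2>&1     # touches many dentries and inodes
head -2 /proc/slabinfo; egrep 'dentry|inode' /proc/slabinfo
}}}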

Older kernels (around 2.6.9) left some structures in the slab cache for longer than newer kernels do. That means that even though they may be quite unused, they are left around until there is memory pressure. This happens especially with proc_inodes. /proc inodes also happen to pin task_structs, which means that each one can effectively occupy over 2 KBytes of RAM. This RAM won't show up as "Cached:" and may appear to be a kernel memory leak. On a system with only 100 tasks (and little memory pressure) there can be hundreds of thousands of these left lying around.

They're harmless but, on the surface, might look like a kernel memory leak. To be sure, try this command:

{{{echo 2 > /proc/sys/vm/drop_caches }}}

If the numbers of task_struct and proc_inode_cache objects decrease, then there's no real bug.
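For example, an illustrative before-and-after check might look like this (writing to drop_caches requires root):

{{{
egrep 'task_struct|proc_inode_cache' /proc/slabinfo
echo 2 > /proc/sys/vm/drop_caches
egrep 'task_struct|proc_inode_cache' /proc/slabinfo
}}}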

Buffer Cache

The buffer cache ("Buffers:" in meminfo) is a close relative to the dentry/inode caches. The dentries and inodes in memory represent structures on disk, but are laid out very differently. This might be because we have a kernel structure like a pointer in the in-memory copy, but not on disk. It might also happen that the on-disk format is a different endianness than CPU.

In any case, when we need to fetch an inode or dentry to populate the caches, we must first bring in the page from disk on which those structures are represented. This cannot be a part of the page cache because it is not actually the contents of a file; rather, it is the raw contents of the disk. A page in the buffer cache may have dozens of on-disk inodes inside of it, even though we only created an in-memory inode for one of them. The buffer cache is, again, a bet: the kernel wagers that it will soon need another inode from the same group and will save a trip to the disk by keeping this buffer page in memory.
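One way to see the buffer cache at work (the device name below is only an example, and reading a raw block device requires root) is to read directly from the disk device and watch "Buffers:" grow:

{{{
grep ^Buffers: /proc/meminfo
dd if=/dev/sda of=/dev/null bs=1M count=100
grep ^Buffers: /proc/meminfo
}}}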

Running out of memory

Now that you know all of these wonderful uses for your free RAM, you have to be wondering: what happens when there is no more free RAM? If I have no memory free, and I need a page for the page cache, inode cache, or dentry cache, where do I get it?

First of all, the kernel tries not to let you get close to 0 bytes of free RAM. This is because, in order to free up RAM, you usually need to allocate more. Have you ever gone to start a large project at your desk and realized that you needed to clear off a work area before you could begin? The kernel needs the same kind of "working space" for its own housekeeping.

Based on the amount of RAM and the different types of memory (high/low memory (2)), the kernel comes up with a heuristic for the amount of memory that it feels comfortable with as its working space. When free memory dips below this watermark, the kernel starts to reclaim memory from the different uses described above. The kernel can get memory back from any of them.
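The baseline for this working space is exposed through the min_free_kbytes tunable (the per-zone watermarks the kernel actually checks are derived from this value):

{{{
cat /proc/sys/vm/min_free_kbytes
}}}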

However, there is another user of memory that we may have forgotten about by now: user application data.

Swapping

meminfo entries:

SwapTotal:     1992052 kB
SwapFree:      1167632 kB

When the kernel decides not to get memory from any of the other sources we've described so far, it starts to swap. During this process it takes user application data and writes it to a special place (or places) on the disk. You might think that this should only be a last resort once we are completely unable to free any of the other types of RAM. However, the kernel does not do it this way. Why?

Consider an application like /sbin/init. It has some incredibly important duties, like setting up the system at startup and respawning login prompts when they die. But how much of its data is actually used during the normal runtime of the system? If the system is at its limits and just about out of RAM, should we swap out a page of /sbin/init's data that has gone completely unused since boot, and give that page to the page cache? Or should we keep /sbin/init completely in memory and force the potential page cache user to go to the disk?

The kernel will often choose to swap out /sbin/init's data in favor of the current needs of the currently running applications. For this reason, even a system with vast amounts of RAM (even when properly tuned) can swap. There are lots of pages of memory which are user application data, but are rarely used. All of these are targets for being swapped in favor of other uses for the RAM.

But, if the mere presence of used swap is not evidence of a system which has too little RAM for its workload, what is? As you can see, swap is most efficiently used for data which will not be accessed for a long time. If data in swap is being constantly accessed, then swap is failing to be used effectively. We can monitor the amount of data going in and out of swap with the vmstat command. The following will produce output every 5 seconds:

$ vmstat 5
procs -----------memory---------- ---swap--- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si    so    bi    bo   in   cs us sy id wa
 3  0 833704  54824  25196 328672   10     0   343    18  510 1382 96  4  0  0
 6  0 833704  54556  25092 324584    0     0   333    22  504 1180 93  7  0  0
 4  0 833704  51516  25112 320856   33     0   315    19  508 1234 95  5  0  0
 3  0 833704  54836  24984 314404    6     0   223    27  498 1191 95  5  0  0
 3  0 833704  53072  24944 307844    4     0   216    22  518 1375 96  4  0  0
 5  0 833704  53928  24888 304076    6     0   262    18  548 1665 94  6  0  0
 3  4 843964  50192    184  58064   16  2416    16  2464  570 1451 78 22  0  0
 3  7 908244  48756    224  47760  118 13645   149 13664  730 1245 76 16  0  8
 3  2 922064  54280    340  49228 1470  2838  1817  2865  711 1481 88 12  0  0
 4  2 932644  54068    424  52204 1972  2195  2596  2211  678 1388 90 10  0  0
 2  3 944012  56304    492  52292 2986  2591  3063  2615  735 1562 89 11  0  0
 2  4 957304  54604    572  51964 4042  3414  4096  3438  852 1808 88 12  0  0
...

The columns we are most interested in are "si" and "so", which are abbreviations for "swap in" and "swap out". You can interpret them this way: "si" is the amount of data being read back in from swap, and "so" is the amount of user application data being written out to swap. Occasional bursts in either column are normal, but if both stay high at the same time for long periods, pages are being swapped out and then needed again almost immediately, and the system genuinely does not have enough RAM for its workload.

The above vmstat example shows a normally running system which then has a very large, memory-hungry application start up. Around the seventh data row, "so" jumps as the new application's demands force other user data out to swap; over the following rows "si" rises as some of that swapped-out data is needed and read back in.

Swap Cache

The swap cache is very similar in concept to the page cache. A page of user application data written to disk is very similar to a page of file data on the disk. Any time a page is read in from swap ("si" in vmstat), it is placed in the swap cache. Just like the page cache, this is a bet on the kernel's part. It is betting that we might need to swap this page out _again_. If that need arises, we can detect that there is already a copy on the disk and simply throw the page in memory away immediately. This saves us the cost of re-writing the page to the disk.

The swap cache is really only useful when we are reading data from swap and never writing to it. If we write to the page, the copy on the disk is no longer in sync with the copy in memory. If this happens, we have to write the page out to swap again, just like we did the first time. However, the savings from avoiding _any_ writes to disk are great, and even with only a small portion of the swap cache ever written to, the system will perform better.
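The current size of the swap cache shows up directly in meminfo; in the sample output above it was 285188 kB:

{{{
grep ^SwapCached: /proc/meminfo
}}}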

Dirty Writeout

Another operation that occurs when we start to run out of memory is the writing of dirty data ("Dirty:" from meminfo) to disk. Dirty data is page cache to which a write has occurred. Before we can free that page cache, we must first update the original copy on disk with the data from the write. As free memory dips below the min_free_kbytes value, the system will attempt to free page cache. It is very common to find such dirty pages during this scan, and the kernel will initiate writeback whenever it finds them. You can see this happening when "Dirty:" decreases at the same time as "bo" (blocks written out) from vmstat goes up.

The kernel may request that many pages be written to the disk in parallel. This speeds disk operations up by batching them together, or spanning them across several disks. When the kernel is actively trying to update on-disk data for a page, it will increment the meminfo "Writeback:" entry for that page.

The "sync" command will force all dirty data to be written out and "Dirty:" to drop to a very low value momentarily.

---

1. except for O_DIRECT
2. Need a link explaining high/lowmem

---

Input from John Stultz, Dave Kleikamp, and Tim Pepper.
