converted to 1.6 markup
|Deletions are marked like this.||Additions are marked like this.|
|Line 58:||Line 58:|
|They're harmless. But, on the surface, might manifest as a kernel memory leak. To be sure, try [:Drop_Caches:this procedure] for dentries and inodes. If the numbers of task_struct and proc_inode_cache objects decrease, then there's no real bug.||They're harmless. But, on the surface, might manifest as a kernel memory leak. To be sure, try [[Drop_Caches|this procedure]] for dentries and inodes. If the numbers of task_struct and proc_inode_cache objects decrease, then there's no real bug.|
It's a common dilemma: I just got a brand new Linux machine, loaded it up with lots of expensive RAM, and left it for a day. Now, it's out of memory, or it is swapping! It definitely has enough RAM, so there must be a bug in Linux!I assure you, in almost all cases, your system has plenty of RAM. However, where is all of that RAM going? For what does Linux use all of it? How can I actually tell that I'm out of RAM? Unfortunately, Linux can make these very hard questions to answer. This article will explain in detail many of the ways that Linux uses RAM for things other than user data and how you can tell when your system is _actually_ out of RAM.
Linux has this basic rule: a page of free RAM is wasted RAM. RAM is used for a lot more than just user application data. It also stores data for the kernel itself and, most importantly, can mirror data stored on the disk for super-fast access. These in-memory mirrors are very important because of how much faster RAM is to access than the disk. Ever notice how long it takes to start up a web browser the first time after your system boots? Have you ever loaded it a second time and had it pop up almost immediately? The greatly reduced start time is because of these in-memory mirrors of on-disk data. These mirrors take several forms in Linux, so let's examine each of them. These data are largely enumerated in the /proc/meminfo file, and we will refer to its contents regularly.
Here is the output from the author's 2GB laptop running kernel version 2.6.20:
MemTotal: 2073564 kB MemFree: 1259628 kB Buffers: 27924 kB Cached: 176764 kB SwapCached: 285188 kB Active: 562120 kB Inactive: 145592 kB HighTotal: 1179008 kB HighFree: 562948 kB LowTotal: 894556 kB LowFree: 696680 kB SwapTotal: 1992052 kB SwapFree: 1167632 kB Dirty: 9052 kB Writeback: 0 kB AnonPages: 437520 kB Mapped: 49800 kB Slab: 91332 kB SReclaimable: 64816 kB SUnreclaim: 26516 kB PageTables: 4872 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 3028832 kB Committed_AS: 2402708 kB VmallocTotal: 114680 kB VmallocUsed: 6112 kB VmallocChunk: 108044 kB
Your /proc/meminfo file may contain different entries than this. The kernel developers have been slowly adding to this file over the years and it has grown. Some distributions also add their own custom entries to this file. Do not worry if your file differs slightly from this one.
The Linux Page Cache ("Cached:" from meminfo ) is the largest single consumer of RAM on most systems. Any time you do a read() from a file on disk, that data is read into memory, and goes into the page cache(1.). After this read() completes, the kernel has the option to simply throw the page away since it is not being used. However, if you do a second read of the same area in a file, the data will be read directly out of memory and no trip to the disk will be taken. This is an incredible speedup and is the reason why Linux uses its page cache so extensively: it is betting that after you access a page on disk a single time, you will soon access it again.
The same is true for mmap()'d files ("Mapped:" in meminfo). The first time the mmap()'d area is accessed, the page will be brought in from the disk and mapped into memory. The kernel could choose to immediately discard that page after the instruction that touched the page has completed. However, the kernel makes the same bets that it did for simple read()s of the file. It keeps the page mapped into memory and bets that you will soon access it again. This behavior can manifest in confusing ways.
It might be assumed that mmap()'d memory is not "cached" because it is in active use and that "cached" means "completely unused right now". However, Linux does not define it that way. The Linux definition of "cached" is closer to "this is a copy of data from the disk that we have here to save you time". It implies nothing about how the page is actually being used. This is why we have both "Cached:" and "Mapped:" in meminfo. All "Mapped:" memory is "Cached:", but not all "Cached:" memory is "Mapped:".
Each time you do an 'ls' (or any other operation: open(), stat(), etc...) on a filesystem, the kernel needs data which are on the disk. The kernel parses these data on the disk and puts it in some filesystem-independent structures so that it can be handled in the same way across all different filesystems. In the same fashion as the page cache in the above examples, the kernel has the option of throwing away these structures once the 'ls' is completed. However, it makes the same bets as before: if you read it once, you're bound to read it again. The kernel stores this information in several "caches" called the dentry and inode caches. dentries are common across all filesystems, but each filesystem has its own cache for inodes. You can view the different caches and their sizes by executing this command:
head -2 /proc/slabinfo; cat /proc/slabinfo | egrep dentry\|inode
(this ram is a component of "Slab:" in meminfo)
Older kernels (around 2.6.9) left some structures in the slab cache for longer than new kernels do. That means that even though they may have been quite unused, they're left until there is memory pressure. This happens especially with proc_inodes. /proc inodes also happen to pin task_structs, which means that each one can effectively occupy over 2 KBytes of RAM. This RAM won't show up as 'Cached' and may appear to be a kernel memory leak. On a system with only 100 tasks (with little memory pressure) there can be hundreds of thousands of these left laying around.
They're harmless. But, on the surface, might manifest as a kernel memory leak. To be sure, try this procedure for dentries and inodes. If the numbers of task_struct and proc_inode_cache objects decrease, then there's no real bug.
The buffer cache ("Buffers:" in meminfo) is a close relative to the dentry/inode caches. The dentries and inodes in memory represent structures on disk, but are laid out very differently. This might be because we have a kernel structure like a pointer in the in-memory copy, but not on disk. It might also happen that the on-disk format is a different endianness than CPU.
In any case, when we need to fetch an inode or dentry to populate the caches, we must first bring in a the page from disk on which those structures are represented. This can not be a part of the page cache because it is not actually the contents of a file, rather it is the raw contents of the disk. A page in the buffer cache may have dozens of on-disk inodes inside of it, although we only created an in-memory inode for one. The buffer cache is, again, a bet that the kernel will need another in the same group of inodes and will save a trip to the disk by keeping this buffer page in memory.
Running out of memory
Now that you know all of these wonderful uses for your free RAM, you have to be wondering: what happens when there is no more free RAM? If I have no memory free, and I need a page for the page cache, inode cache, or dentry cache, where do I get it?
First of all the kernel tries not to let you get close to 0 bytes of free RAM. This is because, to free up RAM, you usually need to allocate more. Have you ever gone to start a large project at your desk, and realized that you needed to clear up an area before going to work? The kernel needs the same kind of "working space" for its own housekeeping.
Based on the amount of RAM (and a few other factors), the kernel comes up with a heuristic for the amount of memory that it feels comfortable with as its working space (this watermark is exposed to userspace in /proc/sys/vm/min_free_kbytes). This watermark is interpreted and spread out among the different areas of memory on the system. When it reaches this watermark for any one area, the kernel starts to reclaim memory from the different uses described above.
SwapTotal: 1992052 kB SwapFree: 1167632 kB
When the kernel decides not to get memory from any of the other sources we've described so far, it starts to swap. During this process it takes user application data and writes it to a special place (or places) on the disk. You might think that this should only be a last resort once we are completely unable to free any of the other types of RAM. However, the kernel does not do it this way. Why?
Consider an application like /sbin/init. It has some incredibly important duties like setting up the system at startup and respawning login prompts if they die. But, how much of its data is actually used during the normal runtime of the system? If the system is at its limits and just about out of RAM, should we swap out a completely unused since bootup page of /sbin/init's data and use that page for page cache? Or should we keep /sbin/init completely in memory and force the potential page cache user to go to the disk?
The kernel will often choose to swap out /sbin/init's data in favor of the current needs of the currently running applications. For this reason, even a system with vast amounts of RAM (even when properly tuned) can swap. There are lots of pages of memory which are user application data, but are rarely used. All of these are targets for being swapped in favor of other uses for the RAM.
But, if the mere presence of used swap is not evidence of a system which has too little RAM for its workload, what is? As you can see, swap is most efficiently used for data which will not be accessed for a long time. If data in swap is being constantly accessed, then it is failing to be used effectively. We can monitor the amount of data going in and out of swap with the vmstat command. The following will produce output every 5 seconds:
$ vmstat 5 procs -----------memory---------- ---swap--- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 0 833704 54824 25196 328672 10 0 343 18 510 1382 96 4 0 0 6 0 833704 54556 25092 324584 0 0 333 22 504 1180 93 7 0 0 4 0 833704 51516 25112 320856 33 0 315 19 508 1234 95 5 0 0 3 0 833704 54836 24984 314404 6 0 223 27 498 1191 95 5 0 0 3 0 833704 53072 24944 307844 4 0 216 22 518 1375 96 4 0 0 5 0 833704 53928 24888 304076 6 0 262 18 548 1665 94 6 0 0 3 4 843964 50192 184 58064 16 2416 16 2464 570 1451 78 22 0 0 3 7 908244 48756 224 47760 118 13645 149 13664 730 1245 76 16 0 8 3 2 922064 54280 340 49228 1470 2838 1817 2865 711 1481 88 12 0 0 4 2 932644 54068 424 52204 1972 2195 2596 2211 678 1388 90 10 0 0 2 3 944012 56304 492 52292 2986 2591 3063 2615 735 1562 89 11 0 0 2 4 957304 54604 572 51964 4042 3414 4096 3438 852 1808 88 12 0 0 ...
The columns we are most interested in are "si" and "so" which are abbreviations for "swap in" and "swap out". You can interpret them this way:
- A small "swap in" and "swap out" is normal, and indicates that there is little need for application data which is currently in swap, and that any new memory needs are being dealt with with means other than swapping application data.
- A large "swap in" with a small "swap out" is usually indicative that a swapped out application is now starting to run again, and needs to get its data back from the disk.
- A large "swap out" with a small "swap in" is usually indicative that an application is in need of some kind RAM (could be any of the caches, or application data), and is swapping out old application data in order to get that RAM.
- A large "swap out" with a large "swap in" is generally the condition which you want to avoid. It means that the system is "thrashing" or that it is needing new RAM as fast as it can swap out application data. This often means that the application needing the RAM has forced out all of the truly old data and has started to force data in active use out to swap. Those "active use data" will be immediately read back in from swap, causing both "swap in" and "swap out" to be elevated, and roughly equal.
The above vmstat example shows a normal running system which then has a very, very large memory using application start up.
The swap cache is very similar in concept to the page cache. A page of user application data written to disk is very similar to a page of file data on the disk. Any time a page is read in from swap ("si" in vmstat), it is placed in the swap cache. Just like the page cache, this is a bet on the kernel's part. It is betting that we might need to swap this page out _again_. If that need arises, we can detect that there is already a copy on the disk and simply throw the page in memory away immediately. This saves us the cost of re-writing the page to the disk.
The swap cache is really only useful when we are reading data from swap and never writing to it. If we write to the page, the copy on the disk is no longer in sync with the copy in memory. If this happens, we have to write to the disk to swap the page out again, just like we did the first time. However, the cost of saving _any_ writes to disk is great, and even with only a small portion of the swap cache ever written to, the system will perform better.
Another operation that occurs when we start to run out of memory is the writing of dirty ("Dirty:" from meminfo) data to disk. Dirty data is page cache to which a write has occurred. Before we can free that page cache, we must first update the original copy on disk with the data from the write. As the system dips below its min_free_kbytes value, the system will attempt to free page cache. When freeing page cache, it is very common to find such dirty pages and the kernel will initiate these writes whenever it finds these pages. You can see this happening when "Dirty:" decreases at the same time as "bo" (Blocks written Out) from vmstat goes up.
The kernel may request that many pages be written to the disk in parallel. This speeds disk operations up by batching them together, or spanning them across several disks. When the kernel is actively trying to update on-disk data for a page, it will increment the meminfo "Writeback:" entry for that page.
The "sync" command will force all dirty data to be written out and "Dirty:" to drop to a very low value momentarily.
1. except for O_DIRECT 2. Need a link explaining high/lowmem
Input from John Stultz, Dave Kleikamp Tim Pepper