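An OOM (Out Of Memory) event is what happens when the kernel runs out of memory. The kernel then starts killing processes (chosen by a heuristic, though the choice can look random) and writes a lot of diagnostic output to the kernel log, which you can read with dmesg.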
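As a rough sketch of how to spot that logging, the kernel log can be filtered for phrases the OOM killer commonly prints. The helper name find_oom_events is made up for illustration, and the exact message wording differs between kernel versions, so treat the patterns as an assumption rather than a complete list.

```shell
# find_oom_events: hypothetical helper, not part of any standard tool.
# Reads kernel log text on stdin and prints the lines that look like
# OOM killer activity.
# Assumption: these patterns cover the usual wording; it varies between
# kernel versions.
find_oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | find_oom_events
```

If this prints nothing, the process may have died for some other reason, or the messages may already have scrolled out of the kernel log buffer.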

How do I debug an OOM?

Read this page. Look at all the causes of OOM events, and try to figure out which of the listed causes your OOM falls under. Remember, very few OOM events are genuine kernel bugs. Virtually all of them are caused by user applications that are behaving badly.

What leads up to an OOM?

Generally, the system is lazy about reclaiming memory, preferring to let it lie about in caches until there is a genuine need. So it's not unusual to see memory usage grow and not shrink if there are no new requests for memory. When a request comes in, the system may choose to release some memory that nobody is using to satisfy the request, or it may move data that is still in use out to swap space and hand over the now-available memory. If that data on swap space is ever needed again, it will displace some other piece of disused memory. An OOM actually occurs when this process of replacing things is thought to have stopped making progress.
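To see this laziness on a running system, one option is to compare free memory with cache and swap figures from /proc/meminfo. The following is a minimal sketch; the helper name mem_summary and the choice of fields are assumptions made for illustration. A low MemFree next to a large Cached value is the normal state described above, not by itself a sign of trouble.

```shell
# mem_summary: hypothetical helper; prints selected /proc/meminfo fields in MB.
# An optional first argument names another file to read (for example, a
# saved copy of /proc/meminfo).
mem_summary() {
    awk '$1 ~ /^(MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree):$/ {
        # Field 1 is the name with a trailing colon; field 2 is the value in kB.
        sub(":", "", $1)
        printf "%-10s %8d MB\n", $1, $2 / 1024
    }' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_summary
```

Watching SwapFree shrink while MemFree and Cached both stay small is one hint that reclaim is struggling to find anything left to release.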

If things get tight, whole processes are killed on the theory that that will free up gobs of memory. This is not a completely desirable solution, but it does (in theory) allow the system to keep running. In practice, however, people usually object to any of their processes being involuntarily terminated, and this is usually the point at which the problem comes to us.

What causes these OOM events?

awk '{printf "%5d MB %s\n", $3*$4/(1024*1024), $1}' < /proc/slabinfo | sort -n

Run this script during your test and during the OOM. Send the output to a VM expert and have them parse it. Then come back and update this page. ;)
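For repeated sampling, a variant of the one-liner can be wrapped in a function that skips the slabinfo header lines and can also read a saved copy. This is only a sketch: the name slab_usage is made up, and it assumes the slabinfo layout in which field 3 is the number of objects and field 4 is the object size in bytes.

```shell
# slab_usage: hypothetical helper; prints approximate memory per slab cache,
# smallest first. Optional first argument: a saved copy of /proc/slabinfo.
# Assumption: field 3 is num_objs and field 4 is objsize in bytes.
slab_usage() {
    awk '
        # Skip header lines ("slabinfo - version ..." and "# name ...").
        /^slabinfo/ || /^#/ { next }
        { printf "%5d MB %s\n", $3 * $4 / (1024 * 1024), $1 }
    ' "${1:-/proc/slabinfo}" | sort -n
}

# Usage (reading /proc/slabinfo may require root):
#   while sleep 10; do date; slab_usage | tail -n 5; done >> slab.log
```

Logging the largest few caches at intervals makes it easier to see which one, if any, keeps growing as the OOM approaches.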

LinuxMM: OOM (last edited 2009-03-16 19:40:27 by pool-74-107-144-220)