When using network-attached storage, Linux and other OSes are vulnerable to a fundamental deadlock problem. The premise of the problem is that one may need memory in order to free memory, and freeing memory is done mostly when there is very little free memory left.
The problem can happen with normally written files, MAP_SHARED mmapped files and swap, but is most likely to happen with the latter two. It can hit with NFS, iSCSI, AoE or any other network storage protocol, since they all rely on the network layer.
The bug can get triggered as follows:
writing pages back to storage over the network requires allocating memory
kswapd can get this memory, because it has higher priority and the system sets aside something extra for the pageout code
now the system is really low on memory, it may even have no free memory left at all
the NAS appliance receives the write request from the computer
however, at this point the OS may not have any memory left to receive the reply packets from the NAS
the OS never knows whether the I/O has completed, since it cannot receive any more network packets
Note that locally attached disks do not have this deadlock because Linux has reserves of buffer heads and other data structures needed to start disk I/O. Using these reserves the system can pull itself away from deadlock when normal memory allocations would fail.
Proposal for a solution
This solution is built around two concepts:
IP networks are lossy anyway, so we can throw away non-critical packets;
we can use a reserved memory pool to avoid such deadlocks, provided the memory pool is only used for the right network traffic.
We can identify what network traffic should and should not be able to use these mempools by setting a special flag on the memory-critical network sockets, e.g. SOCK_MEMALLOC.
At packet send time, if the normal memory allocation fails and the current socket is flagged SOCK_MEMALLOC, we can allocate the network buffers from the memory pool reserved for these situations. The network buffer needs to be flagged so that, when it is freed, it goes back into this pool.
Packet receive time is harder, since at the time a packet is received we do not yet know which socket it is for. Once memory runs out we will have to do an allocation from the reserved memory pool for any incoming packet. However, networking is lossy. This means that when we (later) find out that the packet was not for one of the SOCK_MEMALLOC sockets, we can just drop it and pretend we never received it. The sending host will retry it, so everything will be fine.
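The send-side fallback can be sketched in user-space C roughly as follows. This is an illustration under stated assumptions, not kernel code: `struct net_buf`, `struct sock` with its `memalloc` field (standing in for the proposed SOCK_MEMALLOC flag), `send_buf_alloc()`, `send_buf_free()`, the pool size, and the `fail_normal` switch used to simulate memory exhaustion are all hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_BUFS 4     /* hypothetical size of the emergency pool */
#define BUF_SIZE     2048  /* hypothetical per-buffer payload size */

struct net_buf {
    bool from_reserve;             /* flag: buffer must return to the pool */
    unsigned char data[BUF_SIZE];
};

struct sock {
    bool memalloc;                 /* stands in for SOCK_MEMALLOC */
};

/* Emergency pool: fixed storage plus a stack of free entries. */
struct net_buf reserve_pool[RESERVE_BUFS];
struct net_buf *reserve_free[RESERVE_BUFS];
size_t reserve_count;

/* Test hook: when true, the normal allocator behaves as if memory ran out. */
bool fail_normal;

void reserve_init(void)
{
    reserve_count = 0;
    for (size_t i = 0; i < RESERVE_BUFS; i++)
        reserve_free[reserve_count++] = &reserve_pool[i];
}

static struct net_buf *normal_alloc(void)
{
    if (fail_normal)
        return NULL;
    return malloc(sizeof(struct net_buf));
}

/* Try the normal allocator first; fall back to the reserve only for
 * sockets flagged as memory critical. */
struct net_buf *send_buf_alloc(const struct sock *sk)
{
    struct net_buf *buf = normal_alloc();

    if (buf) {
        buf->from_reserve = false;
        return buf;
    }
    if (!sk->memalloc || reserve_count == 0)
        return NULL;               /* ordinary traffic just fails here */

    buf = reserve_free[--reserve_count];
    buf->from_reserve = true;
    return buf;
}

/* Free a buffer, returning reserve buffers to the pool they came from. */
void send_buf_free(struct net_buf *buf)
{
    if (!buf)
        return;
    if (buf->from_reserve) {
        assert(reserve_count < RESERVE_BUFS);
        reserve_free[reserve_count++] = buf;
    } else {
        free(buf);
    }
}
```

The point of the `from_reserve` flag is that the pool refills itself as writeback completes, so a small reserve may be enough as long as only flagged sockets can draw from it.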
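The receive-side policy described above (allocate from the reserve first, find the owning socket later, drop if it is not memory critical) might look something like this toy model. All names here are hypothetical and chosen for the example: `struct rx_sock`, `receive_packet()`, the `memory_exhausted` switch, and the reserve counter are assumptions, and sockets are identified only by a destination port to keep the demultiplexing step trivial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESERVE_BUFS 2  /* hypothetical number of reserved receive buffers */

struct rx_sock {
    int  port;
    bool memalloc;      /* stands in for SOCK_MEMALLOC */
};

enum rx_result {
    RX_DELIVERED,
    RX_DROPPED_NO_MEMORY,     /* even the reserve was empty */
    RX_DROPPED_NOT_CRITICAL,  /* reserve buffer, but socket is not flagged */
    RX_DROPPED_NO_SOCKET      /* nobody is listening on that port */
};

size_t reserve_in_use;    /* reserve buffers currently handed out */
bool   memory_exhausted;  /* test hook: normal allocations fail when true */

static const struct rx_sock *lookup_socket(const struct rx_sock *socks,
                                           size_t n, int port)
{
    for (size_t i = 0; i < n; i++)
        if (socks[i].port == port)
            return &socks[i];
    return NULL;
}

enum rx_result receive_packet(const struct rx_sock *socks, size_t n,
                              int dst_port)
{
    bool from_reserve = false;

    /* Step 1: a buffer is needed before the owning socket is known, so
     * under memory pressure every packet starts in a reserve buffer. */
    if (memory_exhausted) {
        if (reserve_in_use == RESERVE_BUFS)
            return RX_DROPPED_NO_MEMORY;
        reserve_in_use++;
        from_reserve = true;
    }

    /* Step 2: demultiplex to find out which socket the packet is for. */
    const struct rx_sock *sk = lookup_socket(socks, n, dst_port);

    /* Step 3: packets that borrowed reserve memory but do not belong to
     * a flagged socket are dropped; the sender is expected to retry. */
    if (from_reserve && (!sk || !sk->memalloc)) {
        reserve_in_use--;
        return sk ? RX_DROPPED_NOT_CRITICAL : RX_DROPPED_NO_SOCKET;
    }
    if (!sk)
        return RX_DROPPED_NO_SOCKET;

    /* Delivered. This model assumes the socket consumes the packet at
     * once, so a borrowed reserve buffer is returned immediately. */
    if (from_reserve)
        reserve_in_use--;
    return RX_DELIVERED;
}
```

With one flagged socket on port 3260 (the iSCSI port) and one unflagged socket on port 80, a packet for port 80 is delivered while memory is available but dropped once `memory_exhausted` is set, whereas packets for port 3260 keep getting through.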
Dropping packets for non-SOCK_MEMALLOC sockets may need some modifications to certain parts of the network stack, but if it makes it possible to run Linux hosts stably from just iSCSI or NFS root, that is well worth the hassle IMHO...
Daniel Phillips has a patch available that implements a lot of what's needed.
Potential problems with this solution
Know a solution or workaround to any of these problems? Please let riel(at)surriel(dot)com know or edit this page directly.
the protocol could be multiplexed over one TCP/IP connection
may be problematic if there is so much traffic that the swap/VM IO can be drowned in other traffic
not a problem? other IO can complete on the same socket during our swap IO, but we already have the memory allocated on which we do that other IO
problem? what if there is a protocol that needs us to allocate memory to process other incoming data, say block invalidations?
Network I/O protocol enhancements
The life of operating systems could potentially be made easier with some protocol enhancements:
the client can tell the server "I am out of memory, send me ACKs only" to avoid having to process megabytes of in-progress read I/O while out of memory

