When using network attached storage, Linux and other OSes are vulnerable to a fundamental deadlock problem. The premise of the problem is that one may need memory in order to free memory, and freeing memory is done mostly when there is very little free memory left.
The problem can happen with normally written files, MAP_SHARED mmapped files and swap, but is most likely to happen with the latter two. It can hit with NFS, iSCSI, AoE or any other network storage protocol, since they all rely on the network layer.
Note that locally attached disks do not have this deadlock because Linux has reserves of buffer heads and other data structures needed to start disk I/O. Using these reserves the system can pull itself away from deadlock when normal memory allocations would fail.
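The reserve idea for local disks can be illustrated with a small userspace sketch (names like `alloc_buf` and the pool sizes are invented for illustration, not real kernel APIs): a fixed set of buffers is set aside up front, handed out only when the normal allocator fails, and returned to the reserve rather than the general heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical emergency reserve: a few preallocated buffers that
 * are only handed out when a normal allocation fails. */
#define RESERVE_SLOTS 4
#define BUF_SIZE      2048

static void *reserve[RESERVE_SLOTS];
static int   reserve_free = RESERVE_SLOTS;

static void reserve_init(void)
{
    for (int i = 0; i < RESERVE_SLOTS; i++)
        reserve[i] = malloc(BUF_SIZE);
}

/* Simulate memory pressure: when 'oom' is set, pretend the normal
 * allocator has nothing left and dip into the reserve instead. */
static void *alloc_buf(bool oom)
{
    if (!oom)
        return malloc(BUF_SIZE);
    if (reserve_free > 0)
        return reserve[--reserve_free];
    return NULL; /* even the reserve is exhausted */
}

static void free_to_reserve(void *buf)
{
    /* A buffer taken from the reserve goes back into it,
     * replenishing the pool for the next emergency. */
    reserve[reserve_free++] = buf;
}
```

Because freed reserve buffers always refill the pool, the system can keep starting I/O (and thus freeing memory) even while ordinary allocations fail. The real kernel uses mempools and reserved buffer heads for this; the sketch only shows the shape of the mechanism.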
Proposal for a solution
At packet send time, if the normal memory allocation fails and the current socket is flagged SOCK_MEMALLOC, we can allocate the network buffers from the memory pool reserved for these situations. The network buffer needs to be flagged so that, when it is freed, it goes back into this pool.
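A sketch of that send-path policy, under stated assumptions (the `sk_buff_sim` struct, `send_alloc`/`send_free` and the pool counter are invented stand-ins, not the real sk_buff API): only a SOCK_MEMALLOC socket may fall back to the reserve, and a buffer remembers its origin so freeing it refills the pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SOCK_MEMALLOC 0x1      /* socket may use the emergency pool */

struct sk_buff_sim {           /* simplified stand-in for a network buffer */
    bool from_reserve;         /* flag: return this buffer to the pool */
    char data[256];
};

static int reserve_left = 2;   /* tiny emergency pool, counted in buffers */

static struct sk_buff_sim *send_alloc(unsigned sock_flags, bool normal_alloc_failed)
{
    if (!normal_alloc_failed) {
        struct sk_buff_sim *skb = malloc(sizeof *skb);
        skb->from_reserve = false;
        return skb;
    }
    /* Normal allocation failed: only memalloc sockets may use the pool. */
    if ((sock_flags & SOCK_MEMALLOC) && reserve_left > 0) {
        struct sk_buff_sim *skb = malloc(sizeof *skb); /* stands in for pool memory */
        reserve_left--;
        skb->from_reserve = true;
        return skb;
    }
    return NULL; /* ordinary traffic simply fails under pressure */
}

static void send_free(struct sk_buff_sim *skb)
{
    if (skb->from_reserve)
        reserve_left++;        /* flagged buffer goes back into the pool */
    free(skb);
}
```

The key property is that traffic needed to make reclaim progress (writeback to NFS or iSCSI) keeps flowing, while unrelated traffic cannot drain the reserve.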
Packet receive time is harder, since at the time a packet is received we do not yet know which socket it is for. Once memory runs out we will have to do an allocation from the reserved memory pool for any incoming packet. However, networking is lossy. This means that when we (later) find out that the packet was not for one of the SOCK_MEMALLOC sockets, we can just drop it and pretend we never received it. The sending host will retransmit it, so everything will be fine.
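The receive-side logic can be sketched like this (again with invented names; `rx_demux` is a hypothetical stand-in for the point where the stack learns the destination socket): under pressure every packet is buffered from the reserve, and once demuxed, packets for ordinary sockets are dropped so their buffer returns to the pool immediately.

```c
#include <assert.h>
#include <stdbool.h>

#define SOCK_MEMALLOC 0x1

static int reserve_left = 1;   /* a single emergency buffer, for the sketch */

/* Returns true if the packet is delivered, false if dropped.
 * Under memory pressure the buffer must come from the reserve,
 * because at receive time the destination socket is unknown. */
static bool rx_demux(unsigned dest_sock_flags)
{
    if (reserve_left == 0)
        return false;          /* nothing left at all: drop on the floor */
    reserve_left--;            /* packet buffered from the reserve */
    if (dest_sock_flags & SOCK_MEMALLOC)
        return true;           /* needed to make reclaim progress: deliver */
    reserve_left++;            /* ordinary socket: drop, refill the pool */
    return false;
}
```

Dropping is safe precisely because the transport layer already tolerates loss: the peer retransmits, and the reserve is never tied up by traffic that cannot help free memory.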
Dropping packets for non-SOCK_MEMALLOC sockets may need some modifications to certain parts of the network stack, but if it makes it possible to run Linux hosts stably from just iSCSI or NFS root, that is well worth the hassle IMHO...
Daniel Phillips has a patch available that implements a lot of what's needed.