When using network attached storage, Linux and other OSes are vulnerable to a fundamental deadlock problem. The premise of the problem is that one may need memory in order to free memory, and freeing memory is done mostly when there is very little free memory left. The problem can happen with normally written files, MAP_SHARED mmaped files and swap, but is most likely to happen with the latter two. It can hit with NFS, iSCSI, AoE or any other network storage protocol, since they all rely on the network layer.

The bug can get triggered as follows:

 * the system is low on free memory
 * as a result, kswapd starts trying to free up memory by evicting pages
 * if the pages are dirty, they have to be written back to storage
 * writing pages back to storage over the network requires allocating memory:
  * memory for packet headers (if the NIC can assemble the packet itself)
  * memory for entire packets (if it cannot)
 * kswapd can get this memory, because it has higher priority and the system sets aside something extra for the pageout code
 * now the system is ''really'' low on memory; it may even have no free memory left at all /!\
 * the NAS appliance receives the write request from the computer
 * the NAS appliance sends back an ACK packet acknowledging that the data was received
 * the NAS appliance sends back a packet notifying the OS that the data was written to disk
 * however, at this point the OS may not have any memory left to receive these packets from the NAS
 * the OS never knows whether the I/O has completed, since it cannot receive any more network packets
 * even if it can still receive packets, memory could be filled up with packets from other connections /!\
 * the computer deadlocks

Note that locally attached disks do not have this deadlock, because Linux has reserves of buffer heads and other data structures needed to start disk I/O. Using these reserves, the system can pull itself away from deadlock when normal memory allocations would fail.

== Proposal for a solution ==

This solution is built around two concepts:

 * IP networks are lossy anyway, so we can throw away non-critical packets;
 * we can use a reserved memory pool to avoid such deadlocks, provided the memory pool is only used for the right network traffic.

We can identify which network traffic should and should not be able to use these mempools by setting a special flag on the memory critical network sockets, e.g. SOCK_MEMALLOC.

At packet send time, if the normal memory allocation fails and the current socket is flagged SOCK_MEMALLOC, we can allocate the network buffers from the memory pool reserved for these situations. The network buffer needs to be flagged so that, when it is freed, it goes back into this pool. (A sketch of this fallback logic appears at the end of this section.)

Packet receive time is harder, since at the time a packet is received we do not yet know which socket it is for. Once memory runs out, we will have to do an allocation from the reserved memory pool for any incoming packet. However, networking is lossy. This means that when we (later) find out that the packet was not for one of the SOCK_MEMALLOC sockets, we can just drop it and pretend we never received it. The sending host will retransmit it, so everything will be fine. (See the second sketch below.)

Dropping packets for non-SOCK_MEMALLOC sockets may need some modifications to certain parts of the network stack, but if it makes it possible to run Linux hosts stably from just iSCSI or NFS root, that is well worth the hassle IMHO...
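
To make the send-time fallback concrete, here is a minimal userspace C sketch of the allocation policy described above. It is an illustration, not kernel code: the real implementation would deal in sk_buffs and GFP flags, and all names here (struct buf, sock_buf_alloc(), reserve[]) are invented for this example rather than an existing API.

{{{
#include <stdbool.h>
#include <stdlib.h>

#define SOCK_MEMALLOC  0x1          /* stand-in for the proposed socket flag */
#define RESERVE_SLOTS  16           /* size of the emergency pool */
#define BUF_SIZE       2048

struct buf {
    bool from_reserve;              /* remember where to free to */
    char data[BUF_SIZE];
};

struct sock {
    unsigned flags;
};

/* pre-allocated emergency pool, filled once while memory is plentiful */
static struct buf *reserve[RESERVE_SLOTS];
static int reserve_top;

static void reserve_init(void)
{
    while (reserve_top < RESERVE_SLOTS)
        reserve[reserve_top++] = malloc(sizeof(struct buf));
}

/* normal allocation; in the kernel this would be the usual skb path */
static struct buf *buf_alloc_normal(void)
{
    struct buf *b = malloc(sizeof(struct buf));
    if (b)
        b->from_reserve = false;
    return b;
}

/* Allocate a network buffer for a socket. Fall back to the reserve
 * only when the normal allocation fails AND the socket is flagged
 * memory-critical. */
static struct buf *sock_buf_alloc(struct sock *sk)
{
    struct buf *b = buf_alloc_normal();
    if (b)
        return b;
    if (!(sk->flags & SOCK_MEMALLOC) || reserve_top == 0)
        return NULL;                /* ordinary traffic just fails */
    b = reserve[--reserve_top];
    b->from_reserve = true;         /* tag so the free path knows */
    return b;
}

/* Freeing a tagged buffer refills the pool instead of going back to
 * the general allocator, so the reserve stays at full strength. */
static void sock_buf_free(struct buf *b)
{
    if (b->from_reserve)
        reserve[reserve_top++] = b;
    else
        free(b);
}
}}}

The key design point is the from_reserve tag: without it, reserve memory would leak into the general allocator on free and the pool would shrink under exactly the pressure it exists to survive.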
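
The receive side can be sketched the same way, reusing the types and pool from the previous block. Again this only illustrates the drop-or-deliver decision; rx_deliver() is a hypothetical hook standing in for the point where the stack has matched a packet to a socket.

{{{
/* Receive side: at interrupt time we do not yet know the destination
 * socket, so under memory pressure every incoming packet may end up
 * backed by reserve memory. */
static struct buf *rx_buf_alloc(void)
{
    struct buf *b = buf_alloc_normal();
    if (b)
        return b;
    if (reserve_top == 0)
        return NULL;                /* truly out: the packet is lost */
    b = reserve[--reserve_top];
    b->from_reserve = true;
    return b;
}

/* Called once the packet has been matched to a socket. Reserve-backed
 * packets for ordinary sockets are dropped; the sender retransmits. */
static bool rx_deliver(struct sock *sk, struct buf *b)
{
    if (b->from_reserve && !(sk->flags & SOCK_MEMALLOC)) {
        sock_buf_free(b);           /* refills the reserve */
        return false;               /* pretend we never received it */
    }
    /* ... normal delivery to the socket's receive queue ... */
    return true;
}
}}}

Dropping in rx_deliver() is safe precisely because of the lossy-network argument above: to the sender, a packet discarded here is indistinguishable from one lost on the wire.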