When using network-attached storage, Linux and other OSes are vulnerable to a fundamental deadlock problem. The premise of the problem is that one may need memory in order to free memory, and memory is mostly freed exactly when there is very little free memory left.
The problem can happen with ordinary file writes, MAP_SHARED mmapped files and swap, but is most likely to happen with the latter two. It can hit with NFS, iSCSI, AoE or any other network storage protocol, since they all rely on the network layer.
The bug can get triggered as follows:
- the system is low on free memory
- as a result, kswapd starts trying to free up memory by evicting pages
- if the pages are dirty, they have to be written back to storage
- writing pages back to storage over the network requires allocating memory (a sketch of this allocation step follows the list)
  - memory for packet headers (if the NIC can assemble the packet itself)
  - memory for entire packets (if the NIC cannot)
- kswapd can get this memory, because it has higher priority and the system sets aside something extra for the pageout code
- now the system is really low on memory; it may even have no free memory left at all /!\
- the NAS appliance receives the write request from the computer
- the NAS appliance sends back an ACK packet acknowledging that the data was received
- the NAS appliance sends back a packet notifying the OS that the data was written to disk
- however, at this point the OS may not have any memory left to receive these packets from the NAS
- the OS never knows whether the I/O has completed, since it cannot receive any more network packets
- even if it can still receive packets, memory could be filled up with packets from other connections /!\
- the computer deadlocks
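To make the allocation step concrete, here is a minimal sketch of a network pageout send path, in the spirit of the scenario above. The queue_for_transmit() helper is hypothetical and real send paths differ per protocol, but all of them must allocate an sk_buff before anything goes on the wire:

    #include <linux/skbuff.h>
    #include <net/sock.h>
    #include <net/tcp.h>

    /* Sketch: writing a page back over the network starts with a fresh
     * sk_buff allocation.  Under memory pressure this allocation can
     * fail, stalling exactly the I/O that was meant to free memory. */
    static int send_pageout_packet(struct sock *sk, void *data, size_t len)
    {
        struct sk_buff *skb;

        /* GFP_ATOMIC: the pageout path may not be allowed to sleep */
        skb = alloc_skb(len + MAX_TCP_HEADER, GFP_ATOMIC);
        if (!skb)
            return -ENOMEM; /* no memory, so the pageout cannot even start */

        skb_reserve(skb, MAX_TCP_HEADER);  /* leave room for protocol headers */
        skb_put_data(skb, data, len);      /* copy the page contents in */

        return queue_for_transmit(sk, skb);  /* hypothetical helper */
    }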
Note that locally attached disks do not have this deadlock, because Linux keeps reserves of buffer heads and the other data structures needed to start disk I/O. Using these reserves, the system can pull itself out of the deadlock when normal memory allocations would fail.
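The kernel's mempool API is the usual way to implement such reserves; struct bio allocations, for example, already come from a mempool so that block I/O can always make forward progress. A minimal sketch of the pattern (struct io_request and the pool size are made up for illustration):

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    /* Hypothetical per-request structure, standing in for buffer heads
     * and friends. */
    struct io_request {
        struct page *page;
        int rw;
    };

    static struct kmem_cache *io_req_cache;
    static mempool_t *io_req_pool;

    static int __init io_reserve_init(void)
    {
        io_req_cache = KMEM_CACHE(io_request, 0);
        if (!io_req_cache)
            return -ENOMEM;

        /* Keep at least 16 requests in reserve; mempool_alloc() falls
         * back to this reserve when the slab is exhausted. */
        io_req_pool = mempool_create_slab_pool(16, io_req_cache);
        if (!io_req_pool)
            return -ENOMEM;
        return 0;
    }

    static struct io_request *get_io_request(void)
    {
        /* GFP_NOIO: never recurse into I/O to satisfy this allocation */
        return mempool_alloc(io_req_pool, GFP_NOIO);
    }

    static void put_io_request(struct io_request *req)
    {
        mempool_free(req, io_req_pool); /* refills the reserve first */
    }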
Proposal for a solution
This solution is built around two concepts:
- IP networks are lossy anyway, so we can throw away non-critical packets;
- we can use a reserved memory pool to avoid such deadlocks, provided the memory pool is only used for the right network traffic.
We can identify which network traffic should and should not be able to use these mempools by setting a special flag on the memory-critical network sockets, e.g. SOCK_MEMALLOC.
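A minimal sketch of how such a flag could be set, assuming SOCK_MEMALLOC is added to the sock flags enum and a (likewise proposed) __GFP_MEMALLOC flag grants access to the reserves:

    #include <net/sock.h>

    /* Mark a socket as memory-critical: allocations done on its behalf
     * may dip into the reserved pool.  SOCK_MEMALLOC and __GFP_MEMALLOC
     * are the proposed additions, not existing symbols here. */
    static void sk_set_memalloc(struct sock *sk)
    {
        sock_set_flag(sk, SOCK_MEMALLOC);
        sk->sk_allocation |= __GFP_MEMALLOC;
    }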
At packet send time, if the normal memory allocation fails and the current socket is flagged SOCK_MEMALLOC, we can allocate the network buffers from the memory pool reserved for these situations. The network buffer needs to be flagged so that, when it is freed, it goes back into this pool.
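Sketched in code, the send-time fallback could look like the following (skb_reserve_pool and the from_reserve marker are made-up names for the proposed reserve and flag):

    #include <linux/mempool.h>
    #include <linux/skbuff.h>
    #include <net/sock.h>

    static mempool_t *skb_reserve_pool;  /* the proposed reserve pool */

    /* Try a normal allocation first; only memory-critical sockets may
     * fall back to the reserve when that fails. */
    static struct sk_buff *alloc_send_skb(struct sock *sk, unsigned int size)
    {
        struct sk_buff *skb = alloc_skb(size, sk->sk_allocation);

        if (skb)
            return skb;
        if (!sock_flag(sk, SOCK_MEMALLOC))
            return NULL;  /* ordinary socket: fail as usual */

        /* Take a buffer from the reserve and flag it (from_reserve is
         * a proposed sk_buff field), so freeing it refills the pool. */
        skb = mempool_alloc(skb_reserve_pool, GFP_ATOMIC);
        if (skb)
            skb->from_reserve = 1;
        return skb;
    }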
Packet receive time is harder, since at the time a packet is received we do not yet know which socket it is for. Once memory runs out, we will have to do an allocation from the reserved memory pool for any incoming packet. However, networking is lossy. This means that when we (later) find out that the packet was not for one of the SOCK_MEMALLOC sockets, we can just drop it and pretend we never received it. The sending host will retransmit it, so everything will be fine.
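The receive-side check could then sit at the point where the packet has finally been matched to a socket; a sketch, reusing the made-up from_reserve field from the previous example:

    /* Runs once the packet has been matched to a socket: drop it if it
     * consumed reserve memory but the socket is not memory-critical. */
    static int sock_queue_rcv_checked(struct sock *sk, struct sk_buff *skb)
    {
        if (skb->from_reserve && !sock_flag(sk, SOCK_MEMALLOC)) {
            /* Pretend the packet was lost on the wire; the sender
             * will retransmit once memory pressure has eased. */
            kfree_skb(skb);
            return -ENOMEM;
        }
        return sock_queue_rcv_skb(sk, skb);
    }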
Dropping packets for non-SOCK_MEMALLOC sockets may need some modifications to certain parts of the network stack, but if it makes it possible to run Linux hosts stably from just iSCSI or NFS root, that is well worth the hassle IMHO...
Daniel Phillips has a patch available that implements a lot of what's needed.
Potential problems with this solution
Know a solution or workaround to any of these problems? Please let riel(at)surriel(dot)com know or edit this page directly.
- fragments
- most OSes send fragments back-to-front (last fragment first)
- you need to have all the fragments of a packet before you know whether or not you can discard them
- this could be quite a bit of memory
- possible solution: if we just allocated the last buffer from the mempool, received a fragment and the packet is not yet complete, we drop all fragments of this packet (sketched in code below)
- workaround: use smaller packets to/from your NAS box
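For completeness, the fragment heuristic above in sketch form; every name here (frag_queue, reserve_low(), drop_frag_queue()) is illustrative rather than an existing kernel interface:

    /* If we had to dip into the last of the reserve and the datagram is
     * still incomplete, give up on it entirely rather than letting a
     * half-assembled packet pin reserve memory indefinitely. */
    static void on_fragment_received(struct frag_queue *q)
    {
        if (reserve_low() && !frag_queue_complete(q))
            drop_frag_queue(q);  /* frees all fragments collected so far */
    }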
- layered/multiplexed protocols
- the protocol could be multiplexed over one TCP/IP connection
- may be problematic if there is so much traffic that the swap/VM IO can be drowned in other traffic
- iSCSI can have this problem
- not a problem? other IO can complete on the same socket during our swap IO, but we already have the memory allocated on which we do that other IO
- problem? what if there is a protocol that needs us to allocate memory to process other incoming data, say block invalidations?
- needs mempool for such protocol handling?
- unfixable?
- encrypted traffic
- you may need out-of-band traffic (e.g. key exchange) before you can receive the ACK
- possibly even renegotiation in userspace
- these events do not happen very often, so maybe can be ignored initially?
- DHCP
- same problems as encrypted traffic
Network I/O protocol enhancements
Life could potentially be made easier for operating systems with some protocol enhancements:
- the client can tell the server "I am out of memory, send me ACKs only" to avoid having to process megabytes of in-progress read I/O while out of memory