Just for my own sake, I'm calling every "memory pool"/zone/"resource group" a container here.

= Requirements =

 * Preserve current kernel global optimizations.
  * Sharing is ''good'' for the system overall.
 * A container must have some limit on the amount of memory it may use.
  * Limits must apply to unmapped page cache, anonymous memory, and shared memory.
  * "Use" can easily be attributed to the allocating container. It is harder to determine which container is ''currently'' or most actively using a page.
 * The overhead in storage, processing time, and dedicated lines of code added to the greater kernel should be minimized.
 * Memory which is private to the container (say, anonymous memory) must be strictly accounted to that container.
 * Must scale to large numbers of CPUs and in a NUMA environment.

= Icing on the Cake =

 * Limits would be nice to have on kernel structures.
  * But those limits should be imposed in ways that userspace can grasp. For instance, limit the number of fds, not the number of bytes needed for the kernel's 'struct file' objects.
 * Overcommitting memory should be allowed. We should not allow memory on a system to go completely unused.
 * Memory for files may be accounted to either the container or a shared pool.
  * Some care should be taken to ensure that a container may not abuse this shared pool.
  * It is preferable to determine when sharing is ''actually'' occurring, but approximate metrics should be OK. This requirement is very much secondary to any overhead it might add.
  * One useful idea would be the ability to bind a directory hierarchy to a particular memory container, so you could e.g. assign all of /lib to the "common" container.
 * Should allow runtime flexibility in the size and number of containers.
  * We should be able to change limits easily.
  * We should be able to create and destroy containers easily to satisfy the needs of application containers (not all containers are long-lived).
  * A task might want to change containers at runtime. This might be a database or web server which wants to "do work" for a particular set of users, but doesn't want the overhead of starting a whole new instance.
 * Perfect precision
  * Having every byte accounted for is nice, but a little bit of fuzziness is OK if it makes the problem easier to solve. This is, of course, unless that fuzziness can be exploited in a systematic way to get around the limits.

== Software Zones ==

Use the existing Linux zone model to create sets of contiguous memory. Each of these is a subset of a current 'struct zone'. Each container gets one or more of these zones from which to allocate its pages. Pages shared between containers are placed in centralized, "shared" zones.

Because this approach reuses the existing Linux structures, it can do things like page reclaim with the existing algorithms. It can also be done with the existing fake-NUMA and cpusets support, without substantial kernel changes.

However, each page still needs a page-to-"software zone" lookup mechanism, at least for returning the page to the proper allocator lists in free_page(). The nice part is that we already have a page-to-'struct zone' lookup via each node's node_zones[] array. However, substantially increasing the number of zones will substantially increase the number of bits in page->flags needed to do proper lookups, and it may become infeasible to use a simple array in the node for them.
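To make the lookup concern concrete, here is a rough sketch of what a page-to-software-zone lookup could look like, modeled on the way page_zone() resolves a page to its 'struct zone' today. The names SWZONE_BITS, SWZONE_SHIFT, struct swzone, node_swzones[], and struct container are all hypothetical illustrations, not existing kernel interfaces:

{{{
/*
 * Hypothetical sketch only: mirror today's page_zone() lookup, but for
 * "software zones".  A few bits of page->flags index into a per-node
 * array, here the invented node_swzones[]; struct swzone and
 * struct container do not exist in the kernel either.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>

#define SWZONE_BITS	6	/* assume at most 64 software zones per node */
#define SWZONE_MASK	((1UL << SWZONE_BITS) - 1)
/* assume the index is packed just below the node/zone bits in page->flags */
#define SWZONE_SHIFT	(BITS_PER_LONG - NODES_SHIFT - ZONES_SHIFT - SWZONE_BITS)

struct container;

struct swzone {
	struct zone		*parent;	/* the real zone this was carved from */
	struct container	*owner;		/* NULL for a centralized "shared" zone */
	struct free_area	free_area[MAX_ORDER];
};

static inline struct swzone *page_swzone(struct page *page)
{
	unsigned long idx = (page->flags >> SWZONE_SHIFT) & SWZONE_MASK;

	/* node_swzones[] would sit in pg_data_t, next to node_zones[] */
	return &NODE_DATA(page_to_nid(page))->node_swzones[idx];
}
}}}

With only six bits a 64-bit page->flags might still have room, but on 32-bit or with many nodes the bits simply are not there, which is exactly the concern above.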
== Static Page Ownership (the classic CKRM way, among others) ==

Add a pointer to 'struct page', and point it at an object representing the container which caused the page's allocation. Don't change this until the page is freed. Any other users of the page are not charged for it. (A rough sketch of this charging scheme appears after the comparison table below.)

== Partial Page Ownership (Beancounters?) ==

Make sure that any additional users get charged, even if they are not the "first" user. Multiple users within a single container should not be charged multiple times. The overhead of figuring this out exactly could be more costly than the other approaches.

== Only Count RSS ==

In this scenario, we only count a container's mapped pages. All of the accounting can be done with existing data structures (the rmap lists). When a container goes over its limit, the existing page reclaim algorithm can be used, with a modification to preferentially look for pages mapped by the container that is over its limit. The overhead here comes from looking at the rmap lists at map and unmap time to see whether this use is the first or last one for a container. The big disadvantage of this approach is that it ignores anything which isn't mapped.

----

|| || Software Zones || Static Page Ownership || Partial Page Ownership || Only Count RSS ||
|| enforces comprehensive memory limits || || || || Doesn't account for page cache; can not be extended to cover non-user-mapped memory use ||
|| code overhead || || || || ||
|| storage overhead || || Extra 'struct page' field || At least the cost of 'static' page ownership || ||
|| runtime overhead || || || || Walking the rmap chains might get expensive ||
|| resize at runtime || Physical contiguity requirement will inhibit growth. But if you have lots of small zones, and allow several to be assigned to a single container, you can resize reasonably easily || || || ||
|| creation at runtime || Must find a physically contiguous area to use; can not simply take a bit from each existing container || || || ||
|| recognize page sharing || Requires a "shared" zone || Doesn't recognize use by multiple containers, but could have a "shared" container || || ||
|| support overcommit || Overcommit is trickier because of the static assignment of zones to containers. But with a few minor hooks in the kernel (directed reclaim and OOM notifications), it's possible for userspace to juggle the zone assignments to wherever they're needed, allowing overcommit || || || ||
|| vulnerable to DOS attack || No charge for shared data access, so a container which stops using something can not cause another to go over its limit || Stop using shared data at an opportune time to force another container over its limit || Same as the static scheme || Containers get no credit for sharing, so no penalty when sharing goes away ||
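As promised above, here is a minimal sketch of the static-ownership bookkeeping, assuming a hypothetical 'struct container' with a simple page counter and limit, and an equally hypothetical container back-pointer in 'struct page'; none of these names are existing kernel interfaces:

{{{
/*
 * Hypothetical sketch of "static page ownership": struct page grows one
 * back-pointer (page->container, invented here), set at first allocation
 * and cleared only on free.  Later users of the page are never charged.
 */
#include <linux/mm_types.h>
#include <linux/atomic.h>
#include <linux/errno.h>

struct container {			/* hypothetical */
	atomic_long_t	pages_used;
	long		page_limit;
};

/* called from the allocation path, e.g. just after a successful alloc_page() */
static int container_charge_page(struct container *cont, struct page *page)
{
	if (atomic_long_add_return(1, &cont->pages_used) > cont->page_limit) {
		atomic_long_dec(&cont->pages_used);
		return -ENOMEM;		/* caller would trigger reclaim or fail */
	}
	page->container = cont;		/* hypothetical new 'struct page' field */
	return 0;
}

/* called when the page is finally freed, regardless of who touched it last */
static void container_uncharge_page(struct page *page)
{
	if (page->container) {
		atomic_long_dec(&page->container->pages_used);
		page->container = NULL;
	}
}
}}}

The RSS-only scheme would instead consult the page's existing rmap chains at map and unmap time, which is where its extra runtime cost in the table above comes from.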