Just for my own sake, I'm calling every "memory pool"/zone/"resource group" a container here.

= Requirements =

 * A container must have some limit on the amount of memory it may use.
 * Limits must apply to unmapped page cache, anonymous memory, and shared memory.
 * "Use" is easily attributed to the allocating container. It is harder to determine which container is _currently_ or most actively using a page.
 * The overhead in storage, processing time, and dedicated lines of code in the greater kernel should be minimized.
 * Memory which is private to the container (say, anonymous memory) must be strictly accounted to that container.

= Icing on the Cake =

 * Limits on kernel structures would be nice to have.
 * Overcommitting memory should be allowed. We should not allow memory on a system to go completely unused.
 * Memory for files may be accounted to either the container or a shared pool.
  * Some care should be taken to ensure that a container may not abuse this shared pool.
  * It is preferable to determine when sharing is actually occurring, but approximate metrics should be OK. This requirement is very much secondary to keeping its overhead low.
  * One useful idea would be the ability to bind a directory hierarchy to a particular memory container. You could, for example, assign all of /lib to the "common" container.
 * Should allow runtime flexibility in the size and number of containers.
  * We should be able to change limits easily.
  * We should be able to create and destroy containers easily, to satisfy the needs of application containers (not all containers are long-lived).
  * A task might want to change containers at runtime. This might be a database or web server which wants to "do work" for a particular set of users, but doesn't want to go through the overhead of starting a whole new instance.

== Software Zones ==

Use the existing Linux zone model to create sets of contiguous memory. Each of these is a subset of a current 'struct zone'. Each container gets one or more of these zones from which to allocate its pages. Pages shared between containers are placed in centralized, "shared" zones. Because this approach reuses the existing Linux structures, it can do things like page reclaim with the existing algorithms. It can also be done with the existing fake NUMA and cpusets support, without substantial kernel changes.

However, each page still needs a page-to-"software zone" lookup mechanism, at least for returning the page to the proper allocator lists on free_page(). The nice part is that we already have a page to 'struct zone' lookup via each node's node_zones[] array (see the lookup sketch below). However, substantially increasing the number of zones will substantially increase the number of bits in page->flags needed to do proper lookups. It may also become infeasible to use a simple array in the node for these lookups.

== Static Page Ownership (the classic CKRM way, among others) ==

Add a pointer to 'struct page', and point it to an object that represents the container which caused the page's allocation. Don't change this until the page gets freed. Any other users of this page don't get charged for it (see the charging sketch below).

== Partial Page Ownership (Beancounters?) ==

Make sure that any additional users get charged, even if they are not the "first" user. Multiple users in a single container should not be charged multiple times. The overhead of figuring this out exactly could be more costly than the other approaches (see the sketch below).
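To make the software-zone lookup problem concrete, here is a minimal sketch of a page-to-software-zone lookup, modeled on the kernel's existing page_zone() path (zone id bits in page->flags indexing a table). SZONES_SHIFT, SZONES_MASK, szone_table[], and page_szone() are all hypothetical names, not existing kernel interfaces; the sketch only illustrates where the bits come from.

{{{
#include <linux/mm.h>

/*
 * Hypothetical page -> software-zone lookup.  The software-zone id
 * is carved out of page->flags, just as the existing node/zone id
 * bits are.  None of these names exist in the kernel.
 */
#define SZONES_SHIFT	10	/* 1024 software zones -> 10 flag bits */
#define SZONES_MASK	((1UL << SZONES_SHIFT) - 1)

/* One pointer per possible software zone, replacing the simple
 * per-node node_zones[] array: already 8KB of table on 64-bit. */
extern struct zone *szone_table[1UL << SZONES_SHIFT];

static inline struct zone *page_szone(struct page *page)
{
	/* software-zone id kept in the top bits of page->flags */
	return szone_table[(page->flags >> (BITS_PER_LONG - SZONES_SHIFT))
			   & SZONES_MASK];
}
}}}

With ten or more flag bits consumed this way, the remaining page->flags space gets tight quickly, which is exactly the concern raised above.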
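For static page ownership, the charge/uncharge path might look roughly like the following. struct mem_container, the page->container field, and both functions are made-up names standing in for the extra 'struct page' field this scheme requires; real code would also need to hook reclaim when a charge fails.

{{{
#include <linux/mm.h>
#include <linux/atomic.h>

/* Hypothetical static-ownership accounting. */
struct mem_container {
	atomic_long_t	usage;	/* pages currently charged */
	long		limit;	/* hard limit, in pages */
};

/* Called once, for the allocating container, at allocation time. */
static int mem_container_charge(struct page *page, struct mem_container *c)
{
	if (atomic_long_add_return(1, &c->usage) > c->limit) {
		atomic_long_dec(&c->usage);
		return -ENOMEM;	/* over limit: reclaim or fail the allocation */
	}
	page->container = c;	/* ownership never changes while allocated */
	return 0;
}

/* Called from free_page(); only the single owner is uncharged. */
static void mem_container_uncharge(struct page *page)
{
	if (page->container) {
		atomic_long_dec(&page->container->usage);
		page->container = NULL;
	}
}
}}}

Note that later users of the page never touch the counters. That is what makes the scheme cheap, and also what makes the DOS scenario in the table below possible.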
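Partial page ownership needs a record of every container using a page. One cheap approximation, assuming a small fixed maximum number of containers, is a per-page bitmap of container ids. Everything here (the one-word limit, page_users, both helpers) is invented for illustration and reuses the hypothetical mem_container from the previous sketch.

{{{
#include <linux/atomic.h>
#include <linux/bitops.h>

/* Hypothetical partial-ownership record: one bit per container,
 * so a container is charged at most once per page no matter how
 * many of its tasks use the page. */
#define MAX_CONTAINERS	BITS_PER_LONG	/* bitmap must fit one word */

struct page_users {
	unsigned long bitmap;	/* bit N set => container N is charged */
};

static void partial_charge(struct page_users *pu, int cid,
			   struct mem_container *containers)
{
	/* charge only on this container's first use of the page */
	if (!test_and_set_bit(cid, &pu->bitmap))
		atomic_long_inc(&containers[cid].usage);
}

static void partial_uncharge(struct page_users *pu, int cid,
			     struct mem_container *containers)
{
	/* approximate: clearing on any unmap uncharges the container
	 * even if other tasks in it still use the page; an exact
	 * scheme needs per-container use counts, at more cost */
	if (test_and_clear_bit(cid, &pu->bitmap))
		atomic_long_dec(&containers[cid].usage);
}
}}}

Even this word-sized bitmap adds per-page storage on top of the static scheme, and doing the accounting exactly would cost more still, which is the overhead concern above.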
----
|| || Software Zones || Static Page Ownership || Partial Page Ownership ||
|| enforces memory limits || || || ||
|| code overhead || || || ||
|| storage overhead || || Extra 'struct page' field || At least the cost of 'static' page ownership ||
|| runtime overhead || || || ||
|| resize at runtime || The physical contiguity requirement will inhibit growth. But if you have lots of small zones, and allow several to be assigned to a single container, you can resize reasonably easily. || || ||
|| creation at runtime || Must find a physically contiguous area to use; cannot simply take a bit from each existing container. || || ||
|| recognize page sharing || Requires a "shared" zone. || Doesn't recognize use by multiple containers, but could have a "shared" container. || ||
|| support overcommit || Overcommit is trickier because of the static assignment of zones to containers. But with a few minor hooks in the kernel (directed reclaim and OOM notifications), userspace can juggle the zone assignments to wherever they're needed, allowing overcommit. || || ||
|| vulnerable to DOS attack || No charge for shared data access, so a container that stops using something cannot push another over its limit. || A container can stop using shared data at an opportune time to force another container over its limit. || Same as the static scheme. ||