LinuxMM:

The term "Virtual Memory" is used to describe a method by which the physical RAM of a computer is not directly addressed, but is instead accessed via an indirect "lookup". On the Intel platform, paging is used to accomplish this task.

Paging, in CPU-specific terms, should not be confused with swap. The terms are related, but paging here refers to virtual-to-physical address translation. The author encourages readers to find the Intel manuals online or order them in print for a deeper understanding of the Intel paging system. (Note: in Intel documents the term 'linear address' is used where the kernel code says 'virtual address'.)

To accomplish address translation (paging) the CPU needs to be told:

a) where to find the address translation information. This is accomplished by pointing the CPU to a lookup table called a 'page table'.

b) to activate paging mode. This is accomplished by setting a specific flag (the PG bit) in control register CR0.

Kernel use of virtual memory begins very early on in the boot process. head.S contains code to create provisional page tables and get the kernel up and running, however that is beyond this overview.

Every physical page of memory up to 896MB is mapped directly into the kernel space. Memory greater than 896MB (High Mem) is not permanently mapped, but is instead temporarily mapped using kmap and kmap_atomic (see HighMemory).

The descriptions of virtual memory will be broken into two distinct sections; kernel paging and user process paging.

Kernel Initialization:

Paging is initialized in arch/i386/mm/init.c. The function 'paging_init()' is called once by setup_arch during kernel initialization. It immediately calls pagetable_init(). pagetable_init() starts by defining the base of the page table directory:

 pgd_t *pgd_base = swapper_pg_dir;

swapper_pg_dir is defined in head.S using .org directives (.org allows structures to be placed at desired memory locations). It sits 0x1000 above the 'root' of kernel memory. Kernel memory is defined to start at PAGE_OFFSET, which on x86 is 0xC0000000, or 3 gigabytes. (This is where the 3gig/1gig split is defined.) Every virtual address above PAGE_OFFSET belongs to the kernel; any address below PAGE_OFFSET is a user address.

After some capability checking, pagetable_init() calls kernel_physical_mapping_init(). This function performs the lion's share of the kernel page table setup.

Definitions:
pgd = Page (Global) Directory
pmd = Page Middle Directory
pte = Page Table Entry

Looping over each pmd and pte, kernel_physical_mapping_init calls one_md_table_init and one_page_table_init respectively. These functions create new page middle directories and page tables by allocating space with the boot memory allocator. In non-PAE mode (PAE, or Physical Address Extension, allows Intel architectures to address more than 4 GB of physical memory), the pmd is folded into the pgd and no separate memory is allocated for it. Here is the important part of one_page_table_init:

 pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
 set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));

The first line allocates a page of memory to hold the table using the bootmem allocator; the second inserts the table's physical address, plus flag bits, into the pmd entry.

Once the table is returned, kernel_physical_mapping_init populates the page table using code similar to this:

 set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));

This code populates the page tables in a linear fashion: the mapping from physical page number to virtual address is linear and differs only by PAGE_OFFSET. To translate a physical address to a virtual address, one only needs to add PAGE_OFFSET (0xC0000000). This can be seen in the macro __va from page.h:

#define __va(x)                 ((void *)((unsigned long)(x)+ PAGE_OFFSET))

The virtual address of x is returned by adding PAGE_OFFSET.

Once the page tables have been set up, pagetable_init() calls permanent_kmaps_init() to set up the page table entries used by kmap. Recall that kmap is used to temporarily map high memory (>896MB) into the kernel as required (see HighMemory).

Once all is set, pagetable_init() returns to paging_init(), which loads the new page table address into CR3:

load_cr3(swapper_pg_dir);

After flushing the TLBs to force a reload of the new page tables, paging_init() calls kmap_init(), the last piece of the paging setup. It completes the kmap configuration begun above.

Kernel paging is active. Once paging is active, the kernel can address all physical memory (aside from HighMem) via linear addressing starting at PAGE_OFFSET (0xC0000000 in 3/1 split).

User Space Virtual Memory:

Every process in linux is able to address 4 gigabytes of linear address space. In a standard kernel config, the first 3 gigabytes (0x00000000 - 0xC0000000) are referred to as 'user space' and represent data, functions and the stack of user processes. The top 1 gigabyte (0xC0000000 - 0xFFFFFFFF) of memory is 'kernel space'. User processes typically do not have access to kernel memory space, and will normally not address this region.

Process virtual memory is handled using a number of internal structures. The first of interest is mm_struct:

http://sosdg.org/~coywolf/lxr/source/include/linux/sched.h?v=2.6.16-rc1#L293

mm_struct provides the top level management of a process' memory space. By referring to the link above, we see some important items:

struct vm_area_struct * mmap;

A list of vma structs (described later) that comprise the VM space of the process

pgd_t * pgd;

A pointer to the process page tables

unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;

And some familiar items that indicate the start and end of various user process sections (code, data, stack).

As we can see, the mm_struct maintains the overall picture of a process's memory profile. The page tables described above keep track of the physical pages the kernel has allocated to the process; those pages may be in low or high memory. It is important to note that process page tables map high pages directly, without the kmap restriction that HighMem imposes on the kernel's own mappings.

The detail of each virtual area in a user process is stored in vm_area_struct. The definition given in the kernel source is:

"... A VM area is any part of the process virtual memory space that has a special rule for the page-fault handlers (ie a shared library, the executable area etc)."

The structure can be seen here:

http://sosdg.org/~coywolf/lxr/source/include/linux/mm.h?v=2.6.15#L57

Each discrete area in a process virtual memory space has a vm_area_struct to describe, among other things, its start, end, parent mm_struct, permissions, file-mapping information, and a number of tree member pointers for fast searching of the VM space.

With these data structures, the kernel is able to manage memory for user processes. Allocating, freeing and moving/swapping (PageFaultHandling) of pages can occur with the data stored here.

For more information, the interested reader is directed to the main wiki pages here. Other good sources of information include Mel Gorman's book on the Linux Virtual Memory Manager, Understanding the Linux Kernel (O'Reilly) and Linux Device Drivers 3 (LDD3).

IRC convo on virtual memory:

<saxm> is everything from PAGE_OFFSET onwards paged?
<riel> saxm: depends, what do you mean by "paged" and what do you mean by "everything" ? ;))
* riel could find exceptions on either side of PAGE_OFFSET, depending on which meanings you want to use
<saxm> riel:  "paged" as in hardware paged by the cpu, "everything" meaning addressable memory
<riel> after bootup, all memory is accessed through the MMU
<ahu> riel, do you recall when current mainline 2.6.10 will decide not to cache a file?
<riel> so everything before and after PAGE_OFFSET is paged
<riel> not everything can be demand paged, though ...
<ahu> for example, when I do: open() seek() read() close()
<ahu> I seem to recall that sequential reads were special cased?
<saxm> riel:  but there's a difference between paging above and below PAGE_OFFSET?? Process pages below PAGE_OFFSET map to kernel pages above PAGE_OFFSET?
<riel> pages below PAGE_OFFSET belong to userspace
<riel> and can be demand paged
<riel> addresses above PAGE_OFFSET are kernel memory
<saxm> riel:  so there is no linear mapping between pages in virtual memory and consecutive area of physical memory?
<riel> there is a linear mapping for the first 900 MB of kernel memory
<riel> where physical address 0 - 896 MB is mapped into PAGE_OFFSET - PAGE_OFFSET+896MB
<Bertl> (depending on the split)
<saxm> riel:  ok, so there are 896*1024/4 physical frames addressable from PAGE_OFFSET->PAGE_OFFSET+896mb, and page directorys/tables map userspace page accesses to the appropriate page within this range?
<riel> saxm: no, userspace does not have access to the virtual memory beyond PAGE_OFFSET
<riel> saxm: userspace only gets access to virtual addresses below PAGE_OFFSET
<saxm> riel: just trying to understand how virtual pages relate to this mapped area of memory from PAGE_OFFSET to PAGE_OFFSET+896?
<riel> memory above PAGE_OFFSET is kernel virtual memory
<riel> part of it is a direct map of the first part of physical memory
<riel> but that same physical memory could also get virtual mappings from elsewhere, eg. userspace
<riel> or vmalloc
<riel> also, userspace and vmalloc can map physical memory from outside the 896MB of direct mapped memory (as well as inside it)
<saxm> riel:  ok, multiple mappings to physical pages, that clears things up for me!
<saxm> riel:  so how does it works for kernel memory? kernel memory allocations (for page tables etc...) must come out of that 896meg chunk too?
<riel> most kernel memory allocation needs to come from that 896 MB, indeed
<riel> though page tables are the big exception ;)
<saxm> riel:  which means they're resident in memory all the time - if that's where physical memory is mapped to?
<riel> kernel data structures are always resident
<saxm> riel:  so where do page tables reside? Surely not below PAGE_OFFSET? Somewhere above PAGE_OFFSET+896mb then?
<riel> they could reside anywhere
<saxm> anywhere from 0->4gb (on x86 with no pae)?
<maks> once it was recommended for lower latency by audio folks, it turns out that todays ext3 is for them the best bet too.
<maks> echan pardon
<riel> saxm: yeah
<riel> saxm: so it could be either inside the low 896MB, or in highmem (or some page tables in both - more likely)
<saxm> riel: and that 896meg chunk of physical memory addressed at PAGE_OFFSET, is also pageable right? So kernel allocations (not including page tables) just set some flag to disable paging on that page?
<riel> ummmmmmmmmm, they map physical memory
<riel> physical memory is, by definition, not pageable
<riel> the contents of those pages might be pageable though
<riel> so you could have a page P at physical address 400MB
<riel> a process (eg. mozilla) is using that page
<riel> at virtual address 120MB
<riel> somewhere in its heap
<riel> the contents of the physical page can be paged out, at which point mozilla's heap page at 120MB is paged out
<riel> but the kernel mapping (at PAGE_OFFSET + 400MB) still maps the same page P
<riel> just with different contents ;)
<saxm> riel: thanks for that very helpful example!


["CategoryLinuxMMInternals"]

LinuxMM: VirtualMemory (last edited 2006-02-07 07:00:33 by pv108234)