Remappable memory
Drivers often implement mmap() to give userspace direct access to memory that was allocated or reserved in kernel space. For example, you may wish to give userspace direct access to a kernel-allocated buffer that is used for DMA with a PCI device. LDD3 chapter 15 provides a decent introduction to this topic. In summary, LDD3 explains that you can either remap kernel buffers into userspace by calling remap_pfn_range() from your driver's mmap handler, or set up a nopage VM handler to remap on a page-by-page basis.
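A minimal sketch of the remap_pfn_range() approach follows. The my_buf_phys variable (the buffer's physical base address) and MY_BUF_SIZE are hypothetical names for this example, and error handling is kept to a minimum:

    #include <linux/mm.h>
    #include <linux/fs.h>

    /* hypothetical: physical base address and size of a contiguous buffer */
    static unsigned long my_buf_phys;
    #define MY_BUF_SIZE (64 * 1024)

    static int my_mmap(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long size = vma->vm_end - vma->vm_start;

            /* refuse mappings that are larger than the buffer */
            if (size > MY_BUF_SIZE)
                    return -EINVAL;

            /* remap the whole buffer in one go; remap_pfn_range() takes a
             * page frame number, i.e. the physical address >> PAGE_SHIFT */
            return remap_pfn_range(vma, vma->vm_start,
                                   my_buf_phys >> PAGE_SHIFT,
                                   size, vma->vm_page_prot);
    }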
Physical addresses vs struct page pointers
LDD3 does not explicitly discuss one important difference between remap_pfn_range() and nopage: remap_pfn_range() operates on physical addresses (expressed as page frame numbers), whereas nopage operates on struct page pointers. This is significant because not all kinds of memory are backed by struct page structures, so there are scenarios where nopage cannot be used. An LWN article mentions this limitation:
- Meanwhile, one of the longstanding limitations of nopage() is that it can only handle situations where the relevant physical memory has a corresponding struct page. Those structures exist for main memory, but they do not exist when the memory is, for example, on a peripheral device and mapped into a PCI I/O memory region. [...] In such cases, drivers must explicitly map the memory into user space with remap_pfn_range() instead of using nopage().
Another very common scenario where nopage cannot be used is remapping a buffer that was allocated by kmalloc(). You may be tempted to call virt_to_page(addr) to obtain a struct page pointer for a kmalloced address, but this violates the abstraction: kmalloc() does not return pages, it returns a different type of memory object. The remap_pfn_range() approach, on the other hand, is legal because remap_pfn_range() never touches the underlying struct pages - it works purely with page frame numbers.
It is also worth mentioning that it is legal to remap buffers allocated by vmalloc() through the nopage handler, thanks to the vmalloc_to_page() function.
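For example, a nopage handler for a vmalloc()ed buffer could look roughly like this - a sketch against the pre-2.6.23 nopage interface, with hypothetical my_buf and MY_BUF_SIZE names:

    /* hypothetical: buffer allocated with vmalloc() elsewhere */
    static char *my_buf;

    static struct page *my_vma_nopage(struct vm_area_struct *vma,
                                      unsigned long address, int *type)
    {
            unsigned long offset = address - vma->vm_start +
                                   (vma->vm_pgoff << PAGE_SHIFT);
            struct page *page;

            if (offset >= MY_BUF_SIZE)
                    return NOPAGE_SIGBUS;

            /* vmalloc_to_page() resolves a vmalloc address to its struct page */
            page = vmalloc_to_page(my_buf + offset);
            get_page(page);         /* take a reference for this mapping */
            if (type)
                    *type = VM_FAULT_MINOR;
            return page;
    }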
Introducing nopfn
The LWN article referenced above also discussed the proposal of a new VM operation named nopfn. nopfn essentially solves the problem described above: whereas nopage cannot remap addresses that have no corresponding struct page, nopfn lets you remap based on physical address.
To implement a nopfn handler:
- Find the physical address of the page you want to remap, based on the faulting address within the VMA. Convert it to a PFN by shifting it right by PAGE_SHIFT bits.
- Call vm_insert_pfn() to modify the process address space.
- Return NOPFN_REFAULT.
You must also set the VM_PFNMAP flag in vma->vm_flags from your mmap handler.
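Putting those steps together, a nopfn implementation might look like this - a sketch against the 2.6.19+ interface, where my_buf_phys and MY_BUF_SIZE are hypothetical, the buffer is assumed physically contiguous, and error handling is kept minimal:

    static unsigned long my_vma_nopfn(struct vm_area_struct *vma,
                                      unsigned long address)
    {
            unsigned long offset = address - vma->vm_start;

            if (offset >= MY_BUF_SIZE)
                    return NOPFN_SIGBUS;

            /* install the PTE ourselves, then tell the core VM to refault */
            vm_insert_pfn(vma, address, (my_buf_phys + offset) >> PAGE_SHIFT);
            return NOPFN_REFAULT;
    }

    static struct vm_operations_struct my_vm_ops = {
            .nopfn = my_vma_nopfn,
    };

    static int my_mmap(struct file *file, struct vm_area_struct *vma)
    {
            vma->vm_flags |= VM_PFNMAP;     /* required for nopfn remapping */
            vma->vm_ops = &my_vm_ops;
            return 0;
    }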
nopfn was introduced in Linux 2.6.19.
Migrating nopage to fault
Linux 2.6.23 introduced a replacement for the nopage API, called fault. As usual, LWN has a good article. The nopage API was later removed once no users remained.
The migration from nopage to fault is quite simple, and there are plenty of examples in the kernel history.
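For instance, the vmalloc()-backed nopage handler sketched earlier translates to the fault API roughly as follows (hypothetical names as before):

    static int my_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            unsigned long offset = vmf->pgoff << PAGE_SHIFT;
            struct page *page;

            if (offset >= MY_BUF_SIZE)
                    return VM_FAULT_SIGBUS;

            page = vmalloc_to_page(my_buf + offset);
            get_page(page);
            vmf->page = page;       /* the core VM maps this page for us */
            return 0;
    }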
Migrating nopfn to fault
fault was intended to replace nopfn too, but that did not happen until Linux 2.6.26. nopfn will be removed in a future release, in favour of doing PFN-based remappings through the fault handler.
Migrating is fairly easy: again, set the VM_PFNMAP flag on the VMA and call vm_insert_pfn() from your fault handler. Leave the page pointer in the vm_fault structure NULL where you might otherwise have set a struct page pointer, and return VM_FAULT_NOPAGE from your fault handler to tell the core VM that you installed the PTE yourself.
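The nopfn sketch from earlier then becomes something like the following (again hypothetical names and minimal error handling):

    static int my_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            unsigned long address = (unsigned long)vmf->virtual_address;
            unsigned long offset = address - vma->vm_start;

            if (offset >= MY_BUF_SIZE)
                    return VM_FAULT_SIGBUS;

            vm_insert_pfn(vma, address, (my_buf_phys + offset) >> PAGE_SHIFT);
            /* we installed the PTE ourselves, so there is no page to hand back */
            return VM_FAULT_NOPAGE;
    }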
It may appear possible to implement a PFN-based remapper through fault on pre-2.6.26 kernels, but don't bother: you'll hit a kernel BUG(), because the fault interface was not capable of PFN-based remappings in earlier releases.
mmap and real files
This is a cut'n'paste of an IRC conversation on the #kernelnewbies channel. One day this should be rewritten into a more easily readable article...
<bronaugh> if you're using mmap on a file descriptor, how are the changes eventually written to disk? what gets called?
<bronaugh> does the normal read/write function eventually get called?
<riel> bronaugh: two times
<riel> bronaugh: changes are written to disk either at/after msync(2) time, or after munmap(2) time
<riel> bronaugh: or, if the system has a memory shortage, by the pageout code
<bronaugh> alright. and that uses the normal read/write calls?
<rene> but I believe he means if the actual sys_read() / sys_write() code is getting called. to that, no, the actual "dirty" pages are written
<riel> bronaugh: no
<riel> bronaugh: data changed through mmap does not go through read/write syscalls
<bronaugh> ok. here's why I'm asking.
<bronaugh> I'm modifying framebuffer code for some nefarious purposes. I don't want a memory-backed framebuffer; I want all calls like that to go over the network.
<bronaugh> now, framebuffers have an fb_read and an fb_write call associated with them. these end up being called in fbmem.c by the main handler for read and write, which is set up in the file_operations struct.
<bronaugh> my question is -- will those routines be called?
<bronaugh> (given that they will be called normally by a read/write system call)
<bronaugh> sorry if I might be a bit confusing here.. just trying to get a handle on it myself
<riel> if you set those routines as the mmap read and write functions, yes
<bronaugh> ohh, special functions. ok.
<bronaugh> I'll dig into that.
<riel> you can set them at mmap(2) time
<bronaugh> ok, so how does one do that?
<bronaugh> (set the mmap read and write functions)
<riel> lets take a look at drivers/video/skeletonfb.c
<riel> static struct fb_ops xxxfb_ops = {
<bronaugh> alright.
<bronaugh> wish I'd looked at that. heh.
<riel> you can see it set .fb_read and .fb_write and .fb_mmap functions ?
<bronaugh> yup.
<bronaugh> I've set those up in my driver.
<bronaugh> they're stubbed but present.
<riel> wait, I forgot something important that is device driver specific
<riel> on a frame buffer, you want writes to show up on the screen immediately
<riel> you don't want to wait on msync() for your changes to hit the screen
<bronaugh> yeah.
<bronaugh> but this is a network framebuffer, so batching up writes is a plus.
<bronaugh> though you don't want to go -too- far with that.
<bronaugh> we'll just say it's a normal framebuffer as a simplifying assumption.
<bronaugh> normal but remote
<bronaugh> (ie, not in the same memory space)
<riel> one thing you could do every once in a while is initiate the msync from kernel space
<riel> not the cheapest thing to do, but ...
<bronaugh> it'd work in a pinch.
<riel> easy to verify the functionality, transparently to userspace
<bronaugh> ok so... back on topic. I don't see skeletonfb having an mmap func, just a stub.
<bronaugh> sorry. not a stub, just a declaration with no implementation.
<riel> indeed, the mmap function is in fbmem.c
<bronaugh> the main one, yeah. but that dispatches to others if they are present.
<bronaugh> I've looked at the main one, but I don't understand io_remap_pfn_range.
<bronaugh> I've followed the code, I know that eventually it mucks with page table entries.
<bronaugh> but beyond that it is opaque to me.
<riel> bronaugh: basically it maps physical addresses to page table entries
<riel> bronaugh: and may not be what you want when your frame buffer is backed by non-physically contiguous memory
<bronaugh> yeah, I was wondering about that.
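For reference, the fb_ops hooks discussed above are wired up roughly like this, mirroring drivers/video/skeletonfb.c (the xxxfb_* functions are the skeleton driver's placeholders):

    static struct fb_ops xxxfb_ops = {
            .owner          = THIS_MODULE,
            .fb_read        = xxxfb_read,   /* backs read(2) on the fbdev node */
            .fb_write       = xxxfb_write,  /* backs write(2) */
            .fb_mmap        = xxxfb_mmap,   /* dispatched to by fbmem.c's mmap */
            /* other hooks (fb_check_var, fb_set_par, ...) omitted */
    };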
<riel> I'm wondering if you might be better off hacking up ramfs and using a virtual file as your framebuffer
<bronaugh> so is there an alternate type of memory mapping I can set up; one such as used with files?
<bronaugh> because clearly that eventually has to call functions to do the IO; the problem is equivalent, a device with a different kind of address space.
<bronaugh> hmm, filemap.c...
<bronaugh> anyhow, how would one set up a mapping of that sort?
<riel> make it a file inside the page cache
<riel> then the VM can handle page faults for you
<bronaugh> ok, that's definitely what I want.
<bronaugh> but how do I go about doing that? is there somewhere I can read?
<riel> try fs/ramfs/
<bronaugh> alright.
<bronaugh> wow. short.
<riel> ramfs was written as a demonstration of what the VFS can do (and what filesystems do not have to do themselves)
<bronaugh> sounds like a worthy goal.
<bronaugh> ok, hmm. generic_file_mmap
<riel> you'll be able to chainsaw out lots of code from ramfs, since you won't need mounting, a directory, etc...
<bronaugh> yeah.
<bronaugh> it seems to me that I should be able to just plug in generic_file_mmap as my mmap handler.
<bronaugh> but - I need to see the code first.
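To make riel's suggestion concrete: the ramfs approach boils down to file_operations like the following, mirroring fs/ramfs from kernels of that era (my_fops is a hypothetical name; generic_file_mmap then lets the page cache service faults through the file's address_space operations):

    static const struct file_operations my_fops = {
            .read           = do_sync_read,
            .aio_read       = generic_file_aio_read,
            .write          = do_sync_write,
            .aio_write      = generic_file_aio_write,
            .mmap           = generic_file_mmap,    /* page cache backs the mapping */
            .llseek         = generic_file_llseek,
    };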