<nitingupta910 AT gmail DOT com>

Manages storage for variable sized data objects

It is designed especially for embedded devices.

This page describes the problem it tries to solve and its design details.

Problem Statement
Normally when you allocate arbitrary sized objects using kmalloc()/vmalloc() there is big space wastage due to internal fragmentation. So, if memory is at premium, tight storage is required for these variables sized data items which is what VStore does. This however comes at cost of some speed.

How to use it?
To store data:

vstore_write(objectID /* out */, data_to_store, len)

To restore this data:

vstore_read(objectID, buffer, len /* out */, alloc_flags)

NOTE: In vstore_read(), make sure that buffer is big enough to store data assoc. with object 'ObjectID' - the 'len' (out) param will tell how much data was read into the buffer. Thus, to make sure that buffer size is always sufficient, either you know maximum size for objects stored (more common case) or you will have to track size of each object stored to provide buffer of right size.

Design Details

Design Goals

Minimize fragmentation
Minimize metadata overhead
Fast store and restore
Provide dense storage - minimize spreading of chunks associated with any particular data object into multiple pages. This improves locality and hence performance.
Portability across different archs

Each data item stored is split into multiple chunks (small contiguous physical space) and these chunks can then be spread across multiple physical pages. The Metadata associated with each chunk is stored in the beginning of chunk itself with necessary padding added to maintain alignment constraints.

Chunks can be one of three types:

Free chunk
Busy chunk with next chunk in same page
Busy chunk with next chunk in different page

Following gives metadata layout for each of these types of chunks:

Some jargon first:

M     = Metadata - size has to be n * ALIGN_BYTES
M(A)  = address of next chunk relative to page start (PAGE_SHIFT - ALIGN_SHIFT bits)
M(S)  = size of chunk (same as M(A))
F     = flags (2 bits)
        F(1): last chunk
        F(2): next chunk in diff page
FP    = prev chunk's flags
FN    = next chunk's flags
PFN   = PageFrameNo where next chunk exists (BITS_PER_LONG - PAGE_SHIFT bits)
OD(i) = i-th optional metadata field. Optional fields are enclosed in []

- Free chunk

M(A) | F | M(S) | pad(align) | <free space> | pad(align)

- Busy chunk type-1 (next chunk in same page)

M(A) | F | M(S) | [ M(S) of prev chunk ] | pad(align) | <data> | pad(align)
        - OD(1) exists if F(2) is not set

- Busy chunk type-2 (next chunk in different page)

M(A) | F | PFN | [ M(S) of prev chunk ] | [ M(S) of this chunk ] | pad(align) | <data> | pad(align)
        - only OD(1) exists if          FP(2) && !F(1)
        - only OD(2) exists if          F(1) && !FP(2)
        - both OD(1) and OD(2) exist if FP(2) && F(1)

NOTE:

Space is not reserved for optional fields that do not exist. Flags let us know which fields really exist.
Chunks must begin at (n * ALIGN_BYTES) and their size (including metadata) must be (m * ALIGN_BYTES). Thus required amount of padding is added after metadata and at end of data/free space.

Chunk management

A Chunk never crosses page boundary.
Adjacent free chunks are always merged.
Chunks belonging to a data item are singly linked (using metadata in beginning of chunk).
Free lists are per page. Free chunks are (singly) links in order of increasing physical address (this eases merging of free chunks and is more hardware caches friendly).
Pages are then placed in free lists according to total free space they have. Pages with maximum free space are selected first for allocation. This gives a dense storage (i.e. we avoid chunks belonging to same data item being spread across many pages).

                 Freelists
       +---------+---------+---------+---------+---------+---------+---------+---------+
 Order |    4    |    5    |    6    |   --- --- --- --- |    10   |   11    |    12   |
       +---------+---------+---------+---------+---------+---------+---------+---------+
              ^                  ^                                     ^
              | page->lru        |----|                                |
              V                       V                                V
          +----------+  offset -> +----------+               pages with [2^11, 2^12)
          |    1     |            |  | F     |                     bytes free
page->    |  | M(A)  |      +---> +--|-------+
 offset-->+--|-------+      |     |  | 2   | |
          |  | F  |  +      |     +--|-----|-+
          +--|----|--+      |     |  | 3   | |
          |  | 2  |  |------+     +--|-----|-+
          +--|----|--+ PFN + M(A) |  |     | |
          |  | F  V  |            |  V F   | |
          +--|-------+            +--------|-+
          |  V 1     |            |    2   V |
          +----------+            +----------+
              ^
              | page->lru
              V
          more pages

NOTE:

F = Free chunk
Same numbered chunks belong to same data item.
For every physical page, the underlying stuct page->offset field points to first free chunk in that page (i.e. per page free list).
Free space in a page is stored in page->private
Pages are added to one of the free list according to amount of free space they have.

Implementation

Now, we can represent these three chunk types as:

#define ALIGN_SHIFT    2
#define ALIGN_BYTES    (1 << ALIGN_SHIFT)
#define MA_SIZE        PAGE_SHIFT - ALIGN_SHIFT
#define MS_SIZE        MA_SIZE

- Representing Free chunk

struct free_chunk {
        unsigned long MA: MA_SIZE;
        unsigned long flags: 2;
        unsigned long MS: MS_SIZE;
} __attribute__ ((packed));

- Representing Busy chunk type-1 (next chunk in same page)

struct busy_type1_chunk {
        unsigned long MA: MA_SIZE;
        unsigned long flags: 2;
        unsigned long MS1: MS_SIZE;
        unsigned long MS2: MS_SIZE;
} __attribute__ ((packed));

- Representing Busy chunk type-2 (next chunk in different page)

struct busy_type2_chunk {
        unsigned long MA: MA_SIZE;
        unsigned long flags: 2;
        unsigned long PFN: BITS_PER_LONG - PAGE_SHIFT;
        unsigned long MS1: MS_SIZE;
        unsigned long MS2: MS_SIZE;
} __attribute__ ((packed));

VStore operations

vstore_read()

/*
 * Args (IN):
 *      obj_id: handle of the required object (as returned by vstore_write)
 *      data: buffer space to read object data into
 * Args (OUT)
 *      len: length of data read into buffer
 *
 * Return:
 *      0 on success, <0 on error (should NEVER occur)
 *      Errors:
 *              -ENOMEM
 *
 * NOTE: Make sure that buffer 'data' has enough space for object 'obj_id'.
 */
int vstore_read(unsigned long obj_id, void *data, unsigned int *len);

vstore_write()

/*
 * Args (IN):
 *      data: pointer to data to be copied to vstore
 *      len: data length (in bytes)
 *      alloc_flags: allocation flags used when we need to allocate more chunks to store this data (e.g. GFP_KERNEL)
 * Args (OUT):
 *      obj_id: handle to object stored
 *
 * Return:
 *      0 on success, <0 on error
 *      Errors:
 *              -ENOMEM
 */
int vstore_write(unsigned long *obj_id, void *data, unsigned int *len, unsigned long alloc_flags);

vstore_read
Time: O(num_chunks)

Sub ops:

Collect data from all chunks into given buffer - O(num_chunks)
Free these chunks and merge them with adjacent free chunks - O(num_chunks)

vstore_write
Time: O(num_chunks) (if enough free chunks are already in freelists)

Sub ops:

Allocate required no. of chunks from freelists - O(num_chunks)
If sufficient free chunks are not available, allocate additional chunks (allocate 0-order page and add as single chunk in highest order freelist) - might sleep depending on alloc_flags.
Copy data to these chunks - O(num_chunks)