Build your own SQLite, Part 6: Overflow pages


Up to this point, we've been using simple test databases where the data for each row fits within a single page. However, in the wild, it is quite common for a row to be larger than a single page (typically 4096 bytes), especially when using variable-length fields like TEXT or BLOB. How does SQLite handle such cases?
In this post, we'll explore the overflow mechanism in SQLite and implement it in our own toy database, allowing us to read large TEXT and BLOB values.
As usual, the source code for this post is available on GitHub.
Overview
Overflow pages structure
------------------------

Page 3                          Page 4
┌─────────────────────────┐    ┌─────────────────────────┐
│ Next: Page 4 │ Data...  │───>│ Next: NULL │ Data...    │
└─────────────────────────┘    └─────────────────────────┘
When a field's value is so large that the row doesn't fit in a single page, our database has two questions to answer:
- where, and in what format, should it store the data that doesn't fit in a page?
- what should it write in the B-tree cell?
The answer to the first question is quite elegant. The data that doesn't fit is split into multiple overflow pages, using the following structure: the first four bytes of every overflow page contain the page number of the next overflow page (or 0 if there are no more overflow pages), and the rest of the page contains the overflow data. Overflow pages therefore form a linked list.
As for the second question, the B-tree cell will contain the first N bytes of the complete payload (where N is calculated using a formula we'll explore in the following sections), followed by the page number of the first overflow page.
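Concretely, a table-leaf cell whose payload overflows is laid out as follows (a sketch based on the description above):

// layout of a table-leaf cell whose payload overflows:
//
//   varint   total payload size P
//   varint   rowid
//   N bytes  first N bytes of the payload
//   4 bytes  big-endian page number of the first overflow page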
With this quick overview out of the way, let's dive into the implementation!
Rust edition
Some code in this post uses the new let chains feature from Rust 1.88.0. If you want to follow along, make sure to update your Cargo.toml to use the 2024 edition:
# Cargo.toml
[package]
# [...]
edition = "2024"
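If you haven't encountered them yet, let chains allow combining pattern matches and boolean conditions in a single if or while expression. A minimal, purely illustrative example:

// let chains: a pattern binding and a boolean condition in one `if`,
// without the nesting that `if let` alone would require
fn first_char_of_short_str(s: &str) -> Option<char> {
    if let Some(c) = s.chars().next()
        && s.len() <= 3
    {
        return Some(c);
    }
    None
}

fn main() {
    assert_eq!(first_char_of_short_str("hi"), Some('h'));
    assert_eq!(first_char_of_short_str("hello"), None);
}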
Erratum
If you're one of the early readers of the first post and coded along, a small mistake might have slipped into your code: the read_varint_at function - which decodes a variable-length integer - treats varints as little-endian, while SQLite uses big-endian encoding. Here is the corrected version:
pub fn read_varint_at(buffer: &[u8], mut offset: usize) -> (u8, i64) {
    let mut size = 0;
    let mut result = 0;

    while size < 9 {
        let current_byte = buffer[offset] as i64;

        if size == 8 {
            result = (result << 8) | current_byte;
        } else {
            result = (result << 7) | (current_byte & 0b0111_1111);
        }

        offset += 1;
        size += 1;

        if current_byte & 0b1000_0000 == 0 {
            break;
        }
    }

    (size, result)
}
You'll notice that we now shift the result accumulator at each iteration instead of shifting the current_byte. The ninth byte special case is also handled differently, as it is no longer the most significant byte.
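As a quick sanity check of the big-endian behaviour, the two-byte varint 0x81 0x00 should decode to 128, not 1 (the values below are illustrative):

#[test]
fn varint_is_big_endian() {
    // 0x81: continuation bit set, payload bits 0b0000001
    // 0x00: last byte, payload bits 0b0000000
    // big-endian result: (1 << 7) | 0 = 128
    assert_eq!(read_varint_at(&[0x81, 0x00], 0), (2, 128));
    // a single byte with the continuation bit clear decodes to itself
    assert_eq!(read_varint_at(&[0x7F], 0), (1, 127));
}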
Building the test database
To test our implementation, we'll need a test database containing rows with fields large enough to trigger the overflow mechanism. To create such a database, you can use the following commands:
sqlite3 test.db
sqlite> create table t1(id integer, value string);
sqlite> insert into t1(id, value) values (42, printf('%.*c', 10000, 'a'));
printf('%.*c', 10000, 'a') returns a string composed of 10,000 a's.
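Since a 10,000-character value can't fit in a single 4096-byte page, we can already sanity-check the result by looking at the file size; a SQLite database file is always a whole number of pages. A minimal sketch, assuming the default 4096-byte page size:

// quick check that test.db now spans several pages
use std::fs;

fn main() -> std::io::Result<()> {
    let len = fs::metadata("test.db")?.len();
    println!("{len} bytes = {} pages of 4096 bytes", len / 4096);
    Ok(())
}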
Computing the local payload size
As we mentioned, when reading a record that's too big to fit in a single page, our first task is to compute the local payload size. This is the number of bytes that will be stored directly in the B-tree leaf cell. To compute this size, we first need to define a few variables:
- P, the payload size
- U, the usable page size (which is the page size minus the number of reserved bytes)
- X = U - 35, the overflow threshold: if the payload size is less than or equal to X, it will be stored entirely in a B-tree leaf cell, without overflow
- M = ((U-12)*32/255)-23, the minimum local payload size
- K = M+((P-M)%(U-4)), a candidate local payload size
The local payload size is computed according to the following rules (a worked example follows the list):
- If P <= X, the local payload size is P and there is no overflow.
- Otherwise, there is an overflow, and:
  - If K <= X, the local payload size is K.
  - Otherwise, the local payload size is M.
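To make the formulas concrete, here is a worked example with hypothetical values: a 10,000-byte payload on a database with 4096-byte pages and no reserved bytes (so U = 4096):

// a worked example of the local payload size computation
fn main() {
    let p: usize = 10_000; // payload size
    let u: usize = 4_096; // usable page size

    let x = u - 35; // 4061: the overflow threshold
    let m = ((u - 12) * 32 / 255) - 23; // 489: minimum local payload size
    let k = m + ((p - m) % (u - 4)); // 1816

    assert!(p > x); // the payload overflows
    let local = if k <= x { k } else { m };
    assert_eq!(local, 1816);

    // the remaining 8184 bytes fill exactly two overflow pages of
    // u - 4 = 4092 usable bytes each: that's what the modulo buys us
    assert_eq!(p - local, 2 * (u - 4));
}

Note how the modulo in K ensures that, when K is used, the overflow data is an exact multiple of the usable overflow page size, so every overflow page is completely full.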
We'll start by implementing the computation of U, the usable page size. It differs from the page size because SQLite reserves a few bytes at the end of each page for use by extensions. The exact number of reserved bytes is defined by a one-byte integer in the database header, at offset 20.
First, we'll extend our DbHeader struct to include the number of reserved bytes:
// src/page.rs
#[derive(Debug, Copy, Clone)]
pub struct DbHeader {
    pub page_size: u32,
+   pub page_reserved_size: u8,
}

+impl DbHeader {
+   pub fn usable_page_size(&self) -> usize {
+       self.page_size as usize - (self.page_reserved_size as usize)
+   }
+}
We also added a utility method that returns the value of U, computed from the page size and the reserved bytes count.
Our db header parsing function must also be updated to populate the new field:
// src/pager.rs
+const HEADER_PAGE_RESERVED_SIZE_OFFSET: usize = 20;
// [...]
pub fn parse_header(buffer: &[u8]) -> anyhow::Result<page::DbHeader> {
    if !buffer.starts_with(HEADER_PREFIX) {
        let prefix = String::from_utf8_lossy(&buffer[..HEADER_PREFIX.len()]);
        anyhow::bail!("invalid header prefix: {prefix}");
    }

    let page_size_raw = read_be_word_at(buffer, HEADER_PAGE_SIZE_OFFSET);
    let page_size = match page_size_raw {
        1 => PAGE_MAX_SIZE,
        n if n.is_power_of_two() => n as u32,
        _ => anyhow::bail!("page size is not a power of 2: {}", page_size_raw),
    };

+   let page_reserved_size = buffer[HEADER_PAGE_RESERVED_SIZE_OFFSET];

-   Ok(page::DbHeader { page_size })
+   Ok(page::DbHeader {
+       page_size,
+       page_reserved_size,
+   })
}
With this in place, we can implement the local_payload_size method according to the formulas defined above:
// src/page.rs
// [...]
impl PageHeader {
    fn local_payload_size(
        &self,
        db_header: &DbHeader,
        payload_size: usize,
    ) -> anyhow::Result<usize> {
        match self.page_type {
            PageType::TableInterior => bail!("no payload size for interior pages"),
            PageType::TableLeaf => {
                let usable = db_header.usable_page_size();
                let max_size = usable - 35;

                if payload_size <= max_size {
                    return Ok(payload_size);
                }

                let min_size = ((usable - 12) * 32 / 255) - 23;
                let k = min_size + ((payload_size - min_size) % (usable - 4));
                let size = if k <= max_size { k } else { min_size };

                Ok(size)
            }
        }
    }

    pub fn local_and_overflow_size(
        &self,
        db_header: &DbHeader,
        payload_size: usize,
    ) -> anyhow::Result<(usize, Option<usize>)> {
        let local = self.local_payload_size(db_header, payload_size)?;

        if local == payload_size {
            Ok((local, None))
        } else {
            Ok((local, Some(payload_size.saturating_sub(local))))
        }
    }
}
For table-leaf pages, the implementation of local_payload_size is a straightforward translation of the formulas. For interior pages, we simply return an error: they do not contain an actual payload, so the concept of a local payload size does not apply. In future posts, we'll discover index pages, which will require us to implement a slightly altered version of the formulas.
We also defined a convenient method that computes the local payload size and the overflow size, if any.
How are we going to use this information? When we parse a B-tree leaf cell, we'll compute the local and overflow sizes. If the overflow size is not None, we know that a pointer to the first overflow page is stored right after the local payload, so we'll read this pointer and record it in the resulting Cell struct. We don't want to read the content of the overflow pages just yet, as we have no way to know if the query will actually need it.
Let's modify our Cell struct and the corresponding parsing function:
// src/page.rs
// [...]
#[derive(Debug, Clone)]
pub struct TableLeafCell {
    pub size: i64,
    pub row_id: i64,
    pub payload: Vec<u8>,
+   pub first_overflow: Option<usize>,
}
// src/pager.rs
// [...]
-fn parse_table_leaf_cell(mut buffer: &[u8]) -> anyhow::Result<page::Cell> {
+fn parse_table_leaf_cell(
+   db_header: &DbHeader,
+   header: &PageHeader,
+   mut buffer: &[u8],
+) -> anyhow::Result<page::Cell> {
    let (n, size) = read_varint_at(buffer, 0);
    buffer = &buffer[n as usize..];

    let (n, row_id) = read_varint_at(buffer, 0);
    buffer = &buffer[n as usize..];

+   let (local_size, overflow_size) = header.local_and_overflow_size(db_header, size as usize)?;
+   let first_overflow = overflow_size.map(|_| read_be_double_at(buffer, local_size) as usize);

-   let payload = buffer[..size as usize].to_vec();
+   let payload = buffer[..local_size].to_vec();

    Ok(page::TableLeafCell {
        size,
        row_id,
        payload,
+       first_overflow,
    }
    .into())
}
The modifications to parse_table_leaf_cell are straightforward: we leverage our utility method to compute the local and overflow sizes, as well as the first overflow page pointer, if any.
Note that we modified the function's signature to accept a reference to the database header, so you'll need to propagate this change to the caller, and adapt the signature of parse_table_interior_cell as well.
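If your code follows the structure from the previous posts, the updated call site might look roughly like this (a sketch: the parse_cell helper and the interior cell signature are assumptions, adapt them to your own code):

// src/pager.rs -- a hypothetical dispatch helper, updated to thread the
// database header and page header through to the cell parsers
fn parse_cell(
    db_header: &DbHeader,
    header: &PageHeader,
    buffer: &[u8],
) -> anyhow::Result<page::Cell> {
    match header.page_type {
        PageType::TableLeaf => parse_table_leaf_cell(db_header, header, buffer),
        PageType::TableInterior => parse_table_interior_cell(db_header, header, buffer),
    }
}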
We're not far from having a complete implementation of the overflow mechanism. What's left is to implement a way to read and parse the overflow pages, and exercise it when reading fields from a cursor.
Reading overflow pages
Before we implement the parsing, we'll create a type to represent an overflow page:
// src/page.rs
// [...]
#[derive(Debug, Clone)]
pub struct OverflowPage {
    pub next: Option<usize>,
    pub payload: Vec<u8>,
}
Parsing an overflow page is quite simple: the first four bytes contain the next overflow page pointer (or 0 if there are no more overflow pages), and the rest of the page contains the overflow payload:
// src/pager.rs
// [...]
fn parse_overflow_page(buffer: &[u8]) -> page::OverflowPage {
    let next = read_be_double_at(buffer, 0);

    page::OverflowPage {
        payload: buffer[4..].to_vec(),
        next: if next != 0 { Some(next as usize) } else { None },
    }
}
Since our Pager's cache expects to only store Page structs, we need to adapt it so it can read and cache either a Page or an OverflowPage.
First, we'll define a new enum to represent either a Page or an OverflowPage:
// src/pager.rs
// [...]
#[derive(Debug, Clone)]
enum CachedPage {
    Page(Arc<page::Page>),
    Overflow(Arc<page::OverflowPage>),
}

impl From<Arc<page::Page>> for CachedPage {
    fn from(value: Arc<page::Page>) -> Self {
        CachedPage::Page(value)
    }
}

impl TryFrom<CachedPage> for Arc<page::Page> {
    type Error = anyhow::Error;

    fn try_from(value: CachedPage) -> Result<Self, Self::Error> {
        if let CachedPage::Page(p) = value {
            Ok(p.clone())
        } else {
            bail!("expected a regular page")
        }
    }
}

impl From<Arc<page::OverflowPage>> for CachedPage {
    fn from(value: Arc<page::OverflowPage>) -> Self {
        CachedPage::Overflow(value)
    }
}

impl TryFrom<CachedPage> for Arc<page::OverflowPage> {
    type Error = anyhow::Error;

    fn try_from(value: CachedPage) -> Result<Self, Self::Error> {
        if let CachedPage::Overflow(o) = value {
            Ok(o.clone())
        } else {
            bail!("expected an overflow page")
        }
    }
}
It's a simple enum with two variants, plus a few conversion traits that allow our Pager to handle both types seamlessly.
Then, we need to adapt our Pager to support both types. We'll extract the bulk of the logic into a new generic load method that takes a page number and a parsing function, and parses uncached pages with the provided function. Here is the updated Pager implementation:
// src/pager.rs
// [...]
#[derive(Debug)]
pub struct Pager<I: Read + Seek = std::fs::File> {
    input: Arc<Mutex<I>>,
    pages: Arc<RwLock<HashMap<usize, CachedPage>>>,
    header: DbHeader,
}

impl<I: Read + Seek> Pager<I> {
    pub fn new(header: DbHeader, input: I) -> Self {
        Self {
            input: Arc::new(Mutex::new(input)),
            pages: Arc::default(),
            header,
        }
    }

    pub fn read_overflow(&self, n: usize) -> anyhow::Result<Arc<page::OverflowPage>> {
        self.load(n, |buffer| Ok(parse_overflow_page(buffer)))
    }

    pub fn read_page(&self, n: usize) -> anyhow::Result<Arc<page::Page>> {
        self.load(n, |buffer| parse_page(&self.header, buffer, n))
    }

    fn load<T>(&self, n: usize, f: impl Fn(&[u8]) -> anyhow::Result<T>) -> anyhow::Result<Arc<T>>
    where
        Arc<T>: Into<CachedPage>,
        CachedPage: TryInto<Arc<T>, Error = anyhow::Error>,
    {
        {
            let read_pages = self
                .pages
                .read()
                .map_err(|_| anyhow!("poisoned page cache lock"))?;

            if let Some(page) = read_pages.get(&n).cloned() {
                return page.try_into();
            }
        }

        let mut write_pages = self
            .pages
            .write()
            .map_err(|_| anyhow!("failed to acquire pager write lock"))?;

        if let Some(page) = write_pages.get(&n).cloned() {
            return page.try_into();
        }

        let buffer = self.load_raw(n)?;
        let parsed = f(&buffer[0..self.header.usable_page_size()])?;
        let ptr = Arc::new(parsed);
        write_pages.insert(n, ptr.clone().into());

        Ok(ptr)
    }

    fn load_raw(&self, n: usize) -> anyhow::Result<Vec<u8>> {
        let offset = n.saturating_sub(1) * self.header.page_size as usize;

        let mut input_guard = self
            .input
            .lock()
            .map_err(|_| anyhow!("poisoned pager mutex"))?;

        input_guard
            .seek(SeekFrom::Start(offset as u64))
            .context("seek to page start")?;

        let mut buffer = vec![0; self.header.page_size as usize];
        input_guard.read_exact(&mut buffer).context("read page")?;

        Ok(buffer)
    }
}
Putting it all together
The main building blocks of our implementation are in place: we now know how to detect when a row is too large to fit in a single page, and we have a way to load overflow pages through our Pager. The last step is to lazily read the overflow data when accessing a field that requires it.
To do this, we'll implement an OverflowScanner with a read method that takes as input the index of the first overflow page and the minimum amount of overflow data to read. The scanner will follow the linked list until the required amount of data is read or there are no more overflow pages.
// src/cursor.rs
// [...]
#[derive(Debug)]
struct OverflowScanner {
    pager: Pager,
}

impl OverflowScanner {
    pub fn new(pager: Pager) -> Self {
        Self { pager }
    }

    pub fn read(&self, first_page: usize, size: usize) -> anyhow::Result<(Option<usize>, Vec<u8>)> {
        let mut next_page = Some(first_page);
        let mut buffer = Vec::with_capacity(size);

        while buffer.len() < size
            && let Some(next) = next_page
        {
            let overflow = self.pager.read_overflow(next)?;
            next_page = overflow.next;
            buffer.extend_from_slice(&overflow.payload);
        }

        Ok((next_page, buffer))
    }
}
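Here is how the scanner might be exercised in isolation (illustrative: we assume pager wraps our test database and that page 3 is the first overflow page of our big row):

// illustrative usage of the scanner, under the assumptions above
fn dump_overflow(pager: Pager) -> anyhow::Result<()> {
    let scanner = OverflowScanner::new(pager);
    let (remaining, data) = scanner.read(3, 8_000)?;
    // either we gathered at least the requested 8_000 bytes,
    // or the chain ended early
    assert!(data.len() >= 8_000 || remaining.is_none());
    Ok(())
}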
Our Cursor will use this new scanner in the following way: when reading a field, we'll compute the end offset of the field (based on the field's offset and size). If that end offset exceeds the size of the currently loaded payload, we'll read (end_offset - payload.len()) bytes through the overflow scanner and append the result to the payload. If we wanted to read fields so large that they can't fit in RAM, we would need a more sophisticated streaming mechanism, but for our purposes, reading the overflow data into memory is enough.
We'll start by implementing a utility method to compute the end offset of a field:
// src/cursor.rs
// [...]
impl RecordField {
    pub fn end_offset(&self) -> usize {
        let size = match self.field_type {
            RecordFieldType::Null => 0,
            RecordFieldType::I8 => 1,
            RecordFieldType::I16 => 2,
            RecordFieldType::I24 => 3,
            RecordFieldType::I32 => 4,
            RecordFieldType::I48 => 6,
            RecordFieldType::I64 => 8,
            RecordFieldType::Float => 8,
            RecordFieldType::Zero => 0,
            RecordFieldType::One => 0,
            RecordFieldType::String(size) | RecordFieldType::Blob(size) => size,
        };

        self.offset + size
    }
}
Next, we'll implement the overflow reading logic in the Cursor:
// src/cursor.rs
// [...]
#[derive(Debug)]
pub struct Cursor {
    header: RecordHeader,
    payload: Vec<u8>,
+   pager: Pager,
+   next_overflow_page: Option<usize>,
}

impl Cursor {
-   pub fn owned_field(&self, n: usize) -> Option<OwnedValue> {
-       self.field(n).map(Into::into)
-   }
+   pub fn owned_field(&mut self, n: usize) -> anyhow::Result<Option<OwnedValue>> {
+       Ok(self.field(n)?.map(Into::into))
+   }

-   pub fn field(&self, n: usize) -> Option<Value> {
+   pub fn field(&mut self, n: usize) -> anyhow::Result<Option<Value>> {
-       let record_field = self.header.fields.get(n)?;
+       let Some(record_field) = self.header.fields.get(n) else {
+           return Ok(None);
+       };

+       let end_offset = record_field.end_offset();
+       if end_offset > self.payload.len()
+           && let Some(overflow_page) = self.next_overflow_page
+       {
+           let overflow_size = end_offset.saturating_sub(self.payload.len());
+           let (next_overflow, overflow_data) = OverflowScanner::new(self.pager.clone())
+               .read(overflow_page, overflow_size)
+               .context("read overflow page")?;
+           self.next_overflow_page = next_overflow;
+           self.payload.extend_from_slice(&overflow_data);
+       }

-       match record_field.field_type {
+       let value = match record_field.field_type {
            RecordFieldType::Null => Some(Value::Null),
            RecordFieldType::I8 => Some(Value::Int(read_i8_at(&self.payload, record_field.offset))),
            RecordFieldType::I16 => {
                Some(Value::Int(read_i16_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I24 => {
                Some(Value::Int(read_i24_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I32 => {
                Some(Value::Int(read_i32_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I48 => {
                Some(Value::Int(read_i48_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I64 => {
                Some(Value::Int(read_i64_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::Float => Some(Value::Float(read_f64_at(
                &self.payload,
                record_field.offset,
            ))),
            RecordFieldType::String(length) => {
                let value = std::str::from_utf8(
                    &self.payload[record_field.offset..record_field.offset + length],
                )
                .expect("invalid utf8");
                Some(Value::String(Cow::Borrowed(value)))
            }
            RecordFieldType::Blob(length) => {
                let value = &self.payload[record_field.offset..record_field.offset + length];
                Some(Value::Blob(Cow::Borrowed(value)))
            }
            RecordFieldType::One => Some(Value::Int(1)),
            RecordFieldType::Zero => Some(Value::Int(0)),
-       }
+       };

+       Ok(value)
    }
}
Note that we added fields to the Cursor struct and slightly modified its method signatures. These changes will need to be propagated to the consumers of the Cursor struct.
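For instance, a row-printing helper from the earlier posts would now need a mutable cursor and error propagation; a hypothetical sketch (print_row and column_count are illustrative names):

// a hypothetical consumer, adapted to the new signatures
fn print_row(cursor: &mut Cursor, column_count: usize) -> anyhow::Result<()> {
    for i in 0..column_count {
        // owned_field now takes &mut self and returns a Result
        let value = cursor.owned_field(i)?;
        println!("{value:?}");
    }
    Ok(())
}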
Conclusion
This concludes our implementation of the overflow mechanism. Our little database can now read large TEXT and BLOB fields that are split across multiple pages.
In the next post, we'll get back to the query engine and implement simple WHERE clauses, allowing us to filter rows based on their content.