Build your own SQLite, Part 6: Overflow pages


Up to this point, we've been using simple test databases where the data for each row fits within a single page. However, in the wild, it is quite common for a row to be larger than a single page (typically 4096 bytes), especially when using variable-length fields like TEXT or BLOB. How does SQLite handle such cases?
In this post, we'll explore the overflow mechanism in SQLite and implement it in our own toy database, allowing us to read large TEXT and BLOB values.
As usual, the source code for this post is available on GitHub.
Overview
Overflow pages structure
------------------------

Page 3                          Page 4
┌─────────────────────────┐    ┌─────────────────────────┐
│ Next: Page 4 │ Data...  │───>│ Next: NULL │ Data...    │
└─────────────────────────┘    └─────────────────────────┘
When a field's value is so large that the row doesn't fit in a single page, our database has two questions to answer:
- where, and in what format, should it store the data that doesn't fit in a page?
- what should it write in the B-tree cell?
The answer to the first question is quite elegant. The data that doesn't fit is split into multiple overflow pages, using the following structure: the first four bytes of every overflow page contain the page number of the next overflow page (or 0 if there are no more overflow pages), and the rest of the page contains the overflow data. Overflow pages therefore form a linked list.
As for the second question, the B-tree cell will contain the first N bytes of the complete payload (where N is calculated using a formula we'll explore in the following sections), followed by the page number of the first overflow page.
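Concretely, a table-leaf cell whose payload overflows is laid out as follows (a sketch based on the description above):

// layout of a table-leaf cell whose payload overflows:
//
//   varint   total payload size P
//   varint   rowid
//   N bytes  first N bytes of the payload
//   4 bytes  big-endian page number of the first overflow page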
With this quick overview out of the way, let's dive into the implementation!
Rust edition
Some code in this post uses the new let chains feature from Rust 1.88.0. If you want to follow along, make sure to update your Cargo.toml to use the 2024 edition:
# Cargo.toml
[package]
# [...]
edition = "2024"
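If you haven't encountered them yet, let chains allow combining pattern matches and boolean conditions in a single if or while expression. A minimal, purely illustrative example:

// let chains: a pattern binding and a boolean condition in one `if`,
// without the nesting that `if let` alone would require
fn first_char_of_short_str(s: &str) -> Option<char> {
    if let Some(c) = s.chars().next()
        && s.len() <= 3
    {
        return Some(c);
    }
    None
}

fn main() {
    assert_eq!(first_char_of_short_str("hi"), Some('h'));
    assert_eq!(first_char_of_short_str("hello"), None);
}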
Erratum
If you're one of the early readers of the first post and coded along, a small mistake might have slipped into your code: the read_varint_at function - which decodes a variable-length integer - treats varints as little-endian, while SQLite uses big-endian encoding. Here is the corrected version:
pub fn read_varint_at(buffer: &[u8], mut offset: usize) -> (u8, i64) {
    let mut size = 0;
    let mut result = 0;

    while size < 9 {
        let current_byte = buffer[offset] as i64;

        if size == 8 {
            result = (result << 8) | current_byte;
        } else {
            result = (result << 7) | (current_byte & 0b0111_1111);
        }

        offset += 1;
        size += 1;

        if current_byte & 0b1000_0000 == 0 {
            break;
        }
    }

    (size, result)
}
You'll notice that we now shift the result accumulator at each iteration instead of shifting the current_byte. The ninth byte special case is also handled differently, as it is no longer the most significant byte.
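As a quick sanity check of the big-endian behaviour, the two-byte varint 0x81 0x00 should decode to 128, not 1 (the values below are illustrative):

#[test]
fn varint_is_big_endian() {
    // 0x81: continuation bit set, payload bits 0b0000001
    // 0x00: last byte, payload bits 0b0000000
    // big-endian result: (1 << 7) | 0 = 128
    assert_eq!(read_varint_at(&[0x81, 0x00], 0), (2, 128));
    // a single byte with the continuation bit clear decodes to itself
    assert_eq!(read_varint_at(&[0x7F], 0), (1, 127));
}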
Building the test database
To test our implementation, we'll need a test database containing rows with fields large enough to trigger the overflow mechanism. To create such a database, you can use the following commands:
sqlite3 test.db
sqlite> create table t1(id integer, value string);
sqlite> insert into t1(id, value) values (42, printf('%.*c', 10000, 'a'));
printf('%.*c', 10000, 'a') returns a string composed of 10,000 a's.
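Since a 10,000-character value can't fit in a single 4096-byte page, we can already sanity-check the result by looking at the file size; a SQLite database file is always a whole number of pages. A minimal sketch, assuming the default 4096-byte page size:

// quick check that test.db now spans several pages
use std::fs;

fn main() -> std::io::Result<()> {
    let len = fs::metadata("test.db")?.len();
    println!("{len} bytes = {} pages of 4096 bytes", len / 4096);
    Ok(())
}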
Computing the local payload size
As we mentioned, when reading a record that's too big to fit in a single page, our first task is to compute the local payload size. This is the number of bytes that will be stored directly in the B-tree leaf cell. To compute this size, we first need to define a few variables:
- P, the payload size
- U, the usable page size (which is the page size minus the number of reserved bytes)
- X = U - 35, the overflow threshold: if the payload size is less than or equal to X, it will be stored entirely in a B-tree leaf cell, without overflow
- M = ((U-12)*32/255)-23, the minimum local payload size
- K = M+((P-M)%(U-4)), a candidate local payload size
The local payload size is computed according to the following rules (a worked example follows the list):
- If P <= X, the local payload size is P and there is no overflow.
- Otherwise, there is an overflow, and:
  - If K <= X, the local payload size is K.
  - Otherwise, the local payload size is M.
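To make the formulas concrete, here is a worked example with hypothetical values: a 10,000-byte payload on a database with 4096-byte pages and no reserved bytes (so U = 4096):

// a worked example of the local payload size computation
fn main() {
    let p: usize = 10_000; // payload size
    let u: usize = 4_096; // usable page size

    let x = u - 35; // 4061: the overflow threshold
    let m = ((u - 12) * 32 / 255) - 23; // 489: minimum local payload size
    let k = m + ((p - m) % (u - 4)); // 1816

    assert!(p > x); // the payload overflows
    let local = if k <= x { k } else { m };
    assert_eq!(local, 1816);

    // the remaining 8184 bytes fill exactly two overflow pages of
    // u - 4 = 4092 usable bytes each: that's what the modulo buys us
    assert_eq!(p - local, 2 * (u - 4));
}

Note how the modulo in K ensures that, when K is used, the overflow data is an exact multiple of the usable overflow page size, so every overflow page is completely full.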
We'll start by implementing the computation of U, the usable page size. It differs from the page size because SQLite reserves a few bytes at the end of each page for use by extensions. The exact number of reserved bytes is defined by a one-byte integer in the database header, at offset 20.
First, we'll extend our DbHeader struct to include the number of reserved bytes:
// src/page.rs
#[derive(Debug, Copy, Clone)]
pub struct DbHeader {
    pub page_size: u32,
+   pub page_reserved_size: u8,
}

+impl DbHeader {
+   pub fn usable_page_size(&self) -> usize {
+       self.page_size as usize - (self.page_reserved_size as usize)
+   }
+}
We also added a utility method that returns the value of U, computed from the page size and the reserved bytes count.
Our db header parsing function must also be updated to populate the new field:
// src/pager.rs
+const HEADER_PAGE_RESERVED_SIZE_OFFSET: usize = 20;
// [...]
pub fn parse_header(buffer: &[u8]) -> anyhow::Result<page::DbHeader> {
    if !buffer.starts_with(HEADER_PREFIX) {
        let prefix = String::from_utf8_lossy(&buffer[..HEADER_PREFIX.len()]);
        anyhow::bail!("invalid header prefix: {prefix}");
    }

    let page_size_raw = read_be_word_at(buffer, HEADER_PAGE_SIZE_OFFSET);
    let page_size = match page_size_raw {
        1 => PAGE_MAX_SIZE,
        n if n.is_power_of_two() => n as u32,
        _ => anyhow::bail!("page size is not a power of 2: {}", page_size_raw),
    };

+   let page_reserved_size = buffer[HEADER_PAGE_RESERVED_SIZE_OFFSET];

-   Ok(page::DbHeader { page_size })
+   Ok(page::DbHeader {
+       page_size,
+       page_reserved_size,
+   })
}
With this in place, we can implement the local_payload_size method according to the formulas defined above:
// src/page.rs
// [...]
impl PageHeader {
    fn local_payload_size(
        &self,
        db_header: &DbHeader,
        payload_size: usize,
    ) -> anyhow::Result<usize> {
        match self.page_type {
            PageType::TableInterior => bail!("no payload size for interior pages"),
            PageType::TableLeaf => {
                let usable = db_header.usable_page_size();
                let max_size = usable - 35;

                if payload_size <= max_size {
                    return Ok(payload_size);
                }

                let min_size = ((usable - 12) * 32 / 255) - 23;
                let k = min_size + ((payload_size - min_size) % (usable - 4));
                let size = if k <= max_size { k } else { min_size };

                Ok(size)
            }
        }
    }

    pub fn local_and_overflow_size(
        &self,
        db_header: &DbHeader,
        payload_size: usize,
    ) -> anyhow::Result<(usize, Option<usize>)> {
        let local = self.local_payload_size(db_header, payload_size)?;

        if local == payload_size {
            Ok((local, None))
        } else {
            Ok((local, Some(payload_size.saturating_sub(local))))
        }
    }
}
For table-leaf pages, the implementation of local_payload_size is a straightforward translation of the formulas. For interior pages, we simply return an error: they do not contain an actual payload, so the concept of a local payload size does not apply. In future posts, we'll discover index pages, which will require us to implement a slightly altered version of the formulas.
We also defined a convenient method that computes the local payload size and the overflow size, if any.
How are we going to use this information? When we parse a B-tree leaf cell, we'll compute the local and overflow sizes. If the overflow size is not None, we know that a pointer to the first overflow page is stored right after the local payload, so we'll read this pointer and record it in the resulting Cell struct. We don't want to read the content of the overflow pages just yet, as we have no way to know if the query will actually need it.
Let's modify our Cell struct and the corresponding parsing function:
// src/page.rs
// [...]
#[derive(Debug, Clone)]
pub struct TableLeafCell {
    pub size: i64,
    pub row_id: i64,
    pub payload: Vec<u8>,
+   pub first_overflow: Option<usize>,
}
// src/pager.rs
// [...]
-fn parse_table_leaf_cell(mut buffer: &[u8]) -> anyhow::Result<page::Cell> {
+fn parse_table_leaf_cell(
+   db_header: &DbHeader,
+   header: &PageHeader,
+   mut buffer: &[u8],
+) -> anyhow::Result<page::Cell> {
    let (n, size) = read_varint_at(buffer, 0);
    buffer = &buffer[n as usize..];

    let (n, row_id) = read_varint_at(buffer, 0);
    buffer = &buffer[n as usize..];

+   let (local_size, overflow_size) = header.local_and_overflow_size(db_header, size as usize)?;
+   let first_overflow = overflow_size.map(|_| read_be_double_at(buffer, local_size) as usize);

-   let payload = buffer[..size as usize].to_vec();
+   let payload = buffer[..local_size].to_vec();

    Ok(page::TableLeafCell {
        size,
        row_id,
        payload,
+       first_overflow,
    }
    .into())
}
The modifications to parse_table_leaf_cell are straightforward: we leverage our utility method to compute the local and overflow sizes, as well as the first overflow page pointer, if any.
Note that we modified the function's signature to accept a reference to the database header, so you'll need to propagate this change to the caller, and adapt the signature of parse_table_interior_cell as well.
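If your code follows the structure from the previous posts, the updated call site might look roughly like this (a sketch: the parse_cell helper and the interior cell signature are assumptions, adapt them to your own code):

// src/pager.rs -- a hypothetical dispatch helper, updated to thread the
// database header and page header through to the cell parsers
fn parse_cell(
    db_header: &DbHeader,
    header: &PageHeader,
    buffer: &[u8],
) -> anyhow::Result<page::Cell> {
    match header.page_type {
        PageType::TableLeaf => parse_table_leaf_cell(db_header, header, buffer),
        PageType::TableInterior => parse_table_interior_cell(db_header, header, buffer),
    }
}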
We're not far from having a complete implementation of the overflow mechanism. What's left is to implement a way to read and parse the overflow pages, and exercise it when reading fields from a cursor.
Reading overflow pages
Before we implement the parsing, we'll create a type to represent an overflow page:
// src/page.rs
// [...]
#[derive(Debug, Clone)]
pub struct OverflowPage {
    pub next: Option<usize>,
    pub payload: Vec<u8>,
}
Parsing an overflow page is quite simple: the first four bytes contain the next overflow page pointer (or 0 if there are no more overflow pages), and the rest of the page contains the overflow payload:
// src/pager.rs
// [...]
fn parse_overflow_page(buffer: &[u8]) -> page::OverflowPage {
    let next = read_be_double_at(buffer, 0);

    page::OverflowPage {
        payload: buffer[4..].to_vec(),
        next: if next != 0 { Some(next as usize) } else { None },
    }
}
Since our Pager's cache expects to only store Page structs, we need to adapt it so it can read and cache either a Page or an OverflowPage.
First, we'll define a new enum to represent either a Page or an OverflowPage:
// src/pager.rs
// [...]
#[derive(Debug, Clone)]
enum CachedPage {
    Page(Arc<page::Page>),
    Overflow(Arc<page::OverflowPage>),
}

impl From<Arc<page::Page>> for CachedPage {
    fn from(value: Arc<page::Page>) -> Self {
        CachedPage::Page(value)
    }
}

impl TryFrom<CachedPage> for Arc<page::Page> {
    type Error = anyhow::Error;

    fn try_from(value: CachedPage) -> Result<Self, Self::Error> {
        if let CachedPage::Page(p) = value {
            Ok(p.clone())
        } else {
            bail!("expected a regular page")
        }
    }
}

impl From<Arc<page::OverflowPage>> for CachedPage {
    fn from(value: Arc<page::OverflowPage>) -> Self {
        CachedPage::Overflow(value)
    }
}

impl TryFrom<CachedPage> for Arc<page::OverflowPage> {
    type Error = anyhow::Error;

    fn try_from(value: CachedPage) -> Result<Self, Self::Error> {
        if let CachedPage::Overflow(o) = value {
            Ok(o.clone())
        } else {
            bail!("expected an overflow page")
        }
    }
}
It's a simple enum with two variants, plus a few conversion traits that allow our Pager to handle both types seamlessly.
Then, we need to adapt our Pager to support both types. We'll extract the bulk of the logic into a new generic load method that takes a page number and a parsing function, and parses uncached pages with the provided function. Here is the updated Pager implementation:
// src/pager.rs
// [...]
#[derive(Debug)]
pub struct Pager<I: Read + Seek = std::fs::File> {
    input: Arc<Mutex<I>>,
    pages: Arc<RwLock<HashMap<usize, CachedPage>>>,
    header: DbHeader,
}

impl<I: Read + Seek> Pager<I> {
    pub fn new(header: DbHeader, input: I) -> Self {
        Self {
            input: Arc::new(Mutex::new(input)),
            pages: Arc::default(),
            header,
        }
    }

    pub fn read_overflow(&self, n: usize) -> anyhow::Result<Arc<page::OverflowPage>> {
        self.load(n, |buffer| Ok(parse_overflow_page(buffer)))
    }

    pub fn read_page(&self, n: usize) -> anyhow::Result<Arc<page::Page>> {
        self.load(n, |buffer| parse_page(&self.header, buffer, n))
    }

    fn load<T>(&self, n: usize, f: impl Fn(&[u8]) -> anyhow::Result<T>) -> anyhow::Result<Arc<T>>
    where
        Arc<T>: Into<CachedPage>,
        CachedPage: TryInto<Arc<T>, Error = anyhow::Error>,
    {
        {
            let read_pages = self
                .pages
                .read()
                .map_err(|_| anyhow!("poisoned page cache lock"))?;

            if let Some(page) = read_pages.get(&n).cloned() {
                return page.try_into();
            }
        }

        let mut write_pages = self
            .pages
            .write()
            .map_err(|_| anyhow!("failed to acquire pager write lock"))?;

        if let Some(page) = write_pages.get(&n).cloned() {
            return page.try_into();
        }

        let buffer = self.load_raw(n)?;
        let parsed = f(&buffer[0..self.header.usable_page_size()])?;
        let ptr = Arc::new(parsed);
        write_pages.insert(n, ptr.clone().into());

        Ok(ptr)
    }

    fn load_raw(&self, n: usize) -> anyhow::Result<Vec<u8>> {
        let offset = n.saturating_sub(1) * self.header.page_size as usize;

        let mut input_guard = self
            .input
            .lock()
            .map_err(|_| anyhow!("poisoned pager mutex"))?;

        input_guard
            .seek(SeekFrom::Start(offset as u64))
            .context("seek to page start")?;

        let mut buffer = vec![0; self.header.page_size as usize];
        input_guard.read_exact(&mut buffer).context("read page")?;

        Ok(buffer)
    }
}
Putting it all together
The main building blocks of our implementation are in place: we now know how to detect when a row is too large to fit in a single page, and we have a way to load overflow pages through our Pager. The last step is to lazily read the overflow data when accessing a field that requires it.
To do this, we'll implement an OverflowScanner with a read method that takes as input the index of the first overflow page and the minimum amount of overflow data to read. The scanner will follow the linked list until the required amount of data is read or there are no more overflow pages.
// src/cursor.rs
// [...]
#[derive(Debug)]
struct OverflowScanner {
    pager: Pager,
}

impl OverflowScanner {
    pub fn new(pager: Pager) -> Self {
        Self { pager }
    }

    pub fn read(&self, first_page: usize, size: usize) -> anyhow::Result<(Option<usize>, Vec<u8>)> {
        let mut next_page = Some(first_page);
        let mut buffer = Vec::with_capacity(size);

        while buffer.len() < size
            && let Some(next) = next_page
        {
            let overflow = self.pager.read_overflow(next)?;
            next_page = overflow.next;
            buffer.extend_from_slice(&overflow.payload);
        }

        Ok((next_page, buffer))
    }
}
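Here is how the scanner might be exercised in isolation (illustrative: we assume pager wraps our test database and that page 3 is the first overflow page of our big row):

// illustrative usage of the scanner, under the assumptions above
fn dump_overflow(pager: Pager) -> anyhow::Result<()> {
    let scanner = OverflowScanner::new(pager);
    let (remaining, data) = scanner.read(3, 8_000)?;
    // either we gathered at least the requested 8_000 bytes,
    // or the chain ended early
    assert!(data.len() >= 8_000 || remaining.is_none());
    Ok(())
}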
Our Cursor will use this new scanner in the following way: when reading a field, we'll compute the end offset of the field (based on the field's offset and size). If that end offset exceeds the size of the currently loaded payload, we'll read (end_offset - payload.len()) bytes through the overflow scanner and append the result to the payload. If we wanted to read fields so large that they can't fit in RAM, we would need a more sophisticated streaming mechanism, but for our purposes, reading the overflow data into memory is enough.
We'll start by implementing a utility method to compute the end offset of a field:
// src/cursor.rs
// [...]
impl RecordField {
    pub fn end_offset(&self) -> usize {
        let size = match self.field_type {
            RecordFieldType::Null => 0,
            RecordFieldType::I8 => 1,
            RecordFieldType::I16 => 2,
            RecordFieldType::I24 => 3,
            RecordFieldType::I32 => 4,
            RecordFieldType::I48 => 6,
            RecordFieldType::I64 => 8,
            RecordFieldType::Float => 8,
            RecordFieldType::Zero => 0,
            RecordFieldType::One => 0,
            RecordFieldType::String(size) | RecordFieldType::Blob(size) => size,
        };

        self.offset + size
    }
}
Next, we'll implement the overflow reading logic in the Cursor:
// src/cursor.rs
// [...]
#[derive(Debug)]
pub struct Cursor {
    header: RecordHeader,
    payload: Vec<u8>,
+   pager: Pager,
+   next_overflow_page: Option<usize>,
}

impl Cursor {
-   pub fn owned_field(&self, n: usize) -> Option<OwnedValue> {
-       self.field(n).map(Into::into)
-   }
+   pub fn owned_field(&mut self, n: usize) -> anyhow::Result<Option<OwnedValue>> {
+       Ok(self.field(n)?.map(Into::into))
+   }

-   pub fn field(&self, n: usize) -> Option<Value> {
+   pub fn field(&mut self, n: usize) -> anyhow::Result<Option<Value>> {
-       let record_field = self.header.fields.get(n)?;
+       let Some(record_field) = self.header.fields.get(n) else {
+           return Ok(None);
+       };

+       let end_offset = record_field.end_offset();
+       if end_offset > self.payload.len()
+           && let Some(overflow_page) = self.next_overflow_page
+       {
+           let overflow_size = end_offset.saturating_sub(self.payload.len());
+           let (next_overflow, overflow_data) = OverflowScanner::new(self.pager.clone())
+               .read(overflow_page, overflow_size)
+               .context("read overflow page")?;
+           self.next_overflow_page = next_overflow;
+           self.payload.extend_from_slice(&overflow_data);
+       }

-       match record_field.field_type {
+       let value = match record_field.field_type {
            RecordFieldType::Null => Some(Value::Null),
            RecordFieldType::I8 => Some(Value::Int(read_i8_at(&self.payload, record_field.offset))),
            RecordFieldType::I16 => {
                Some(Value::Int(read_i16_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I24 => {
                Some(Value::Int(read_i24_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I32 => {
                Some(Value::Int(read_i32_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I48 => {
                Some(Value::Int(read_i48_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::I64 => {
                Some(Value::Int(read_i64_at(&self.payload, record_field.offset)))
            }
            RecordFieldType::Float => Some(Value::Float(read_f64_at(
                &self.payload,
                record_field.offset,
            ))),
            RecordFieldType::String(length) => {
                let value = std::str::from_utf8(
                    &self.payload[record_field.offset..record_field.offset + length],
                )
                .expect("invalid utf8");
                Some(Value::String(Cow::Borrowed(value)))
            }
            RecordFieldType::Blob(length) => {
                let value = &self.payload[record_field.offset..record_field.offset + length];
                Some(Value::Blob(Cow::Borrowed(value)))
            }
            RecordFieldType::One => Some(Value::Int(1)),
            RecordFieldType::Zero => Some(Value::Int(0)),
-       }
+       };

+       Ok(value)
    }
}
Note that we added fields to the Cursor struct and slightly modified its method signatures. These changes will need to be propagated to the consumers of the Cursor struct.
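For instance, a row-printing helper from the earlier posts would now need a mutable cursor and error propagation; a hypothetical sketch (print_row and column_count are illustrative names):

// a hypothetical consumer, adapted to the new signatures
fn print_row(cursor: &mut Cursor, column_count: usize) -> anyhow::Result<()> {
    for i in 0..column_count {
        // owned_field now takes &mut self and returns a Result
        let value = cursor.owned_field(i)?;
        println!("{value:?}");
    }
    Ok(())
}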
Conclusion
This concludes our implementation of the overflow mechanism. Our little database can now read large TEXT and BLOB fields that are split across multiple pages.
In the next post, we'll get back to the query engine and implement simple WHERE clauses, allowing us to filter rows based on their content.