Building KyroDB: An AI‑Native Database Kernel for the Next Decade of Data

Kishan Kumar
6 min read

Hey folks, I am Kishan. Let’s talk about databases today. The database landscape is evolving rapidly, from static, one-size-fits-all engines to AI-native systems that learn, adapt, and handle vector workloads natively. Think about it: With the vector DB market projected to hit $16B by 2034, we need kernels that consolidate ingestion, state, and semantic search without the sprawl.

KyroDB is my attempt to build that future from the kernel up: a durable, crash-recoverable event log with real-time streaming, a minimal SQL layer, first-class vector search, and a path to learned indexes (Recursive Model Indexes, or RMI) and Approximate Nearest Neighbor (ANN) search via HNSW. It is an AI-native database kernel that's designed from the ground up to fuse self-optimizing AI components with native support for AI workloads. Imagine a database that not only stores your data but learns from it in real-time, auto-tunes itself, and handles hybrid workloads like a pro, bridging key-value stores, vector search, and even future relational ops.

In this deep dive, I'll cover the problems KyroDB solves, what's built today, where it’s heading, the architecture, a code tour, how to run it, design trade-offs, the MVP roadmap, and why it matters. By the end, you'll know exactly how to spin it up and contribute. If you're into Rust, Go, machine learning, or just curious about next-gen databases, stick around. And hey, the entire project is open-source on GitHub – check it out here and drop a star if it sparks your interest! Let's jump in!

Problems KyroDB Addresses

Modern AI apps face real headaches:

  1. Operational sprawl: Most AI apps juggle Kafka (for ingestion), a DB (state), and a vector store (semantic search). KyroDB consolidates durable ingestion, key-value (KV) state, and vector search into one kernel and API.

  2. Read latency and memory cost: Traditional indexes hit limits at scale. Learned indexes (like RMI) can reduce memory and accelerate lookups by learning the Cumulative Distribution Function (CDF) of keys.

In our case, the CDF works as follows:

  • We model the keys' cumulative distribution function F(k) ≈ rank(k)/N, so the predicted position is pos ≈ F(k)·N, i.e., where that key should sit in a sorted array. RMI learns this monotonic mapping with a root model plus leaf models.

  • At query time, we predict the position, then probe within ±ε (the model's error bound) to locate the true key.

  • This replaces or augments a B‑Tree: less memory and faster point lookups on skewed or smooth key distributions. (A short lookup sketch follows this list.)

  3. Vector search barriers: High-dimensional embeddings for RAG or recommendations need simple APIs, but many solutions lack persistence or easy tuning. KyroDB exposes simple HTTP/SQL for exact L2 and feature‑gated ANN, with a roadmap for performance and persistence.
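
To make the predict-then-probe idea concrete, here is a minimal, self-contained Rust sketch of a learned-index point lookup over a sorted key array. The LinearModel type, its fields, and the ε handling are my own illustration of the general technique, not KyroDB's actual code:

// A linear model approximates the CDF; a bounded probe around the
// predicted slot finds the true key.
struct LinearModel {
    slope: f64,
    intercept: f64,
    max_err: usize, // epsilon: worst-case |predicted - actual| observed at build time
}

impl LinearModel {
    fn predict(&self, key: u64, len: usize) -> usize {
        let pos = self.slope * key as f64 + self.intercept;
        pos.round().clamp(0.0, (len - 1) as f64) as usize
    }
}

// Probe within +/- epsilon of the predicted position, then binary-search that window.
fn lookup(keys: &[u64], model: &LinearModel, key: u64) -> Option<usize> {
    let guess = model.predict(key, keys.len());
    let lo = guess.saturating_sub(model.max_err);
    let hi = (guess + model.max_err + 1).min(keys.len());
    keys[lo..hi].binary_search(&key).ok().map(|i| lo + i)
}

fn main() {
    // Sorted keys; in a real engine these would map to offsets in the log.
    let keys: Vec<u64> = (0..1_000u64).map(|i| i * 3).collect();
    // For this synthetic data the CDF is exactly pos = key / 3.
    let model = LinearModel { slope: 1.0 / 3.0, intercept: 0.0, max_err: 2 };
    assert_eq!(lookup(&keys, &model, 42), Some(14));
    println!("found key 42 at position {:?}", lookup(&keys, &model, 42));
}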

What exists today: Current Features

KyroDB is already a working prototype, built in Rust for the engine (performance) and Go for orchestration. Key highlights:

  • Durable Storage: Write-Ahead Log (wal.bin) + snapshots (snapshot.bin) for crash recovery. Idempotent appends, bounded serialization, and WAL truncation post-snapshot keep it efficient.

  • Minimal SQL Interface: INSERT/SELECT for KV and vectors. Vectors support QUERY=[…] with optional MODE='ANN'.

  • KV Model: Record { key, value } with embedded schema versions for forward compatibility.

  • Indexing:

    • Default: In-memory B-Tree via PrimaryIndex.

    • RMI Scaffolding (feature learned-index): Delta writes for updates, autoload swap, and a /rmi/build endpoint that generates a stub index file (index-rmi.bin).

  • Vector Search:

    • Exact L2 distance scan (a brute-force sketch follows this feature list).

    • ANN (feature ann-hnsw): Lazy in-memory HNSW using the hora crate; triggered via SQL MODE='ANN'.

  • Real-Time: Replay ranges and subscribe via SSE for live streaming.

  • Observability: Prometheus metrics at /metrics (e.g., append counts, snapshot latencies).

  • Orchestrator: Go CLI (kyrodbctl) for health, offset, snapshot, SQL, lookup, vector ops, and RMI builds.
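
As a companion to the vector search bullet above, here is what the exact path boils down to conceptually: a brute-force squared-L2 scan with a top-k cut, which is exactly what ANN/HNSW exists to avoid at larger scale. The types and function names are illustrative, not the engine's real API:

// Exact search: squared L2 distance to every stored vector, return the k closest ids.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn exact_search(store: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = store
        .iter()
        .map(|(id, v)| (*id, l2_sq(v, query)))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let store = vec![
        (1u64, vec![0.1, 0.2, 0.3]),
        (2u64, vec![0.9, 0.8, 0.7]),
    ];
    let hits = exact_search(&store, &[0.1, 0.2, 0.31], 5);
    println!("{hits:?}"); // id 1 should come back first
}

This scan is O(n·dim) per query, which is fine for a prototype; the HNSW path trades exactness for sub-linear search as the dataset grows.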


Core architecture

  • API layer: Warp HTTP + minimal SQL.

  • Storage: WAL + snapshots; schema-versioned Events.

  • Indexing: PrimaryIndex with BTree now, RMI later.

  • Vector engine: exact L2; HNSW optional.

  • Observability: Prometheus.

  • Orchestrator: Go CLI.

Recovery model
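
The repo is the authoritative reference here, but in broad strokes the model implied by the storage layer above is the classic snapshot-plus-log one: on startup, load the latest snapshot.bin, rebuild in-memory state and the primary index from it, then replay wal.bin from the snapshot's offset; because appends are idempotent, replaying an already-applied entry is harmless, and once a new snapshot lands the WAL is truncated. A rough Rust sketch of that startup path, with illustrative names rather than KyroDB's actual types:

use std::collections::BTreeMap;

// Illustrative record types; KyroDB's real Event is schema-versioned.
struct Snapshot { last_offset: u64, state: BTreeMap<u64, Vec<u8>> }
struct WalEntry { offset: u64, key: u64, value: Vec<u8> }

// Crash recovery: load the snapshot, then replay WAL entries past its offset.
fn recover(snapshot: Option<Snapshot>, wal: Vec<WalEntry>) -> (BTreeMap<u64, Vec<u8>>, u64) {
    let (mut state, mut offset) = match snapshot {
        Some(s) => (s.state, s.last_offset),
        None => (BTreeMap::new(), 0),
    };
    for entry in wal {
        if entry.offset <= offset {
            continue; // already reflected in the snapshot
        }
        state.insert(entry.key, entry.value); // latest value per key; the index is rebuilt from this
        offset = entry.offset;
    }
    (state, offset)
}

fn main() {
    let snap = Snapshot { last_offset: 2, state: BTreeMap::from([(1, b"a".to_vec())]) };
    let wal = vec![WalEntry { offset: 3, key: 2, value: b"b".to_vec() }];
    let (state, offset) = recover(Some(snap), wal);
    println!("recovered {} keys up to offset {}", state.len(), offset);
}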


Guided tour of the code

Let's peek inside (from the repo's engine crate):

  • lib.rs: Core logic: PersistentEventLog for append/replay/subscribe/snapshot; helpers like append_kv, append_vector, search_vector_l2/ann; PrimaryIndex abstraction (BTree default, RMI gated) with lookup_key and a scan fallback (a sketch of this abstraction follows the list).

  • sql.rs: Uses sqlparser-rs for minimal parsing: INSERT for KV/vectors, SELECT by key or QUERY=[…] with MODE='ANN'.

  • main.rs: HTTP routes via Warp: /append, /replay, /subscribe, /put, /lookup, /sql, /vector/insert, /vector/search, /rmi/build (feature-gated), plus /offset, /health, /snapshot, and /metrics.
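
Since the PrimaryIndex abstraction is the seam that makes the BTree-to-RMI swap possible, here is roughly what such a seam can look like. The trait definition and signatures below are assumptions for illustration; only the names lookup_key, the BTree default, and the feature-gated RMI come from the description above:

use std::collections::BTreeMap;

// Illustrative version of the PrimaryIndex seam: key -> log offset.
trait PrimaryIndex {
    fn insert(&mut self, key: u64, offset: u64);
    fn lookup_key(&self, key: u64) -> Option<u64>;
}

// Default implementation: a plain in-memory B-Tree.
struct BTreeIndex(BTreeMap<u64, u64>);

impl PrimaryIndex for BTreeIndex {
    fn insert(&mut self, key: u64, offset: u64) {
        self.0.insert(key, offset);
    }
    fn lookup_key(&self, key: u64) -> Option<u64> {
        self.0.get(&key).copied()
    }
}

// With the learned-index feature, an RMI-backed type implementing the same
// trait could be swapped in behind this interface (see the overlay sketch later).
fn main() {
    let mut idx: Box<dyn PrimaryIndex> = Box::new(BTreeIndex(BTreeMap::new()));
    idx.insert(42, 7);
    assert_eq!(idx.lookup_key(42), Some(7));
    println!("key 42 -> offset {:?}", idx.lookup_key(42));
}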


How to run

Prerequisites: Rust, Go.

Basic

RUST_LOG=kyrodb=info,warp=info cargo run -p engine -- serve 127.0.0.1 3030

With ANN:

RUST_LOG=kyrodb=info,warp=info cargo run -p engine --features ann-hnsw -- serve 127.0.0.1 3030

With learned-index:

RUST_LOG=kyrodb=info,warp=info cargo run -p engine --features learned-index -- serve 127.0.0.1 3030

Orchestrator:

cd orchestrator && go build -o kyrodbctl

KV (SQL over HTTP):

curl -s -X POST http://127.0.0.1:3030/sql -H 'Content-Type: application/json' -d '{"sql":"INSERT INTO t VALUES (42, '"'"'hello'"'"')"}'
curl -s -X POST http://127.0.0.1:3030/sql -H 'Content-Type: application/json' -d '{"sql":"SELECT * FROM t WHERE key = 42"}'

Vector (exact):

curl -s -X POST http://127.0.0.1:3030/sql -H 'Content-Type: application/json' -d '{"sql":"INSERT INTO vectors VALUES (1, [0.1,0.2,0.3])"}'
curl -s -X POST http://127.0.0.1:3030/sql -H 'Content-Type: application/json' -d '{"sql":"SELECT * FROM vectors WHERE QUERY = [0.1,0.2,0.31] LIMIT 5"}'

ANN: add AND MODE='ANN' to the SELECT above, e.g. SELECT * FROM vectors WHERE QUERY = [0.1,0.2,0.31] AND MODE='ANN' LIMIT 5 (requires the ann-hnsw feature).

RMI: POST /rmi/build (requires the learned-index feature).


Design choices and trade‑offs

  • WAL + Snapshot: Simple durability; truncation bounds WAL under high writes.

  • Bounded Serialization: Guards against corruption.

  • PrimaryIndex Abstraction: Easy BTree-to-RMI swap; deltas keep lookups available during rebuilds (see the overlay sketch below).

  • ANN Lazy Build: Dev-friendly for now; the roadmap adds persistence, tunable M/ef parameters, and background refreshes.

Trade-offs: the current in-memory focus prioritizes simplicity over distribution; the roadmap addresses that.
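
To illustrate the delta idea referenced in the PrimaryIndex bullet above: while a rebuild is in flight, new writes land in a small overlay map that is consulted before the immutable base index, and the rebuilt base is swapped in afterwards. A simplified sketch under those assumptions (my own illustration, not the repo's code):

use std::collections::BTreeMap;

// Recent writes live in `delta`; the last-built index serves everything else.
struct OverlayIndex {
    base: BTreeMap<u64, u64>,  // stand-in for the immutable learned index
    delta: BTreeMap<u64, u64>, // writes accumulated since the last build
}

impl OverlayIndex {
    fn insert(&mut self, key: u64, offset: u64) {
        self.delta.insert(key, offset); // never mutate the built index in place
    }
    fn lookup(&self, key: u64) -> Option<u64> {
        self.delta.get(&key).or_else(|| self.base.get(&key)).copied()
    }
    // Called when a background rebuild finishes: swap in the new base and
    // drop the delta entries it absorbed.
    fn swap_base(&mut self, rebuilt: BTreeMap<u64, u64>) {
        self.base = rebuilt;
        self.delta.clear();
    }
}

fn main() {
    let mut idx = OverlayIndex { base: BTreeMap::from([(1, 10)]), delta: BTreeMap::new() };
    idx.insert(2, 20); // served from the delta until the next rebuild
    assert_eq!(idx.lookup(2), Some(20));
    assert_eq!(idx.lookup(1), Some(10));
    idx.swap_base(BTreeMap::from([(1, 10), (2, 20)]));
    assert_eq!(idx.lookup(2), Some(20));
    println!("lookups stay available across rebuilds");
}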


Roadmap to MVP

  • RMI v1 (production path)

    • Train 2‑stage linear models (root + leaves) on sorted key→offset pairs (a small training sketch follows this roadmap).

    • Serialize coefficients + ε per leaf; bounded probe search path with delta overlay.

    • Rebuild/swap operations; metrics (hit rate, ε distribution, lookup p50/p99).

  • ANN v1

    • Expose ef_build/ef_search and M parameters.

    • Background rebuild/refresh and optional persistence.

    • Benchmarks on ANN‑Benchmarks subsets; recall/latency targets.

  • DDL/schema

    • Explicit CREATE VECTORS name (dim INT), per‑table dims, DROP.

  • Autotuner (one knob)

    • Heuristic loop to tune broadcast buffer/cache; metrics‑driven.

  • Compaction/GC

    • Latest‑value per key snapshotting; bounded WAL growth under continuous writes.

  • Security and packaging

    • Token auth, TLS termination, Docker image; optional Helm chart.

  • Samples/benchmark

    • KV microbench (YCSB‑like); vector microbench; semantic search demo.
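
To ground the RMI v1 item above, here is one way a two-stage linear RMI can be trained on sorted keys, including the per-leaf ε that the bounded probe relies on. This sketches the general technique with made-up parameters (fanout, synthetic keys); it is not KyroDB's planned implementation:

// One leaf of a 2-stage RMI: a least-squares line plus its worst-case error.
#[derive(Debug)]
struct Leaf { slope: f64, intercept: f64, max_err: usize }

// Fit y = slope*x + intercept by ordinary least squares.
fn fit_line(points: &[(f64, f64)]) -> (f64, f64) {
    let n = points.len() as f64;
    let (sx, sy): (f64, f64) = points.iter().fold((0.0, 0.0), |(a, b), (x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let num: f64 = points.iter().map(|(x, y)| (x - mx) * (y - my)).sum();
    let den: f64 = points.iter().map(|(x, _)| (x - mx) * (x - mx)).sum();
    let slope = if den == 0.0 { 0.0 } else { num / den };
    (slope, my - slope * mx)
}

// Stage 1 (root): a single line routes each key to one of `fanout` leaves.
// Stage 2 (leaves): each leaf fits its own line over the keys routed to it
// and records epsilon = max |predicted position - actual position|.
fn train_rmi(sorted_keys: &[u64], fanout: usize) -> ((f64, f64), Vec<Leaf>) {
    let n = sorted_keys.len();
    let root_pts: Vec<(f64, f64)> =
        sorted_keys.iter().enumerate().map(|(i, k)| (*k as f64, i as f64)).collect();
    let root = fit_line(&root_pts);

    let mut buckets: Vec<Vec<(f64, f64)>> = vec![Vec::new(); fanout];
    for (i, k) in sorted_keys.iter().enumerate() {
        let pred = root.0 * (*k as f64) + root.1; // predicted position in [0, n)
        let leaf = ((pred / n as f64) * fanout as f64).clamp(0.0, (fanout - 1) as f64) as usize;
        buckets[leaf].push((*k as f64, i as f64));
    }

    let leaves = buckets
        .iter()
        .map(|pts| {
            if pts.is_empty() {
                return Leaf { slope: 0.0, intercept: 0.0, max_err: 0 };
            }
            let (slope, intercept) = fit_line(pts);
            let max_err = pts
                .iter()
                .map(|(x, y)| (slope * x + intercept - y).abs())
                .fold(0.0f64, f64::max)
                .ceil() as usize;
            Leaf { slope, intercept, max_err }
        })
        .collect();
    (root, leaves)
}

fn main() {
    let keys: Vec<u64> = (0..10_000u64).map(|i| i * i / 50).collect(); // skewed distribution
    let (root, leaves) = train_rmi(&keys, 64);
    println!("root slope {:.4}, sample leaf {:?}, max eps {}",
             root.0, leaves[0],
             leaves.iter().map(|l| l.max_err).max().unwrap());
}

Persisting this structure is just a matter of serializing the root coefficients plus each leaf's (slope, intercept, ε), which is what the "coefficients + ε per leaf" item above refers to.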

Why this matters

For teams building AI/semantic search and event-driven systems:

  • Unified ingestion + storage: Durable, crash-safe log with real-time subscribe simplifies pipelines (no Kafka + DB + vector DB sprawl for small-to-mid scale).

  • Vector + SQL together: Exact and ANN search behind a SQL surface reduces glue code when building RAG/search features.

  • Performance/cost: A learned index can reduce memory and improve lookup latency; WAL + snapshot keeps the write path simple and robust.

  • Operational simplicity: One binary with metrics, snapshots, automatic WAL truncation, and soon autotuning, for lower operational overhead versus multi-system stacks.

In a world of agentic AI and exploding data, KyroDB aims to lower the barriers for RAG, analytics, and more.

💡
Try KyroDB today with exact search; flip feature flags for ANN and learned index. Star/fork the repo: here
💡
Email me at kishanvats2003@gmail.com or connect on Twitter, LinkedIn. What do you think of AI-native DBs? Comment below!
💡
NOTE: The project is under fast-paced development, and the repository gets changes pushed almost every day. For the most up-to-date information and the quickstart guide, refer to the project's README (available here).

I'll be launching the MVP in a week or two. Thank you very much for reading to the end. If you liked it, drop a like or comment (you can wave me a 👋 here), stay tuned, and keep learning.


Written by

Kishan Kumar, Engineer at Scale AI, building applications that scale and solve real-world problems.