How We Built RisingWave on S3: A Deep Dive into S3-as-primary-storage Architecture


For years now, people have been hyping “S3 as primary storage.” You’ve seen the diagrams—separate compute and storage, store everything in the cloud, scale infinitely, etc. It sounds smooth. But if you’re actually building an enterprise-grade system that has to meet strong SLAs, you’ll find out fast: it’s not that simple.
At RisingWave, we took the hard road. From the very beginning, we built a distributed streaming database that uses S3 not just for backup—but as the only storage layer. Internal state, materialized views, operator outputs, recovery logs—it all lives in S3.
This one decision forced us to engineer around every limitation you can imagine: high-latency reads, per-request API costs, no appends, renames, or atomic multi-object operations, and the complete absence of traditional disk semantics.
But we did it. We built an architecture where S3-backed infrastructure can deliver the performance and consistency people expect from an in-memory system.
This blog lays out exactly how we pulled it off—step by step, mistake by mistake, and lesson by lesson. No sugarcoating, no hand-waving, and absolutely zero fluff.
S3 Is Object Storage. It's Not a Filesystem.
A lot of systems today were originally built on local disks or EBS volumes. Then someone comes along and says, "Let’s just move it to S3." Sounds simple. In reality? You’re in for a rewrite.
S3 isn’t a drop-in replacement for disk. It doesn’t support seek, append, or rename. Objects are immutable, there are no partial updates, and every read and write pays network latency. If your system relies on random access, synchronous small writes, or file-level atomicity—S3 will break it.
RisingWave was designed from day one to work with S3 as the primary store—not just as a dump target. That’s why everything, from our state model to our compaction strategy to our recovery mechanism, assumes that persistent state lives in object storage.
We write everything as immutable blobs—materialized views, intermediate state, and operator outputs. They’re logically organized by table ID and epoch, versioned, and never updated in place. Metadata lives in Postgres, which tracks where each object belongs and how it should be recovered.
This architecture makes compute nodes truly stateless. When a node crashes or gets replaced, it just asks Postgres where to start and pulls what it needs from S3. No local RocksDB, no coordination protocol. That’s how RisingWave achieves fault tolerance and elasticity—by designing for S3 from day one, not trying to retrofit it in later.
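To make the layout concrete, here is a minimal sketch of the idea: state lands as immutable, versioned objects keyed by table ID and epoch, and a restarting node rebuilds its view from metadata plus S3 alone. The key format, struct, and function names below are illustrative, not RisingWave’s actual on-disk schema.

```rust
/// Illustrative only: a versioned, immutable state object addressed by
/// (table_id, epoch, object_id). Nothing here is ever updated in place.
#[derive(Debug, Clone)]
struct StateObject {
    table_id: u32,
    epoch: u64,     // the checkpoint epoch this object belongs to
    object_id: u64, // unique within the epoch
}

impl StateObject {
    /// Deterministic S3 key: new epochs add new objects; old ones are
    /// only ever removed by compaction or GC, never overwritten.
    fn s3_key(&self) -> String {
        format!(
            "state/table_{}/epoch_{}/obj_{}.blob",
            self.table_id, self.epoch, self.object_id
        )
    }
}

/// Hypothetical recovery flow for a stateless compute node:
/// 1. ask the metadata store (Postgres) for the last committed epoch
///    and the objects that make up each table at that epoch,
/// 2. fetch only what the node's assigned operators actually need.
fn recover(assigned_tables: &[u32], catalog: &[StateObject], committed_epoch: u64) -> Vec<String> {
    catalog
        .iter()
        .filter(|o| o.epoch <= committed_epoch && assigned_tables.contains(&o.table_id))
        .map(|o| o.s3_key())
        .collect() // S3 keys to pull lazily; no local state survives restarts
}

fn main() {
    let catalog = vec![
        StateObject { table_id: 1, epoch: 41, object_id: 7 },
        StateObject { table_id: 1, epoch: 42, object_id: 8 },
        StateObject { table_id: 2, epoch: 42, object_id: 9 },
    ];
    // A replacement node assigned table 1 asks "what do I need?" and gets S3 keys back.
    for key in recover(&[1], &catalog, 42) {
        println!("fetch {key}");
    }
}
```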
The First Pain: Latency
In a streaming system, internal state is everything. Every operator—join, window, aggregation—depends on maintaining state across millions of events. But if you put that state directly on S3, you’re dead in the water.
S3 is slow. In the best case, you might see 30ms TTFB (Time to First Byte). In reality, you’re more likely to see 100–300ms, especially under load. Now imagine you’re processing 1,000 events per second and need to update or read state for each one. If every state read takes hundreds of milliseconds, your delay keeps growing, and your entire system backs up. Eventually it collapses.
Of course, you cache. Use memory. But memory is limited—most cloud machines top out at 32 to 64 GB unless you pay for the premium stuff. That’s nowhere near enough to support complex queries involving large windows or joins across massive streams.
So we introduced a middle layer: local disk. More specifically, NVMe SSD or EBS. This acts as a warm cache between RAM and S3.
Now, some people immediately dismiss EBS: “It’s way slower than local SSD.” That’s not always true—and it’s not that simple.
Here’s why EBS actually makes a lot of sense:
Not all instance types have local disks. On AWS, only specific instance families ship with NVMe, and those families are often oversized for workloads with modest CPU needs.
Disk sizing flexibility. With local SSD, you get whatever size the instance gives you. With EBS, you can scale disk size up or down independently. That means if a tenant is just running a small aggregation job, they don’t need to overpay for unused local disk.
Deployment portability. RisingWave runs anywhere—AWS, Azure, GCP, on-prem. Every environment has some form of EBS-style block storage. Local disk configs vary widely, making standardization harder.
Modern EBS is fast enough. With the right provisioning (like io2), EBS can deliver consistently high IOPS and throughput, which is perfectly sufficient for intermediate state caching.
We built Foyer, a smart cache that lives on this disk layer. It handles eviction, telemetry, and integrates tightly with RisingWave’s query engine. Memory, SSD, S3—each tier plays its role. And S3 becomes a true long-term, immutable, and cost-effective backing store—not something we ever hit during hot query paths unless we absolutely have to.
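Conceptually, the read path walks the tiers in order: memory first, then the disk cache, and only then S3. The sketch below is a simplified illustration of that lookup order, not Foyer’s actual API; the types and method names are invented for the example, and eviction is omitted.

```rust
use std::collections::HashMap;

type BlockId = u64;
type Block = Vec<u8>;

/// Illustrative tiered read path: memory, then the disk cache (NVMe or
/// EBS), and only then the object store. Names are invented for this
/// sketch and do not mirror Foyer's real interface.
struct TieredCache {
    memory: HashMap<BlockId, Block>,
    disk: HashMap<BlockId, Block>, // stand-in for an on-disk cache
}

impl TieredCache {
    fn get(&mut self, id: BlockId, fetch_from_s3: impl Fn(BlockId) -> Block) -> Block {
        if let Some(b) = self.memory.get(&id) {
            return b.clone(); // hot path: no I/O at all
        }
        if let Some(b) = self.disk.get(&id) {
            let b = b.clone();
            self.memory.insert(id, b.clone()); // promote to memory
            return b;
        }
        // Miss on both tiers: pay the S3 round trip once, then keep the
        // block warm on disk and in memory for subsequent reads.
        let b = fetch_from_s3(id);
        self.disk.insert(id, b.clone());
        self.memory.insert(id, b.clone());
        b
    }
}

fn main() {
    let mut cache = TieredCache { memory: HashMap::new(), disk: HashMap::new() };
    let from_s3 = |id: BlockId| format!("block-{id}").into_bytes();
    let first = cache.get(7, &from_s3);  // cold: hits S3
    let second = cache.get(7, &from_s3); // warm: served from memory
    assert_eq!(first, second);
}
```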
Compaction Has to Be Remote
When your internal state lives across memory, disk, and S3, you need a mechanism to reorganize that data—especially on disk and S3—so reads stay fast and storage doesn’t explode. That’s what compaction does.
RisingWave uses an LSM-tree style architecture to manage this tiered storage. But instead of doing compaction locally like RocksDB does—on the same machine that’s handling user queries—we split it out entirely. RocksDB-style local compaction often creates resource contention: if your query is heavy and compaction kicks in, the two fight over CPU and I/O. Your system stutters. In a streaming workload, that’s fatal.
RisingWave solves this with remote compaction. Dedicated compaction workers pull state data from S3, rewrite and merge files, and push the results back—without touching the compute node. Query latency stays smooth, and compute nodes stay focused.
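At its core, a compaction task is just “read some immutable runs, merge them, write a new run, report back,” and none of it has to happen on the machine serving queries. Below is a stripped-down sketch of that loop, with invented types standing in for the real task protocol and an in-memory map standing in for S3.

```rust
use std::collections::HashMap;

/// Illustrative compaction task handed to a dedicated worker: merge a
/// set of overlapping sorted runs (identified by S3 keys) into one new
/// run, entirely off the query-serving compute nodes.
struct CompactionTask {
    input_keys: Vec<String>, // existing immutable objects, oldest first
    output_key: String,      // where the merged result would be uploaded
}

/// `store` stands in for S3: object key -> sorted (key, value) pairs.
/// Later inputs are assumed newer, so their values win on duplicates.
fn compact(task: &CompactionTask, store: &HashMap<String, Vec<(String, String)>>) -> Vec<(String, String)> {
    let mut merged: Vec<(String, String)> = Vec::new();
    for input in &task.input_keys {
        for (k, v) in store.get(input).cloned().unwrap_or_default() {
            match merged.binary_search_by(|(mk, _)| mk.cmp(&k)) {
                Ok(i) => merged[i].1 = v,           // newer value overwrites
                Err(i) => merged.insert(i, (k, v)), // keep the run sorted
            }
        }
    }
    merged // real worker: encode, upload to task.output_key, then ack to the meta service
}

fn main() {
    let mut store: HashMap<String, Vec<(String, String)>> = HashMap::new();
    store.insert("run_a".to_string(), vec![("k1".into(), "old".into()), ("k2".into(), "x".into())]);
    store.insert("run_b".to_string(), vec![("k1".into(), "new".into()), ("k3".into(), "y".into())]);

    let task = CompactionTask {
        input_keys: vec!["run_a".into(), "run_b".into()],
        output_key: "run_merged".into(),
    };
    let merged = compact(&task, &store);
    println!("would upload {} entries to {}", merged.len(), task.output_key);
    assert_eq!(merged[0], ("k1".to_string(), "new".to_string()));
}
```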
Still, dedicating full machines just for compaction can be wasteful, especially when workloads are spiky. So in RisingWave Cloud, we went one step further—we made compaction serverless. Different tenants and jobs share a compaction pool that scales automatically, so you get elasticity and efficiency without operational overhead.
For private deployments, you can still choose to run lightweight compactor nodes—small, efficient boxes that quietly handle the dirty work in the background.
This design—memory for speed, EBS or SSD for buffer, S3 for persistence, and remote compaction for reorganization—is how we keep RisingWave fast and stable, no matter the workload.
Not All S3 Access Is Equal
S3 pricing isn’t just about how much data you store—it’s about how often and how inefficiently you access it. Every single API call costs money. Doesn’t matter if it’s GET, PUT, LIST, or HEAD. If your system hits S3 a million times a day with tiny reads, you’ll get a surprise AWS bill that’s larger than your actual storage cost.
We’ve seen this in the wild. S3 charges about $0.0004 per 1,000 GET requests, which sounds negligible until you do the math: a job that reads state from S3 for every event at 1,000 events per second issues roughly 2.6 billion GETs a month, which is over $1,000 in request charges alone. Multiply that by a few tenants, and you’re staring down thousands of dollars just to read your own data.
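The arithmetic is worth spelling out. A back-of-envelope sketch, assuming AWS’s published $0.0004 per 1,000 GET requests (your region and storage tier may differ):

```rust
/// Back-of-envelope S3 request cost for a read-per-event workload.
/// Pricing assumption: $0.0004 per 1,000 GET requests (S3 Standard,
/// us-east-1 at the time of writing; check your own bill).
fn monthly_get_cost(events_per_sec: f64, gets_per_event: f64) -> f64 {
    let gets_per_month = events_per_sec * gets_per_event * 86_400.0 * 30.0;
    gets_per_month / 1_000.0 * 0.0004
}

fn main() {
    // One GET per event at 1,000 events/s: ~2.6 billion GETs a month.
    println!("naive:   ${:.0}/month", monthly_get_cost(1_000.0, 1.0)); // ~ $1,037
    // Packing ~100 rows into each block read instead of row-at-a-time GETs.
    println!("blocked: ${:.0}/month", monthly_get_cost(1_000.0, 0.01)); // ~ $10
}
```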
So we rewired the system to avoid this trap:
Block-level reads: We pack logical rows into 4MB blocks—the sweet spot for minimizing S3 GET costs while avoiding excessive prefetch. These can be fetched in a single request.
Sparse index: Every table, every view, every materialized result comes with a tiny index that maps logical key ranges to S3 object keys and byte offsets. These indexes live in memory or SSD and let us skip blindly probing S3.
Prefetching: During query planning, RisingWave analyzes scan patterns and fetches blocks in advance. If your query will need blocks 1–10, we’ll grab 2–4 in the background while returning 1.
In production, this setup cuts S3 roundtrips by 80–90% and keeps our per-query cost under control. It’s the only way to make S3 usage sustainable in a real-time system.
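Putting the first two pieces together, the lookup path is roughly: consult the in-memory sparse index, find the one block that can contain the key, then issue a single ranged GET for that block. A simplified sketch with invented types; the real index format carries more metadata than shown here.

```rust
/// One entry of the sparse index: the smallest key in a ~4 MB block and
/// where that block lives inside an S3 object. Invented layout for
/// illustration only.
struct IndexEntry {
    first_key: String,
    object_key: String, // which S3 object
    offset: u64,        // byte offset of the block inside the object
    len: u64,           // block length (~4 MB)
}

/// Find the single block that may contain `key`: the last entry whose
/// first_key <= key. Entries are sorted by first_key.
fn locate<'a>(index: &'a [IndexEntry], key: &str) -> Option<&'a IndexEntry> {
    index
        .iter()
        .take_while(|e| e.first_key.as_str() <= key)
        .last()
}

fn main() {
    let index = vec![
        IndexEntry { first_key: "a".into(), object_key: "table_1/obj_8".into(), offset: 0, len: 4 << 20 },
        IndexEntry { first_key: "m".into(), object_key: "table_1/obj_8".into(), offset: 4 << 20, len: 4 << 20 },
    ];
    if let Some(e) = locate(&index, "p") {
        // One ranged GET instead of many tiny ones, i.e. an HTTP header
        // like `Range: bytes=<offset>-<offset+len-1>` on the object.
        println!(
            "GET {} Range: bytes={}-{}",
            e.object_key,
            e.offset,
            e.offset + e.len - 1
        );
    }
}
```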
Real-Time Freshness and Consistency
You can’t fake freshness in a streaming system. Nobody wants to run a query and get results from 30 minutes ago. Users expect live data, and they expect it now. That means you can’t rely on periodic batch flushes—you need immediate consistency.
So here’s what we built:
Read-after-write consistency across the board
Low-latency propagation from write path to query nodes
Aggressive cache invalidation to keep data clean
Every time a write lands, it triggers an event that invalidates any affected blocks—both in memory and on disk. Materialized views are updated incrementally, so users don’t wait for full recomputation. Cold data is evicted fast, and stale snapshots don’t linger.
This architecture guarantees that if an event hits the system, any downstream query will reflect that update—within seconds. Not eventually. Immediately. That’s what makes RisingWave feel live.
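A barebones way to picture the invalidation part: each write commits at an epoch, and the commit drops any cached blocks it overlaps before queries at the new epoch are admitted. The types and flow below are a sketch of the idea, not the actual implementation.

```rust
use std::collections::{HashMap, HashSet};

type BlockId = u64;

/// Sketch of the freshness bookkeeping: queries read at the latest
/// committed epoch, and a commit invalidates every cached block the
/// write touched before that epoch becomes visible.
struct FreshnessTracker {
    committed_epoch: u64,
    cache: HashMap<BlockId, Vec<u8>>, // stand-in for the memory/disk tiers
}

impl FreshnessTracker {
    /// Called when a batch of changes lands: bump the epoch only after
    /// the affected blocks have been dropped, so a query at the new
    /// epoch can never observe a stale cached block.
    fn commit(&mut self, epoch: u64, touched_blocks: &HashSet<BlockId>) {
        for id in touched_blocks {
            self.cache.remove(id); // next read re-fetches (or is re-warmed)
        }
        self.committed_epoch = epoch;
    }

    /// Queries pin the epoch they read at; anything committed at or
    /// before this epoch is guaranteed visible.
    fn read_epoch(&self) -> u64 {
        self.committed_epoch
    }
}

fn main() {
    let mut t = FreshnessTracker { committed_epoch: 41, cache: HashMap::new() };
    t.cache.insert(7, b"stale join state".to_vec());

    let touched: HashSet<BlockId> = [7].into_iter().collect();
    t.commit(42, &touched);

    assert_eq!(t.read_epoch(), 42);
    assert!(t.cache.get(&7).is_none()); // stale block gone before epoch 42 is readable
}
```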
Multi-Tenancy and Isolation
S3 fundamentally changes how you think about multi-tenancy. In traditional architectures, isolation often means spinning up separate clusters per tenant—each with its own disk, cache, and storage tier. But when everything is persisted to S3, the game changes.
Because RisingWave puts all persistent state in S3, tenants don’t compete for local disk. Each query, each materialized view, each table scan pulls from the same object store, backed by its own logical namespace. We don’t have to carve up disks or worry about one tenant thrashing another’s storage buffers.
This gave us the foundation to design real isolation at the compute layer:
Per-query CPU limits, enforced by the execution engine
Memory quotas per tenant, with eviction strategies that tie into our SSD and S3 hierarchy
Independent I/O budgeting, so one noisy workload doesn’t overwhelm S3 prefetch or SSD cache
At the same time, all tenants access a shared metadata catalog. That means teams can collaborate on the same streams and materialized views, without duplicating data or stepping on each other’s performance.
With S3 as the consistent source of truth and RisingWave’s stateless execution layer, multi-tenancy becomes not just doable—but clean, cheap, and stable.
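In practice, isolation at the compute layer boils down to budget accounting: each tenant gets CPU, memory, and I/O budgets, and admitting new work checks against them. A hypothetical shape for such a policy; none of these names come from RisingWave’s configuration.

```rust
/// Hypothetical per-tenant resource budget; field names are invented
/// for illustration and are not RisingWave configuration keys.
#[derive(Debug, Clone, Copy)]
struct TenantQuota {
    cpu_millicores: u64,          // CPU share enforced by the execution engine
    memory_bytes: u64,            // eviction kicks in before this is exceeded
    s3_prefetch_ops_per_sec: u64, // keeps one tenant from starving the prefetcher
}

/// Current usage as measured by the scheduler and cache layer.
#[derive(Debug, Default, Clone, Copy)]
struct TenantUsage {
    cpu_millicores: u64,
    memory_bytes: u64,
    s3_prefetch_ops_per_sec: u64,
}

/// Admission check: new work is only scheduled if the tenant stays
/// inside all three budgets; otherwise it queues (or triggers eviction).
fn admit(quota: TenantQuota, usage: TenantUsage, extra_cpu: u64, extra_mem: u64) -> bool {
    usage.cpu_millicores + extra_cpu <= quota.cpu_millicores
        && usage.memory_bytes + extra_mem <= quota.memory_bytes
        && usage.s3_prefetch_ops_per_sec <= quota.s3_prefetch_ops_per_sec
}

fn main() {
    let quota = TenantQuota { cpu_millicores: 4_000, memory_bytes: 8 << 30, s3_prefetch_ops_per_sec: 500 };
    let usage = TenantUsage { cpu_millicores: 3_500, memory_bytes: 6 << 30, s3_prefetch_ops_per_sec: 120 };
    // A query asking for another 1,000 millicores would push this tenant
    // over its CPU budget, so it waits instead of degrading its neighbors.
    assert!(!admit(quota, usage, 1_000, 1 << 30));
    assert!(admit(quota, usage, 400, 1 << 30));
}
```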
Cache Hydration and Peer Transfer
S3 is great for durability, but it’s not where you want your hot data coming from during live query traffic. That’s why we built RisingWave to proactively hydrate cache when updates happen.
Whenever a write lands—say a new batch of CDC events or an updated materialized view fragment—we notify the read path immediately. That triggers prefetch logic that warms up the disk and memory cache, so when the next query hits, it doesn’t have to wait for an S3 GET.
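One way to picture the hydration step: the write path publishes the IDs of the blocks it just produced, and a background task on each read node pulls them into the warm tiers before the next query asks for them. A minimal sketch using a plain channel and thread; the real notification path is RisingWave’s internal RPC, and the names here are invented.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type BlockId = u64;

fn main() {
    // The write path announces which blocks just changed...
    let (notify_tx, notify_rx) = mpsc::channel::<BlockId>();

    // ...and a background hydrator warms the cache before queries arrive.
    let hydrator = thread::spawn(move || {
        let mut warm_cache: HashMap<BlockId, Vec<u8>> = HashMap::new();
        for block_id in notify_rx {
            // Stand-in for "fetch the fresh block from S3 (or, later, a peer)".
            let bytes = format!("contents of block {block_id}").into_bytes();
            warm_cache.insert(block_id, bytes);
        }
        warm_cache.len() // how many blocks were pre-warmed
    });

    // Write path: a new materialized view fragment lands, producing blocks 7 and 8.
    for produced in [7u64, 8u64] {
        notify_tx.send(produced).expect("hydrator alive");
    }
    drop(notify_tx); // closing the channel ends the hydrator loop

    let warmed = hydrator.join().expect("hydrator finished");
    println!("pre-warmed {warmed} blocks before any query touched S3");
}
```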
But there’s still room to go further. In the future, we plan to support peer-to-peer cache hydration—where one node can share hot blocks directly with another. Instead of every read node independently reaching out to S3, they’ll be able to fetch the updated block from a nearby peer that just wrote it. Think of it as an internal CDN, optimized for low-latency event propagation.
This kind of mechanism is critical in a world where compute is elastic and S3 is expensive to hit under pressure. The less you rely on object store for hot reads, the better your system behaves at scale.
Final Thoughts: It’s Worth It
Building a streaming database that uses S3 as primary storage is not just hard—it’s brutal. You have to rethink every layer of the system: caching, state management, compaction, query freshness, coordination—all of it. And none of the existing assumptions about disk, memory, or file I/O apply.
But if you pull it off, the payoff is massive:
You can restart any node instantly, without data loss
You scale naturally across regions and clouds
You don’t worry about disks filling up or SSD failures
You get full elasticity and durability at object storage prices
And best of all: you can still hit memory-speed performance thanks to smart caching and remote compaction
We’ve built a system where S3 handles cold persistence, EBS or local disk handles warm state, and memory handles the hot path—coordinated by a compaction engine that reorganizes data behind the scenes. The result feels like an in-memory system, but operates with cloud economics.
If you want to see what that looks like in production, give it a spin. risingwave.com
We’ve spent four years building this architecture so you don’t have to.