Back-of-the-Envelope Calculations for Interviews

The conference room was cold, as they always are. Across the table sat a candidate, let's call her Sarah. Sharp, articulate, and flew through the coding rounds. Now, we were in the system design session. I sketched a simple box on the whiteboard. "Let's design a basic photo-sharing service," I began. "Something like a simplified Instagram. Users can upload photos and follow other users to see their feed. Let's start with the scale. Assume we have 100 million users and 10 million daily active users."
Sarah nodded, confidently grabbing a marker. She drew boxes for a load balancer, web servers, a database. Standard stuff. Then came the question I always ask. "Okay, looks like a reasonable start. Now, let's talk numbers. Roughly how much storage would you need for the photos alone, say, for the first year?"
A pause. The confident posture wavered. "Well," she started, "that would depend on the type of database we use, and the replication factor, and..." She trailed off.
"Assume a standard object store like S3," I prompted. "Just give me a rough, back-of-the-envelope number."
The silence stretched. She could design the components, she knew the patterns, but she couldn't ground them in reality. She couldn't connect the abstract boxes on the board to the physical constraints of servers, disks, and network pipes. She couldn't do the math.
This scene, or a variation of it, plays out constantly in interview rooms and, more dangerously, in architecture planning meetings. We, as an industry, have become incredibly adept at discussing complex patterns like microservices, event sourcing, and CQRS. Yet, we often fail to ask the most fundamental question: what is the magnitude of the problem we are solving? The common wisdom is to focus on the abstract architecture first. My thesis is that this is backward and dangerous. Back-of-the-envelope calculation is not a party trick for interviews; it is the primary tool for architectural validation, and the most potent weapon against the pervasive disease of over-engineering.
Unpacking the Hidden Complexity: The Physics of Software
The reluctance to engage with numbers is understandable. It feels messy, imprecise. "It depends" is a safe, intellectually honest answer. But it's also the beginning of an inquiry, not the end. An architect who cannot estimate is like a civil engineer who cannot estimate the load-bearing capacity of a steel beam. They can draw a beautiful blueprint, but they have no idea if the bridge will stand or collapse.
The naive approach is to believe that our infrastructure is infinitely scalable. A junior engineer sees AWS or Google Cloud as a magical abstraction layer that handles scale. A senior engineer knows it's just someone else's computers, and those computers are governed by the same laws of physics as the one on their desk. Latency is still bound by the speed of light. A CPU core can only execute so many instructions per second. A disk can only perform a finite number of IOPS.
This is the core of the problem: we've forgotten the physics of our craft. We discuss patterns without understanding their physical cost. The second-order effect is catastrophic. Teams choose globally-distributed databases for applications with a purely regional user base, incurring massive latency and monetary costs. They build complex, event-driven microservice architectures for problems that could be solved by a single, well-provisioned server and a monolith, drowning themselves in operational overhead and cognitive load.
Think of it like this: A master chef understands the fundamental properties of their ingredients. They know that fat carries flavor, acid cuts through richness, and heat transforms texture. They don't need a detailed recipe to know that a dish needs a squeeze of lemon or a knob of butter. They have an intuition, an ingrained understanding of the "physics" of cooking. Back-of-the-envelope calculations are our way of understanding the physics of software. They are the foundation of architectural intuition.
To build this intuition, you don't need to memorize a thousand numbers. You just need to internalize a few key orders of magnitude. These are the "primary ingredients" of system design.
The Numbers You Must Internalize
Keep these numbers in your head. They are your reference points for sanity-checking any design.
| Category | Operation | Typical Latency/Time | Analogy (Human Scale) |
|---|---|---|---|
| CPU/Memory | L1 Cache Reference | ~0.5 ns | Grabbing a tool from your belt |
| CPU/Memory | L2 Cache Reference | ~7 ns | Grabbing a tool from your toolbox |
| CPU/Memory | Main Memory (RAM) Reference | ~100 ns | Walking to a shelf in your garage |
| Storage | Read 1 MB sequentially from SSD | ~250 µs (0.25 ms) | Walking to your neighbor's house |
| Storage | Read 1 MB sequentially from HDD | ~1,000 µs (1 ms) | Walking to the corner store |
| Storage | Disk Seek (HDD) | ~10 ms | Driving across town |
| Networking | Round Trip within same Datacenter | ~500 µs (0.5 ms) | A quick flight to a nearby city |
| Networking | Round Trip USA to Europe | ~150 ms | A flight across the Atlantic Ocean |
Looking at this table, one thing becomes screamingly obvious: a network round trip is the great chasm. An operation that has to cross from California to the Netherlands is hundreds of millions of times slower than an operation that happens in a CPU's cache. This single fact should fundamentally shape how you think about distributed systems. Every network call you add is not a small cost; it's a monumental one.
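To feel the size of that chasm, a two-line calculation using the table's figures is enough:

```python
# How many L1 cache references fit into one transatlantic round trip?
# Figures taken from the latency table above.
L1_CACHE_NS = 0.5                    # ~0.5 ns per L1 cache reference
TRANSATLANTIC_RTT_NS = 150_000_000   # ~150 ms, expressed in nanoseconds

ratio = TRANSATLANTIC_RTT_NS / L1_CACHE_NS
print(f"One USA-Europe round trip ≈ {ratio:,.0f} L1 cache references")
# With these figures, the ratio is on the order of 3 × 10^8.
```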
The Power of Two and The Rule of 72
Beyond latency, you need a quick way to think about scale and growth. Forget complex formulas. Use powers of two for data sizes and the "Rule of 72" for growth.
Powers of Two: Know them up to a reasonable point.
- 2^10 = 1,024 ≈ 1 Kilo (Thousand)
- 2^20 ≈ 1 Mega (Million)
- 2^30 ≈ 1 Giga (Billion)
- 2^40 ≈ 1 Tera (Trillion)
- This helps you quickly translate bits into bytes, kilobytes, megabytes, and beyond. A 64-bit integer (8 bytes) for an ID seems small. But for a billion rows, that's 8 GB of just IDs.
The Rule of 72: A simple way to estimate doubling time for a system growing at a certain percentage.
Years to Double = 72 / (Annual Growth Rate %)
- If your data is growing at 20% per year, it will double in approximately 72 / 20 = 3.6 years. This is invaluable for capacity planning.
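A quick sketch comparing the Rule of 72 against the exact compound-growth formula shows how good the approximation is:

```python
import math

# Rule of 72: rough doubling time for a given annual growth rate,
# checked against the exact formula log(2) / log(1 + r).

def doubling_time_rule_of_72(annual_growth_pct: float) -> float:
    return 72 / annual_growth_pct

def doubling_time_exact(annual_growth_pct: float) -> float:
    return math.log(2) / math.log(1 + annual_growth_pct / 100)

for rate in (10, 20, 36):
    approx = doubling_time_rule_of_72(rate)
    exact = doubling_time_exact(rate)
    print(f"{rate}%/yr: rule of 72 -> {approx:.1f} yrs, exact -> {exact:.1f} yrs")
```

At 20% growth the rule gives 3.6 years versus an exact answer of about 3.8, which is close enough for capacity planning.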
Now, let's put these numbers to work.
The Pragmatic Solution: A Framework for Estimation
Thinking on your feet in an interview or a design meeting isn't about magic. It's about having a structured approach. When faced with a scaling question, don't panic. Follow a simple, repeatable framework. This framework forces you to ask the right questions and focus on the dominant constraints.
```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#e3f2fd", "primaryBorderColor": "#1976d2", "lineColor": "#333", "textColor": "#212121"}}}%%
flowchart TD
    subgraph "Phase 1: Clarify and Deconstruct"
        A[Start with the Prompt] --> B{What are the core nouns and verbs};
        B --> C[Clarify Ambiguous Requirements];
        C --> D[Identify Read vs Write Patterns];
    end
    subgraph "Phase 2: Estimate the Scale"
        D --> E[Estimate Queries Per Second QPS];
        E --> F[Estimate Data Size per Unit];
        F --> G[Calculate Daily Data Volume Ingress];
        G --> H[Project Total Storage Over Time];
    end
    subgraph "Phase 3: Calculate the Resources"
        H --> I{What is the main bottleneck};
        I -- Bandwidth --> J[Calculate Network Egress];
        I -- Storage --> K[Calculate Disk Space SSD vs HDD];
        I -- Compute CPU Memory --> L[Estimate Server Count];
    end
    subgraph "Phase 4: Sanity Check"
        J --> M[Review and State Assumptions];
        K --> M;
        L --> M;
    end
```
This diagram outlines a four-phase mental model for any back-of-the-envelope calculation. Phase 1 is about understanding the problem. Phase 2 is about quantifying the load. Phase 3 translates that load into physical resources, forcing you to identify the primary bottleneck. Phase 4 is the crucial final step of reviewing your work and clearly stating the assumptions you made, which is what separates a wild guess from a reasoned estimate.
Mini-Case Study: Designing "TinyLink," a URL Shortener
Let's apply this framework to a classic system design problem: a URL shortener.
The Prompt: Design a service that takes a long URL and returns a short, unique one.
Phase 1: Clarify and Deconstruct
- Core Nouns: User, Long URL, Short URL (or "link").
- Core Verbs: `createLink`, `getLink` (redirect).
- Clarifying Questions:
  - What is the expected traffic? Let's assume 100 million new links created per month.
  - What's the read/write ratio? URL shorteners are usually read-heavy. Let's assume a 10:1 read-to-write ratio.
  - How long do links need to last? Let's say forever.
  - Are custom URLs allowed? Let's say no for simplicity.
Phase 2: Estimate the Scale
This is where we do the math. Don't be afraid to use approximations. We're looking for the order of magnitude.
Write QPS (Queries Per Second):
- 100 million writes / month
- 100,000,000 / (30 days × 24 hours × 3,600 seconds/hour)
- 100,000,000 / (30 × 86,400) ≈ 100,000,000 / 2,592,000 ≈ **~40 writes/sec**
- This is an average. Peak traffic might be 2-3x higher, so let's plan for ~100 writes/sec.
Read QPS:
- A 10:1 read/write ratio means 10 × 40 writes/sec = **~400 reads/sec** on average.
- Let's plan for a peak of ~1000 reads/sec.
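The arithmetic above can be scripted in a few lines. The constants mirror the assumptions stated so far (100M writes/month, 10:1 read ratio); the 2.5x peak factor is an assumption in the middle of the 2-3x range:

```python
# Back-of-the-envelope QPS for TinyLink.
WRITES_PER_MONTH = 100_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6 million seconds
READ_WRITE_RATIO = 10
PEAK_FACTOR = 2.5                    # assume peaks run 2-3x the average

avg_write_qps = WRITES_PER_MONTH / SECONDS_PER_MONTH
avg_read_qps = avg_write_qps * READ_WRITE_RATIO

print(f"avg writes/sec:  {avg_write_qps:.0f}")                 # ~39 with these inputs
print(f"avg reads/sec:   {avg_read_qps:.0f}")                  # ~386 with these inputs
print(f"peak writes/sec: {avg_write_qps * PEAK_FACTOR:.0f}")
print(f"peak reads/sec:  {avg_read_qps * PEAK_FACTOR:.0f}")
```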
Storage Estimation:
- What data do we need to store for each link?
  - `short_key`: 6-8 characters. Let's say 8 bytes.
  - `original_url`: URLs can be long. Let's average 500 bytes.
  - `user_id`: 8 bytes (64-bit integer).
  - `created_at`: 8 bytes.
- Total per link: ~524 bytes. Let's round up to ~0.5 KB per link.
- Total storage per month: 100 million links × 0.5 KB/link = 50 million KB = 50 GB.
- Total storage per year: 50 GB/month × 12 months = **600 GB/year**.
- Total storage over 5 years: 600 GB × 5 = **3 TB**.
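The same storage arithmetic, scripted. The field sizes are the assumptions above; note that the unrounded 524 bytes gives ~52 GB/month versus the rounded 50 GB figure in the text:

```python
# Storage growth estimate for TinyLink, using the per-link field sizes above.
BYTES_PER_LINK = 8 + 500 + 8 + 8   # short_key + original_url + user_id + created_at
LINKS_PER_MONTH = 100_000_000

monthly_gb = LINKS_PER_MONTH * BYTES_PER_LINK / 1e9
yearly_gb = monthly_gb * 12
five_year_tb = yearly_gb * 5 / 1000

print(f"per link:  {BYTES_PER_LINK} bytes (round up to ~0.5 KB)")
print(f"per month: ~{monthly_gb:.0f} GB")
print(f"per year:  ~{yearly_gb:.0f} GB")
print(f"5 years:   ~{five_year_tb:.1f} TB")
```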
The data flow for creating a link and its impact on storage can be visualized.
```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryBorderColor": "#7b1fa2"}}}%%
flowchart TD
    classDef client fill:#e1f5fe,stroke:#1976d2,stroke-width:2px
    classDef service fill:#fffde7,stroke:#fbc02d,stroke-width:2px
    classDef storage fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    A[User Request POST longUrl]
    B[Application Server]
    C[Key Generation Service]
    D[Database Write]
    E[Primary Table 500B per row]
    F[Index on ShortKey 16B per row]
    A --> B
    B --> C
    C -- Returns unique shortKey --> B
    B -- Writes shortKey originalUrl --> D
    D --> E
    D --> F
    class A client
    class B,C service
    class D,E,F storage
```
This diagram shows that for every write request, we interact with a key generation service and then perform a database write. The critical insight for storage calculation is that the write impacts not just the primary data table (`E`), but also any indexes (`F`). While the primary row is ~500 bytes, the index might only be 16 bytes (e.g., the short key and a pointer to the main row). For a read-heavy system, this index is crucial, and its size matters. 3 TB over 5 years is not a trivial amount, but it's certainly manageable. It doesn't scream "we need a petabyte-scale distributed file system."
Phase 3: Calculate Resources
Now we connect our estimates to real hardware and services.
Storage: 3 TB is well within the capacity of a single modern database server using SSDs. We would want replication for durability, but the total data size itself doesn't force a sharded architecture from day one. An RDS or managed Postgres/MySQL instance could handle this for years.
Read Latency & Memory: This is the most critical part for user experience. A redirect should be fast.
- Our peak read QPS is ~1000. Can we serve this from memory?
- Let's analyze the read path.
```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant LoadBalancer
    participant AppServer
    participant Cache as In-Memory Cache Redis
    participant Database
    User->>Browser: Clicks tiny.link/abcdef
    Browser->>LoadBalancer: GET /abcdef
    LoadBalancer->>AppServer: GET /abcdef
    AppServer->>Cache: GET abcdef
    alt Cache Hit
        Cache-->>AppServer: returns longUrl
    else Cache Miss
        AppServer->>Database: SELECT longUrl WHERE shortKey=abcdef
        Database-->>AppServer: returns longUrl
        AppServer->>Cache: SET abcdef longUrl
    end
    AppServer-->>Browser: 301 Redirect to longUrl
    Browser-->>User: Navigates to original long URL
```
This sequence diagram illustrates the read path. The key to low latency is the in-memory cache (like Redis or Memcached). Can we fit our "hot" data set in the cache?
- Let's assume a Pareto principle (80/20 rule): 80% of reads go to 20% of the links.
- Total links after 5 years: 100M/month × 12 × 5 = 6 billion links.
- 20% of 6 billion is 1.2 billion links.
- Data to cache per link: `short_key` (8 bytes) + `long_url` (500 bytes) ≈ 508 bytes.
- Total cache size needed for hot set: 1.2 billion × 508 bytes ≈ 609.6 GB.
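Scripting the cache-sizing arithmetic makes it easy to replay with different assumptions:

```python
# Hot-set cache sizing under a Pareto assumption:
# 20% of the links serve 80% of the reads.
TOTAL_LINKS = 100_000_000 * 12 * 5   # 6 billion links after 5 years
HOT_FRACTION = 0.20
BYTES_PER_CACHE_ENTRY = 8 + 500      # short_key + long_url

hot_links = TOTAL_LINKS * HOT_FRACTION
cache_gb = hot_links * BYTES_PER_CACHE_ENTRY / 1e9

print(f"hot links:  {hot_links:,.0f}")      # 1.2 billion
print(f"cache size: ~{cache_gb:.0f} GB")    # ~610 GB
```

Halving the assumed average URL length, or caching the top 1% instead of 20%, changes the answer by an order of magnitude; that is exactly the conversation the estimate is meant to provoke.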
This is a critical finding. 600+ GB is a lot of RAM. It's achievable with a cluster of cache servers, but it's not cheap. This calculation immediately tells us that a simple "cache everything" strategy might be too expensive. It forces us to ask better questions:
- Maybe we can use a more memory-efficient format in the cache?
- Maybe the average URL is much shorter than 500 bytes? (This is why clarifying assumptions is key!)
- Maybe we only cache the most popular 1% of links, not 20%.
This simple calculation has guided us from a vague "use a cache" to a specific, data-driven discussion about caching strategy and cost.
Traps the Hype Cycle Sets for You
Armed with this framework, you can now spot architectural anti-patterns that arise from resume-driven development or chasing trends.
The "We Need Microservices" Trap: Our URL shortener has a write QPS of ~100 and a read QPS of ~1000. A single modern server can easily handle thousands, if not tens of thousands, of requests per second. The problem is I/O bound (database and cache access), not CPU bound. Splitting this into a `LinkCreationService` and a `RedirectService` at the outset adds network latency, deployment complexity, and operational overhead for zero tangible benefit. A simple monolith would be faster, cheaper, and easier to manage initially.

The "Globally Distributed Database" Trap: Someone might suggest using Spanner or CockroachDB to serve redirects with low latency worldwide. But what did our calculation show? The bottleneck is the redirect itself, which is a single HTTP round trip. Our service's job is just to return a `301 Moved Permanently` response with a `Location` header. The user's browser then makes a new request to the destination URL. The latency of our service is dwarfed by the latency of the user navigating to the final page. A better solution is to deploy read replicas of our database and cache in different regions, served by geo-DNS, rather than using a complex and expensive globally active database. The calculation grounds us in what actually matters for user-perceived latency.

The "Big Data" Trap: "We have billions of records, so we need Kafka, Spark, and a data lake!" Our 5-year estimate was 3 TB. This is not "small" data, but it is certainly not "big data" in the sense that it requires a massive distributed processing pipeline. A single Postgres or MySQL instance with proper indexing can handle 3 TB effectively. Reporting and analytics can be done on a read replica without impacting production traffic. Don't reach for the Hadoop-sized hammer when a regular claw hammer will do the job.
Architecting for the Future: Your First Move on Monday Morning
The ability to perform these quick calculations is not a static skill. It's a muscle that needs to be exercised. It's the difference between being a system "assembler" who just connects pre-built components and a true system "architect" who understands the trade-offs at a fundamental level.
Your goal is not to be perfectly accurate. It's to be in the right ballpark. Is the problem measured in gigabytes or petabytes? In hundreds of QPS or hundreds of thousands? The order of magnitude is what dictates the architecture. Getting this right is the single most important step in designing a system that is both scalable and maintainable.
So, what is your first move on Monday morning?
Pick a single, critical service that your team owns. Close your monitoring dashboards. Put away the infrastructure cost reports. Take out a piece of paper or open a blank text file. From first principles, try to estimate its core metrics:
- Average and peak QPS.
- Daily data storage growth.
- The size of its "hot" dataset in memory.
- Its monthly cloud bill.
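As a starting point, here is a hypothetical estimation template. Every field is an assumption you fill in before looking at the dashboards; `ServiceEstimate` and its example numbers are illustrative, not from any real system:

```python
from dataclasses import dataclass

@dataclass
class ServiceEstimate:
    """First-principles estimate for one service. All inputs are guesses
    you write down *before* opening the monitoring dashboards."""
    requests_per_day: float
    peak_factor: float          # e.g. 2-3x the average
    bytes_per_request: float    # data stored per request
    hot_set_fraction: float     # share of data that must live in memory

    @property
    def avg_qps(self) -> float:
        return self.requests_per_day / 86_400

    @property
    def peak_qps(self) -> float:
        return self.avg_qps * self.peak_factor

    @property
    def daily_storage_gb(self) -> float:
        return self.requests_per_day * self.bytes_per_request / 1e9

# Example guess: 50M requests/day, ~1 KB stored per request.
svc = ServiceEstimate(50_000_000, 2.5, 1_000, 0.1)
print(f"avg QPS: {svc.avg_qps:.0f}, peak QPS: {svc.peak_qps:.0f}, "
      f"storage/day: {svc.daily_storage_gb:.0f} GB")
```

Write the numbers down, then compare against reality; the deltas are where your intuition needs work.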
Then, open the dashboards and compare. Where were you right? Where were you off by a factor of 10 or 100? Why? Did you misjudge the average payload size? Did you forget about log generation? Did you underestimate the cost of network egress? This exercise, repeated over time, is how you build true architectural intuition. It's how you go from knowing the name of a tool to understanding its cost and purpose.
I'll leave you with this question: The next time someone proposes a new architecture, will you be the one nodding along with the buzzwords, or will you be the one who pulls out a pen, scribbles for 30 seconds, and asks, "Have we considered that this will require 50 terabytes of RAM to be effective?"
TL;DR
- Core Idea: Back-of-the-envelope calculations are not just for interviews; they are a fundamental tool to fight over-engineering and build architectural intuition.
- Know Your Numbers: Internalize key latency figures (CPU, RAM, SSD, Network) and data size conversions (powers of two). These are the physical constraints of your system.
- Use a Framework: Don't guess. Follow a structured approach: 1) Clarify requirements, 2) Estimate scale (QPS, storage), 3) Calculate resources (servers, bandwidth), and 4) Sanity-check your assumptions.
- Focus on the Bottleneck: The goal of estimation is to find the dominant constraint. Is it storage, compute, network, or latency? Design for that constraint first.
- Case Study Example: A URL shortener with 100M writes/month needs ~40 write QPS and grows by ~600 GB/year. The read cache size (~600 GB for the hot set) is a more significant architectural driver than the write QPS or raw storage growth.
- Avoid Hype: Use your calculations to challenge assumptions. Do you really need microservices for 1000 QPS? Is a globally distributed database necessary when you can use regional read replicas?
- Actionable Advice: Practice on your own systems. Estimate their performance and cost from first principles, then compare with reality to hone your intuition.
Written by Felipe Rodrigues.