Mastering the System Design Interview: A Four-Phase Framework

Felipe Rodrigues
17 min read

In the high-stakes arena of senior engineering interviews, the system design round often stands as the ultimate proving ground. It's where theoretical knowledge meets practical application, where abstract concepts must translate into tangible, scalable architectures. A recent industry report revealed that over 65% of hiring managers for senior software roles consider system design skills to be the most critical differentiator among candidates. Yet, many experienced engineers, despite their technical prowess, struggle to articulate a coherent, comprehensive design under pressure. They might jump straight to a solution, overlook crucial non-functional requirements, or fail to manage their time effectively.

This article aims to equip you, the seasoned backend engineer and aspiring architect, with a robust, repeatable framework for navigating any system design interview. We will dissect the process into actionable phases, providing deep technical insights, strategic decision-making criteria, and practical tips to transform your approach from reactive problem-solving to proactive, structured system design. You will learn not just what to design, but how to think, communicate, and iterate like a true system architect, ensuring you leave the interviewer with a clear understanding of your capabilities and a strong impression of your design acumen.


The Architect's Blueprint: A Comprehensive System Design Interview Framework

System design interviews are not about finding the "perfect" solution; they are about demonstrating your thought process, your ability to make informed trade-offs, and your command over the fundamental principles of distributed systems. Approaching these interviews with a systematic framework transforms a daunting challenge into a manageable, even enjoyable, problem-solving exercise. Our framework comprises four distinct, yet interconnected, phases: Clarify Requirements, High-Level Design, Deep Dive & Refinement, and Operational & Edge Cases.

Phase 1: Clarify Requirements – The Foundation of Design

The most common pitfall in system design interviews is rushing into a solution without fully understanding the problem. This phase is about asking insightful questions to define the problem space comprehensively. Think of it as a discovery session with a product manager.

Functional Requirements: What Does the System Do?

Start with the core features. What are the primary user interactions?

  • User Management: Registration, login, profile management.
  • Core Functionality: For a URL shortener, it's shorten(long_url) -> short_url and redirect(short_url) -> long_url. For a chat system, it's sending/receiving messages, creating groups.
  • Data Types: What kind of data will the system store and process? Text, images, videos?
  • User Roles: Are there different types of users (e.g., admin, regular user, premium user)?

Example: For designing a "Twitter-like" feed:

  • Users can post tweets (max 280 characters).
  • Users can follow other users.
  • Users see a personalized timeline of tweets from people they follow.
  • Tweets can contain text, images, or videos.
  • Users can like, retweet, and reply to tweets.

Non-Functional Requirements: How Well Does the System Perform?

These are often overlooked but are critical in shaping the architecture. Quantify them where possible.

  • Scale (Traffic & Storage):
    • QPS (Queries Per Second): How many read/write requests per second? "Twitter has 330 million monthly active users, sending 500 million tweets per day. That averages ~5,800 tweets/sec." This is the average write QPS; peaks can run several times higher. Read QPS (timeline fetches) is typically far higher still, perhaps 10x or 100x writes.
    • Data Volume: How much data will be stored? "If each tweet is ~2KB (text + metadata), 500M tweets/day is 1TB/day. Over 5 years, that's 1.8PB of data."
    • Concurrent Users: How many users are active simultaneously?
  • Latency:
    • Read Latency: How quickly should a user's timeline load? "Sub-200ms for timeline fetch for a smooth UX."
    • Write Latency: How quickly should a tweet be posted? "Sub-50ms for tweet post."
  • Availability: How much downtime is acceptable? "High availability, 99.99% (four nines) or 99.999% (five nines) for core services." Four nines allows roughly 52 minutes of downtime per year; five nines, about 5 minutes.
  • Consistency: What level of consistency is required? Strong, eventual, or causal? "Timeline views can be eventually consistent, but user profiles or follower counts should be strongly consistent."
  • Durability: How resilient is the data to loss? "Data must be durable, no data loss."
  • Reliability: How well does the system handle failures?
  • Security: Authentication, authorization, data encryption (in transit and at rest).
  • Cost: Budget constraints for infrastructure.
  • Maintainability/Operability: Ease of deployment, monitoring, debugging.

Trade-offs: This is where the CAP theorem often comes into play. For a globally distributed system like Twitter, strong consistency across all data might be impossible or prohibitively expensive while maintaining high availability and partition tolerance. Eventual consistency for timeline delivery is often acceptable.

Phase 2: High-Level Design – The Architectural Blueprint

Once requirements are clear, sketch out the major components and their interactions. This phase is about establishing the macro-architecture.

API Design & Communication Protocols

  • Define the main API endpoints. RESTful APIs are common for web services, but consider GraphQL for flexible data fetching or gRPC for high-performance, low-latency microservice communication.
    • POST /tweets (create tweet)
    • GET /users/{id}/timeline (fetch user timeline)
    • POST /users/{id}/follow (follow a user)
  • Choose between synchronous (request-response) and asynchronous (event-driven) communication.

Core Components Identification

Identify the fundamental building blocks:

  • Clients: Web, mobile apps.
  • API Gateway/Load Balancer: Entry point, traffic distribution.
  • Services: Break down functionality into logical services (e.g., User Service, Tweet Service, Timeline Service, Notification Service). This hints at a microservices architecture.
  • Databases:
    • Relational (SQL): For structured data with strong consistency and complex relationships (e.g., User Service, Follower Graph). PostgreSQL, MySQL.
    • NoSQL: For high-volume, flexible data, or specific access patterns.
      • Key-Value Store: Redis (caching, rate limiting).
      • Document Database: MongoDB (flexible, nested documents).
      • Wide-Column Store: Cassandra (for large-scale, write-heavy data like tweets or user feeds).
      • Graph Database: Neo4j (for complex relationships like follower graphs, though often modeled in relational DBs or specialized NoSQL).
  • Caches: For frequently accessed data to reduce database load and improve latency (e.g., Redis, Memcached).
  • Message Queues: For asynchronous processing, decoupling services, buffering writes (e.g., Kafka, RabbitMQ, SQS).
  • Storage: Object storage for media (S3).
  • Content Delivery Network (CDN): For static assets and media (Akamai, Cloudflare).

Data Model & Schema Design

Sketch out simplified schemas for critical entities.

  • User: id, username, email, password_hash, profile_picture_url
  • Tweet: id, user_id, text, media_urls, timestamp, likes_count, retweets_count
  • Follows: follower_id, followee_id (composite primary key)
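These entities can be sketched as plain Python dataclasses. This is an illustrative model only: field names follow the schema above, while a production schema would add indexes, created-at timestamps on Follows, and integrity constraints.

```python
from dataclasses import dataclass, field
from typing import List

# The three core entities from the simplified schema above.

@dataclass
class User:
    id: int
    username: str
    email: str
    password_hash: str
    profile_picture_url: str = ""

@dataclass
class Tweet:
    id: int
    user_id: int
    text: str                          # max 280 chars, enforced at the service layer
    media_urls: List[str] = field(default_factory=list)
    timestamp: float = 0.0
    likes_count: int = 0
    retweets_count: int = 0

@dataclass
class Follow:
    follower_id: int                   # composite primary key: (follower_id, followee_id)
    followee_id: int
```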

Capacity Estimation

This is crucial for sizing the system.

  • Storage: (As calculated earlier) 1.8PB for 5 years of tweets.
  • QPS:
    • Tweet writes: ~5,800/sec (average); provision for peaks several times higher.
    • Timeline reads: If average user fetches timeline 10 times a day, and 330M active users, that's 3.3B reads/day, or ~38,000 reads/sec (average). Peak could be significantly higher.
  • Network Bandwidth: Based on QPS and average request/response size.
  • CPU/Memory: Estimate based on expected processing per request.
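The estimates above are easy to reproduce as back-of-envelope arithmetic. All inputs (tweets/day, bytes/tweet, fetches/user/day) are the article's assumed figures, not measured data; decimal units (1 TB = 10^12 B) are used for simplicity.

```python
# Back-of-envelope capacity numbers for the Twitter-like example above.
TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 2_000                    # ~2 KB text + metadata
ACTIVE_USERS = 330_000_000
FETCHES_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400

write_qps = TWEETS_PER_DAY / SECONDS_PER_DAY                          # ~5,787/sec average
read_qps = ACTIVE_USERS * FETCHES_PER_USER_PER_DAY / SECONDS_PER_DAY  # ~38,194/sec average
storage_per_day_tb = TWEETS_PER_DAY * BYTES_PER_TWEET / 1e12          # 1.0 TB/day
storage_5y_pb = storage_per_day_tb * 365 * 5 / 1_000                  # ~1.8 PB over 5 years
```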

Example: Twitter's Core Components

  • Tweet Service: Handles tweet creation, storage, and retrieval. Likely uses a sharded NoSQL database (e.g., Cassandra for its high write throughput and scalability).
  • User Service: Manages user profiles, authentication. Could use a relational database for strong consistency.
  • Follower Service: Manages follower relationships. Can be a relational DB or a specialized graph database.
  • Timeline Service: The most complex. Generates personalized feeds. This involves fan-out strategies.

Phase 3: Deep Dive & Refinement – The Nitty-Gritty Architecture

Select one or two critical components identified in HLD and delve into their internal workings. This demonstrates your ability to think deeply about specific challenges.

Scalability Strategies

  • Horizontal Scaling: Adding more machines to distribute load. Requires stateless services.
  • Sharding/Partitioning: Distributing data across multiple database instances.
    • Hash-based: Distribute based on user_id or tweet_id.
    • Range-based: Distribute based on a range of IDs or timestamps.
    • Directory-based: A lookup service maps data to shards.
    • Consistent Hashing: Minimizes key remapping when nodes join or leave; common for distributed caches and data stores.
  • Replication: Maintaining multiple copies of data for high availability and read scalability (master-slave, multi-master).
  • Load Balancing: Distributing requests across servers (Layer 4/7, DNS-based).
  • Data Indexing: For efficient data retrieval (e.g., inverted index for search, B-tree indexes for database lookups).
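Consistent hashing is worth being able to sketch on the spot. The toy ring below (virtual nodes, MD5 chosen only for illustration) shows the core idea: a key always maps to the first node clockwise on the ring, so adding a node remaps only the keys that fall into its slice.

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes. Illustrative sketch,
# not production code.
class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []                          # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):             # place vnodes copies on the ring
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)  # first point clockwise
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
node = ring.get_node("user:12345")               # stable: same key, same node
```

The key property to state in an interview: when "cache-d" joins, a key either keeps its old node or moves to "cache-d"; it never shuffles between existing nodes.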

Caching Strategies

  • Where to Cache: Client-side, CDN, Gateway, Service-level, Database-level.
  • What to Cache: User profiles, popular tweets, frequently accessed timelines.
  • Caching Policies: LRU (Least Recently Used), LFU (Least Frequently Used), TTL (Time To Live).
  • Cache Write/Invalidation Strategies: Write-through, write-back (write-behind), refresh-ahead.
  • Cache Eviction: Removing entries when capacity is reached, per the chosen policy; plan the fallback path (database read, cache fill) on a miss.
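LRU is the policy interviewers most often ask you to explain, and it is small enough to sketch. A minimal in-process version (real deployments would use Redis or Memcached) looks like this:

```python
from collections import OrderedDict

# Tiny LRU cache: on overflow, the least recently used entry is dropped.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                          # cache miss: caller falls back to the DB
        self._data.move_to_end(key)              # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict least recently used

cache = LRUCache(2)
cache.put("user:1", {"name": "ada"})
cache.put("user:2", {"name": "bob"})
cache.get("user:1")                              # touch user:1
cache.put("user:3", {"name": "eve"})             # evicts user:2, the coldest entry
```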

Asynchronous Processing & Message Queues

  • Fan-out on Write (Push Model): When a user tweets, push the tweet to the inboxes of all their followers. This is common for high fan-out scenarios like Twitter, where most reads are timeline fetches.
    • Pros: Reads are fast (pre-computed).
    • Cons: High write amplification, potential for hot followers (celebrities).
  • Fan-out on Read (Pull Model): When a user requests their timeline, fetch tweets from all followed users and merge them.
    • Pros: Less write amplification, simpler writes.
    • Cons: Reads can be slow, especially for users following many people.
  • Hybrid Model: A combination, e.g., push for active users, pull for inactive users.
  • Message Queues (Kafka, RabbitMQ, SQS): Decouple services, buffer spikes, ensure reliable delivery, enable asynchronous processing (e.g., sending notifications, analytics processing).
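The hybrid model above can be sketched in a few lines. In this toy version, in-memory dicts stand in for the follower store and per-user inboxes (Redis/Cassandra in practice), and the celebrity threshold is an arbitrary illustrative number; production fan-out would run asynchronously off a queue.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000     # hypothetical cutoff for "hot" accounts

followers = defaultdict(set)     # author_id -> set of follower ids
inboxes = defaultdict(list)      # user_id -> list of tweet ids (newest last)

def post_tweet(author_id: int, tweet_id: int) -> None:
    fans = followers[author_id]
    if len(fans) >= CELEBRITY_THRESHOLD:
        return                   # hot account: skip fan-out, followers pull at read time
    for follower_id in fans:     # fan-out on write: push to every follower's inbox
        inboxes[follower_id].append(tweet_id)

followers[1] = {100, 101, 102}
post_tweet(author_id=1, tweet_id=555)    # lands in inboxes 100, 101, and 102
```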

Database Choices & Trade-offs

  • SQL (PostgreSQL, MySQL):
    • Pros: ACID properties, strong consistency, complex queries (JOINs), mature ecosystem.
    • Cons: Vertical scaling limits, sharding complexity.
    • Use cases: User data, follower graph (if not too complex), financial transactions.
  • NoSQL (Cassandra, MongoDB, DynamoDB):
    • Pros: Horizontal scalability, high availability, flexible schema, high write/read throughput.
    • Cons: Eventual consistency (often), limited JOINs, less mature tooling for complex queries.
    • Use cases: Tweet storage, timeline storage, activity feeds, large-scale sensor data.
  • Redis: In-memory data structure store.
    • Pros: Extremely fast reads/writes, supports various data structures (lists, sets, hashes).
    • Cons: Data persistence requires care, memory limits.
    • Use cases: Caching, session management, real-time leaderboards, rate limiting.
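Rate limiting, one of the Redis use cases listed above, is itself a common follow-up question. A token bucket is the usual answer; the sketch below keeps it in-process (a shared Redis script would be used across servers), refilling `rate` tokens per second up to `capacity`.

```python
import time

# Token-bucket rate limiter: allows bursts up to `capacity`, sustains `rate`/sec.
class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)       # burst of 2, 5 req/sec sustained
results = [bucket.allow() for _ in range(3)]   # third back-to-back call is throttled
```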

Example: Twitter Timeline Service Deep Dive

  • Problem: Generating personalized timelines for millions of users with low latency.
  • Solution: Hybrid Fan-out Model.
    • Tweet Posting: User posts a tweet. Tweet Service writes to a global Tweet Store (e.g., sharded Cassandra).
    • Fan-out to Inbox: A separate "Fanout Service" (triggered by a message queue like Kafka) reads the new tweet. For each follower, it inserts the tweet ID into the follower's "Inbox" (a personalized feed, also a sharded key-value store, e.g., Redis or another Cassandra table). For users with millions of followers (celebrities), this fan-out can be huge. Special handling: for these "hot" accounts, tweets might not be pushed to all followers' inboxes. Instead, their followers might "pull" directly from the celebrity's public feed or a dedicated cache.
    • Timeline Fetch: When a user requests their timeline, the Timeline Service fetches tweet IDs from their Inbox (very fast). Then, it fetches the full tweet content from the global Tweet Store (potentially cached).
  • Optimization:
    • Caching: User timelines, popular tweets are heavily cached.
    • Microservices: Dedicated services for fan-out, timeline generation, etc.
    • Load Balancing: Distribute requests across multiple instances of each service.
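The read path of this hybrid model is essentially a k-way merge: combine the precomputed inbox with tweets pulled live from followed celebrity accounts, newest first. A sketch with in-memory stand-ins for the stores (tweets as `(timestamp, tweet_id)` pairs, each list sorted newest-first):

```python
import heapq

def fetch_timeline(inbox, celebrity_feeds, limit=20):
    """Merge pre-computed inbox with pulled celebrity feeds, newest first."""
    merged = heapq.merge(inbox, *celebrity_feeds, reverse=True)
    return [tweet_id for _, tweet_id in list(merged)[:limit]]

inbox = [(105, "t5"), (103, "t3"), (101, "t1")]   # pushed by the Fanout Service
celeb = [[(104, "t4"), (102, "t2")]]              # pulled at read time
timeline = fetch_timeline(inbox, celeb, limit=4)  # -> ["t5", "t4", "t3", "t2"]
```

A real Timeline Service would then hydrate these IDs from the (cached) Tweet Store, which is the second hop described above.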

Architecture Diagrams

Visualizing the system is paramount. These diagrams help clarify components, data flow, and interactions.

Diagram 1: High-Level System Flow for a Social Media Platform

This diagram illustrates the general flow of a user interaction, from client request to data retrieval and asynchronous processing.

flowchart TD
    Client[Client Application] --> API[API Gateway]

    API --> |User Request| UserService[User Service]
    API --> |Tweet Post| TweetService[Tweet Service]
    API --> |Timeline Fetch| TimelineService[Timeline Service]

    UserService --> UserDB[(User Database)]
    TweetService --> TweetDB[(Tweet Database)]

    TimelineService --> TimelineCache{Redis Cache}
    TimelineCache --> |Cache Hit| Client
    TimelineCache --> |Cache Miss| TweetDB
    TweetDB --> TimelineCache
    TweetDB --> TimelineService
    TimelineService --> Client

    TweetService --> |Post Event| MessageQueue[Message Queue]
    MessageQueue --> FanoutService[Fanout Service]
    FanoutService --> TimelineDB[(Timeline Database)]
    TimelineDB --> TimelineCache

    style Client fill:#e1f5fe
    style API fill:#f3e5f5
    style UserService fill:#e8f5e8
    style TweetService fill:#fff3e0
    style TimelineService fill:#ffebee
    style UserDB fill:#f1f8e9
    style TweetDB fill:#f1f8e9
    style TimelineCache fill:#e0f2f1
    style MessageQueue fill:#fce4ec
    style FanoutService fill:#e3f2fd
    style TimelineDB fill:#f1f8e9

Explanation: This flow depicts a simplified social media platform. Client requests hit an API Gateway, which routes them to appropriate microservices (User, Tweet, Timeline). The Tweet Service writes to a dedicated Tweet Database and publishes an event to a Message Queue. A Fanout Service consumes this event, processing it to update user timelines stored in a Timeline Database, which is then cached by Redis. The Timeline Service fetches data primarily from this cache, falling back to the Tweet Database on a cache miss. This illustrates both synchronous request-response and asynchronous event-driven flows.

Diagram 2: Microservices Component Architecture

This diagram breaks down the system into logical microservices, highlighting their interdependencies and data stores.

graph TD
    subgraph Frontend Layer
        WebApp[Web Application]
        MobileApp[Mobile App]
    end

    subgraph API Layer
        APIProxy[API Gateway]
    end

    subgraph Core Services
        UserSVC[User Service]
        TweetSVC[Tweet Service]
        TimelineSVC[Timeline Service]
        FollowerSVC[Follower Service]
        NotificationSVC[Notification Service]
    end

    subgraph Data Stores
        UserDB[(User Database)]
        TweetDB[(Tweet Database)]
        TimelineDB[(Timeline Database)]
        FollowerDB[(Follower Database)]
        CacheStore{Redis Cache}
    end

    subgraph Asynchronous Processing
        MessageBus[Kafka Message Bus]
        FanoutWorker[Fanout Worker]
    end

    WebApp --> APIProxy
    MobileApp --> APIProxy

    APIProxy --> UserSVC
    APIProxy --> TweetSVC
    APIProxy --> TimelineSVC

    UserSVC --> UserDB
    TweetSVC --> TweetDB
    FollowerSVC --> FollowerDB
    TimelineSVC --> TimelineDB

    TweetSVC --> |Publishes Event| MessageBus
    MessageBus --> FanoutWorker
    FanoutWorker --> TimelineDB

    TimelineSVC --> |Reads from| CacheStore
    CacheStore --> TimelineDB

    NotificationSVC --> |Consumes Events| MessageBus
    NotificationSVC --> UserSVC

    style WebApp fill:#e8f5e8
    style MobileApp fill:#e8f5e8
    style APIProxy fill:#fff3e0
    style UserSVC fill:#e1f5fe
    style TweetSVC fill:#e1f5fe
    style TimelineSVC fill:#e1f5fe
    style FollowerSVC fill:#e1f5fe
    style NotificationSVC fill:#e1f5fe
    style UserDB fill:#f1f8e9
    style TweetDB fill:#f1f8e9
    style TimelineDB fill:#f1f8e9
    style FollowerDB fill:#f1f8e9
    style CacheStore fill:#e0f2f1
    style MessageBus fill:#fce4ec
    style FanoutWorker fill:#e3f2fd

Explanation: This diagram presents a more detailed component view of the social media platform. It categorizes components into logical layers: Frontend, API, Core Services, Data Stores, and Asynchronous Processing. It clearly shows how services interact, which services depend on which databases, and how asynchronous communication via a message bus facilitates decoupling and fan-out operations. The Fanout Worker, for example, processes events from the Message Bus to populate the Timeline Database.

Diagram 3: User Authentication Sequence

This sequence diagram illustrates the step-by-step process of a user logging in, demonstrating interaction between client, API gateway, authentication service, and user database.

sequenceDiagram
    participant Client as Client Application
    participant API as API Gateway
    participant Auth as Authentication Service
    participant UserDB as User Database

    Client->>API: POST /login (username, password)
    API->>Auth: Validate Credentials

    Auth->>UserDB: Query User by Username
    alt User Exists
        UserDB-->>Auth: User Data (hashed password)
        Auth->>Auth: Compare Passwords
        alt Password Match
            Auth-->>API: Generate JWT Token
            API-->>Client: 200 OK + Token
            Note over Client,Auth: User Authenticated
        else Password Mismatch
            Auth-->>API: 401 Unauthorized
            API-->>Client: 401 Unauthorized
        end
    else User Not Found
        UserDB-->>Auth: No User Data
        Auth-->>API: 401 Unauthorized
        API-->>Client: 401 Unauthorized
    end

    Client->>API: Subsequent requests with Token
    API->>Auth: Verify Token (on each request)
    Auth-->>API: Token Valid/Invalid

Explanation: This sequence diagram details the user authentication flow. The client sends login credentials to the API Gateway, which forwards them to the Authentication Service. The Auth Service queries the User Database to retrieve user details, then validates the password. Depending on the outcome, it either issues a JWT token (on success) or returns an error. Subsequent requests from the client include this token, which the API Gateway validates with the Auth Service to ensure the user is authorized.
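The issue/verify steps in the sequence can be sketched with nothing but the standard library. This is a simplified JWT-style token (base64url payload plus an HMAC-SHA256 signature), not the full JWT spec; a real service would use a vetted JWT library with a proper header, expiry claims, and key rotation, and the hard-coded secret here is purely hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-key"   # hypothetical; load from a secret store in production

def _b64(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def issue_token(payload: dict) -> str:
    # token = base64url(payload) + "." + base64url(HMAC-SHA256 signature)
    body = _b64(json.dumps(payload, sort_keys=True).encode())
    sig = _b64(hmac.new(SECRET, body, hashlib.sha256).digest())
    return (body + b"." + sig).decode()

def verify_token(token: str) -> bool:
    body, _, sig = token.encode().partition(b".")
    expected = _b64(hmac.new(SECRET, body, hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)    # constant-time comparison

token = issue_token({"sub": "user-42"})          # issued on successful login
```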

Phase 4: Practical Implementation – Mastering the Interview

Knowing the framework is one thing; effectively applying it under interview pressure is another. This section covers how to execute each phase within the typical 45-60 minute interview slot, common pitfalls, and best practices.

Step-by-Step Application in a 45-Minute Interview

  1. 0-10 Minutes: Clarify Requirements & Scope

    • Goal: Understand the problem, define functional and non-functional requirements.
    • Action: Ask probing questions. "Who are the users?", "What's the expected traffic (QPS)?", "What's the acceptable latency?", "Is strong consistency needed?", "What are the core features?" Prioritize. Write them down.
    • Example: For a "Google Docs collaborative editor," clarify: real-time updates? Offline support? Version history? Concurrent users on one document?
    • Pitfall: Jumping to solutions. Not asking enough questions.
    • Tip: Start broad, then narrow down. Use numbers for non-functional requirements.
  2. 10-25 Minutes: High-Level Design

    • Goal: Sketch the major components, data flow, and API endpoints.
    • Action: Draw a block diagram on the whiteboard. Identify services (e.g., User Service, Document Service, Collaboration Service), databases (SQL/NoSQL choice), caches, message queues. Define key API endpoints (e.g., GET /document/{id}, PUT /document/{id}/content). Estimate QPS and storage requirements.
    • Example: For Google Docs, Document Service (stores content), Collaboration Service (handles real-time ops), User Service. Databases for documents (NoSQL for flexibility), user metadata (SQL).
    • Pitfall: Getting bogged down in low-level details too early. Not justifying component choices.
    • Tip: Think about the 80/20 rule: cover the most critical parts. Justify your choices with trade-offs.
  3. 25-40 Minutes: Deep Dive / Component Design & Scalability

    • Goal: Choose one critical component and elaborate on its internal design, scalability, and reliability.
    • Action: Pick the most challenging part (e.g., the real-time collaboration engine for Google Docs). Discuss:
      • Data Model: How are changes represented (CRDTs, operational transformations)?
      • Communication: WebSockets for real-time.
      • Scalability: Sharding strategies for documents, horizontal scaling of collaboration servers.
      • Consistency: How to ensure eventual consistency across clients.
      • Failure Modes: What happens if a server goes down? How are conflicts resolved?
    • Example: For Google Docs, discuss a "Delta Sync" approach where clients send small diffs, and the server merges them. Explain CRDTs or Operational Transformation for conflict resolution.
    • Pitfall: Not being able to go deep enough. Not considering failure scenarios.
    • Tip: Show your depth. Use specific architectural patterns (e.g., pub/sub, leader-follower). Talk about monitoring and alerting.
  4. 40-45 Minutes: Operational Aspects, Edge Cases & Future Considerations

    • Goal: Address robustness, cost, maintenance, and future growth.
    • Action: Briefly discuss monitoring, logging, alerting (Prometheus, Grafana). How would you deploy this (CI/CD, Kubernetes)? Backup and disaster recovery strategies. Security considerations (AuthN/AuthZ, encryption). Mention potential future features and how the design accommodates them.
    • Example: For Google Docs, discuss how to handle network partitions, data corruption, or large file sizes. How would you add commenting or versioning?
    • Pitfall: Running out of time before addressing these.
    • Tip: This shows you think beyond just the happy path. It demonstrates maturity and a holistic understanding of system lifecycle.
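The deep-dive step above mentions CRDTs for conflict-free merging. If asked to make that concrete, the grow-only counter (G-Counter) is the simplest example worth sketching: each replica increments only its own slot, and merge takes the element-wise max, so replicas that diverged during a partition converge without coordination.

```python
# Minimal G-Counter CRDT sketch: increments are per-replica, merge is
# element-wise max, value is the sum across replicas.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}                         # replica_id -> count

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("dc-east"), GCounter("dc-west")
a.increment(3)                                   # 3 likes recorded in one region
b.increment(2)                                   # 2 likes recorded in the other
a.merge(b); b.merge(a)                           # replicas exchange state and converge
```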

Common Pitfalls and How to Avoid Them

  • Pitfall 1: Jumping Straight to a Solution:
    • Avoid: Always start with clarifying questions. Write down requirements. This shows structured thinking and prevents designing the wrong system.
  • Pitfall 2: Neglecting Non-Functional Requirements:
    • Avoid: Explicitly ask about QPS, latency, availability, consistency. These drive architectural choices. Quantify them.
  • Pitfall 3: Not Justifying Decisions (No Trade-offs):
    • Avoid: Every architectural choice has pros and cons. Explain why you chose SQL over NoSQL for a specific component, or a push model over a pull model. "Given the requirement for X, and the constraint of Y, I chose Z because of A, B, and C, acknowledging the trade-off of D."
  • Pitfall 4: Getting Stuck on One Detail:
    • Avoid: Time management is key. If you're stuck, acknowledge it, propose a high-level solution, and move on. You can always revisit if time permits. The interviewer wants to see your breadth first, then depth.
  • Pitfall 5: Poor Communication/Whiteboard Skills:
    • Avoid: Draw clear, simple diagrams. Label components. Use arrows for data flow. Talk through your thought process aloud. Engage the interviewer; it's a conversation, not a lecture.
  • Pitfall 6: Ignoring Failure Scenarios:
    • Avoid: Think about what happens when a database goes down, a service crashes, or the network partitions. Discuss resilience strategies (retries, circuit breakers, idempotency).
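Retries with exponential backoff and jitter, mentioned above, are easy to demonstrate concretely. A minimal sketch (the operation must be idempotent, otherwise retries can duplicate side effects):

```python
import random
import time

def retry(op, attempts=4, base_delay=0.05, max_delay=2.0):
    """Call op(); on failure, back off exponentially with full jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                            # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay)) # jitter avoids thundering herds

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry(flaky)                            # succeeds on the third attempt
```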

Best Practices and Optimization Tips

  • Practice, Practice, Practice: Work through common system design problems (URL shortener, chat app, news feed, Uber, Netflix).
  • Mock Interviews: Get feedback from peers or mentors. This helps refine your communication and time management.
  • Stay Updated: Follow industry trends, learn about new technologies and architectural patterns. Understand why companies like Netflix or Uber chose certain architectures.
  • Focus on First Principles: Understand the underlying concepts (CAP theorem, distributed consensus, eventual consistency, caching patterns) rather than just memorizing specific technologies.
  • Prioritize: In a limited time, focus on the most impactful and challenging aspects of the system. It's better to design 80% of a critical component well than 20% of every component poorly.
  • Ask "Why": For every design decision, ask yourself "Why this?", and be ready to articulate the alternatives and their trade-offs.

Conclusion & Key Takeaways

Mastering the system design interview is less about memorizing solutions and more about internalizing a structured, adaptable problem-solving framework. By systematically moving through Clarifying Requirements, High-Level Design, Deep Dive & Refinement, and Operational & Edge Cases, you demonstrate not just your technical knowledge but also your ability to think critically, manage complexity, and communicate effectively – all hallmarks of a senior engineer and architect.

Key Decision Points to Remember:

  • Quantify Everything: Use numbers for traffic, storage, latency.
  • Justify Trade-offs: Explain the why behind your choices (e.g., consistency vs. availability).
  • Prioritize: Focus on the most critical components and non-functional requirements.
  • Handle Failures: Discuss resilience, monitoring, and disaster recovery.
  • Communicate Clearly: Use diagrams, articulate your thought process.

Actionable Next Steps:

  1. Internalize this Framework: Practice applying it to common system design problems.
  2. Review Core Concepts: Brush up on distributed systems fundamentals (CAP theorem, consistency models, sharding, caching, message queues).
  3. Study Real-World Systems: Research how companies like Netflix, Uber, and Amazon scaled their systems.
  4. Engage in Mock Interviews: Get constructive feedback to refine your approach and communication.

System design interviews are not just tests; they are opportunities to showcase your architectural prowess. By adopting a structured framework, you transform these challenges into a stage for demonstrating your expertise, paving your way to impactful senior engineering roles.


TL;DR: Ace system design interviews with a 4-phase framework: 1) Clarify Requirements (functional, non-functional, quantify); 2) High-Level Design (core components, APIs, data model, capacity estimates); 3) Deep Dive & Refinement (scalability, caching, async processing, database choices, trade-offs for a key component); 4) Operational & Edge Cases (monitoring, security, failure handling, future growth). Always justify decisions, quantify metrics, and communicate clearly. Practice with real-world examples and mock interviews.
