Implementing Read-Your-Write Consistency in Distributed Databases: A Deep Dive into Bitbucket's Approach

UJJWAL BALAJI
5 min read

In the world of distributed databases, ensuring consistency while maintaining performance is one of the most challenging problems. One consistency model that strikes a practical balance is Read-Your-Write Consistency (RYWC), which guarantees that a user always reads their own most recent write. While this concept is theoretically well understood, implementing it in practice can be tricky, especially in systems with master-replica architectures.

In this blog, we’ll explore how Bitbucket, a popular code hosting and version control service (similar to GitHub), tackled this problem to scale their database and improve performance. We’ll break down their approach, understand the underlying principles, and see how they implemented Read-Your-Write Consistency in a distributed PostgreSQL setup.

What is Read-Your-Write Consistency?

Before diving into the implementation, let’s first understand what Read-Your-Write Consistency means.

  • Strong Consistency: Every read operation returns the most recent write, regardless of which user or node performs the read.

  • Read-Your-Write Consistency (RYWC): A relaxed form of consistency where a user is guaranteed to read their own most recent write, but other users might see stale or outdated data.

For example, if User A writes a value to the database and immediately reads it, they should see the value they just wrote. However, User B might see an older value until the replication catches up.
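As a toy illustration (purely hypothetical, not how Bitbucket stores or routes data), the guarantee can be pictured like this:

# Minimal, runnable simulation of read-your-write consistency (illustrative only)
master  = {"count": 2}   # the master already has User A's latest write
replica = {"count": 1}   # the replica is still lagging behind

def read(user, key):
    # User A just wrote, so their reads must be served from a copy that has the write;
    # other users may be served by the (possibly stale) replica.
    return master[key] if user == "A" else replica[key]

print(read("A", "count"))  # 2 -> A always sees their own write
print(read("B", "count"))  # 1 -> B may see an older value until replication catches up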

The Problem: Replication Lag in Master-Replica Architectures

Bitbucket, like many distributed systems, uses a master-replica architecture for its database. Here’s how it works:

  1. Master Node: Handles all write operations and critical reads.

  2. Replica Nodes: Handle read operations that can tolerate some staleness.

The problem arises due to replication lag. When a write happens on the master, it takes some time for the change to propagate to the replicas. During this lag, if a user reads from a replica, they might not see their most recent write, violating Read-Your-Write Consistency.

Bitbucket’s Solution: Leveraging Log Sequence Numbers (LSN)

To solve this problem, Bitbucket implemented a clever solution using Log Sequence Numbers (LSN). Here’s how it works:

1. Understanding Replication in PostgreSQL

In PostgreSQL, replication works as follows:

  • The master node writes changes to a Write-Ahead Log (WAL) file.

  • Each entry in the WAL file is assigned a unique Log Sequence Number (LSN), which is monotonically increasing.

  • Replica nodes stream these WAL entries from the master and replay them against their own copy of the data.

The key insight here is that the LSN provides a way to track the progress of replication. By comparing LSNs, we can determine whether a replica has caught up with a specific write.
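An LSN is essentially a byte position in the WAL, conventionally written as two hexadecimal numbers separated by a slash (for example 0/16B3748). If the application layer wants to compare two LSNs itself, rather than asking PostgreSQL to do it (as shown later), a small helper like this sketch (an illustration, not Bitbucket’s code) can turn the textual form into a comparable integer:

def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN like '0/16B3748' into an integer that can be compared."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

# A replica has "caught up" with a write once its replayed LSN is at least the write's LSN.
assert lsn_to_int("0/16B3750") >= lsn_to_int("0/16B3748")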

2. Tracking User Writes with LSN

Bitbucket’s solution involves tracking the LSN of the most recent write for each user. Here’s the step-by-step process (a rough sketch of the middleware logic follows the list):

  1. Write Operation:

    • When a user performs a write operation, the master node commits the change and generates a new LSN.

    • The Django middleware (used by Bitbucket) captures this LSN and stores it in a Redis cache, mapping the user ID to their latest LSN.

  2. Read Operation:

    • When the same user issues a read request, the middleware retrieves the user’s latest LSN from Redis.

    • It then queries each replica for the LSN it has replayed so far (i.e., how far that replica has caught up).

    • The middleware selects a replica whose replayed LSN is greater than or equal to the user’s LSN, which guarantees that the replica already contains the user’s most recent write.

    • If no such replica is found, the read is directed to the master node.
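
Putting the write and read paths together, the middleware logic might look roughly like the sketch below. This is an illustration rather than Bitbucket’s actual code: the redis-py client, the user_lsn:<id> key naming, the TTL, and the function names are all assumptions, and the is_caught_up callback stands in for the catch-up check shown in the next section.

import redis  # assumes the redis-py client; host/port are placeholders

redis_client = redis.Redis(host="localhost", port=6379)
LSN_TTL_SECONDS = 60  # hypothetical TTL: how long a user's last write stays relevant

def record_user_lsn(user_id, master_cursor):
    """Write path: after the commit on the master, remember the LSN the user must see."""
    master_cursor.execute("SELECT pg_current_wal_lsn();")
    lsn = master_cursor.fetchone()[0]
    redis_client.set(f"user_lsn:{user_id}", lsn, ex=LSN_TTL_SECONDS)

def choose_read_node(user_id, replicas, master, is_caught_up):
    """Read path: pick a replica that already holds the user's latest write, else the master.

    is_caught_up(replica, lsn) is expected to run the catch-up query from the next
    section against that replica and return True once it has replayed past lsn.
    """
    user_lsn = redis_client.get(f"user_lsn:{user_id}")
    if user_lsn is None:
        return replicas[0]      # no recent write recorded for this user: any replica will do
    for replica in replicas:
        if is_caught_up(replica, user_lsn.decode()):
            return replica      # this replica has replayed past the user's write
    return master               # no replica has caught up yet: fall back to the master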

3. The Role of PostgreSQL Functions

To implement this, Bitbucket leans on PostgreSQL’s built-in LSN functions:

  • pg_current_wal_lsn(): Returns the current WAL write position on the master; this is the LSN to record right after a user’s write.

  • pg_last_wal_replay_lsn(): Run on a replica, returns the LSN of the last WAL record that replica has replayed, i.e., how far it has caught up.

  • pg_wal_lsn_diff(): Computes the difference between two LSNs, making it easy to check whether one position is at or ahead of another (or how far behind a replica is).

Here’s an example query, run on a replica, to check whether it has caught up to a user’s LSN (where 'user_lsn' is a placeholder for the value stored in Redis):

SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), 'user_lsn'::pg_lsn) >= 0 AS is_caught_up;

This query returns true when the replica has replayed the WAL at least up to the user’s stored LSN, which means it already contains the user’s latest write.
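
In application code, the same check could be issued against a given replica with the user’s stored LSN bound as a parameter. Here is a hedged sketch using psycopg2 (the DSN handling and function name are assumptions, not Bitbucket’s published code); it is the kind of helper the is_caught_up callback in the earlier sketch could point to:

import psycopg2  # assumes the psycopg2 driver; the replica DSN is a placeholder

def replica_is_caught_up(replica_dsn, user_lsn):
    """Return True if the replica has replayed WAL at least up to user_lsn."""
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0;",
            (user_lsn,),
        )
        return bool(cur.fetchone()[0])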

Benefits of This Approach

  1. Reduced Load on Master:

    • By directing reads to replicas that have caught up, Bitbucket significantly reduced the load on the master node.

    • According to their blog, this approach reduced the number of requests hitting the master by 50%.

  2. Improved Performance:

    • The overhead of querying Redis and checking replica LSNs is minimal (around 10 milliseconds per request).

    • This small overhead is acceptable given the performance gains and scalability benefits.

  3. Scalability:

    • By leveraging replicas more effectively, Bitbucket was able to handle millions of requests per hour without overloading the master node.

Challenges and Trade-offs

While this approach is effective, it’s not without its challenges:

  1. Redis Dependency:

    • The solution relies on Redis to store user-LSN mappings. If Redis becomes unavailable, the middleware can no longer look up a user’s latest LSN, so it must either route reads to the master (safe, but heavier) or accept possibly stale reads; a defensive fallback is sketched after this list.

  2. Replica Lag:

    • If all replicas are lagging behind, reads will be directed to the master, increasing its load.

  3. Complexity:

    • Implementing this solution requires careful coordination between the application, database, and caching layer.
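
On the Redis dependency specifically, one defensive option (my assumption, not something Bitbucket describes) is to treat a Redis failure as “we cannot prove freshness” and send that read to the master, trading extra master load for preserved consistency:

import redis  # assumes the redis-py client

ROUTE_TO_MASTER = object()  # sentinel: "we cannot tell which replica is fresh enough"

def get_user_lsn_safely(redis_client, user_id):
    """Return the user's stored LSN, None if no recent write, or ROUTE_TO_MASTER if Redis is down."""
    try:
        value = redis_client.get(f"user_lsn:{user_id}")
        return value.decode() if value is not None else None
    except redis.RedisError:
        # Redis is unreachable, so no replica can be proven to hold the user's write;
        # the caller should send this read to the master to stay consistent.
        return ROUTE_TO_MASTER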

Key Takeaways

Bitbucket’s implementation of Read-Your-Write Consistency using LSNs is a great example of how to balance consistency and performance in a distributed database. Here are the key takeaways:

  1. Use LSNs to Track Replication Progress:

    • LSNs provide a reliable way to determine whether a replica has caught up with a specific write.

  2. Leverage Caching for Metadata:

    • Storing user-LSN mappings in Redis minimizes the overhead of tracking writes.

  3. Direct Reads to Appropriate Replicas:

    • By checking replica LSNs, you can ensure that reads go to replicas with the required data, reducing the load on the master.

  4. Accept Small Overheads for Big Gains:

    • A 10-millisecond overhead per request is a small price to pay for a 50% reduction in master node load.

Conclusion

Implementing Read-Your-Write Consistency in a distributed database is no small feat, but Bitbucket’s approach demonstrates that it’s both practical and effective. By leveraging PostgreSQL’s LSNs and Redis for caching, they were able to ensure consistency while significantly improving performance and scalability.
