Replication Insights and Notes

Welcome back folks.

We are going to continue from where we left off in the last blog. We discussed Replication Logs, so now we are going to dive into the Problems with Replication Logs.

Problems with Replication Logs

Let's say you have a read heavy system, so to scale you just have to add more read replica's right? Yes. But in any system, problems arise. You cannot have sync replication of data here because then the latency will increase a lot, also if one node goes down then the writes will get blocked indefinitely.

Only async replication will work in this scenario, but the replication lag may vary from fraction of seconds to minutes. This leads to the follower falling behind the actual leader and can lead to inconsistencies in data, eventually the follower will catch up and hence this effect is called "eventual consistency". So lets see how to solve them:

Read-After-Write Consistency

Ensures a client reads the latest data version immediately after writing it
Useful when there is replication lag between the leader (where writes occur) and followers (where reads occur)

Strategies for Implementing Read-After-Write Consistency

Read from the Leader: When a user reads data they may have recently modified, read directly from the leader. Example: In social networks, users read their own profiles from the leader and others' profiles from followers
Use Last Update Time: If a user updated data within a recent timeframe (e.g., the last two seconds), read from the leader
Track Last Write Timestamp: Save the timestamp of the last write and read from the leader until the data reflects updates up to that timestamp

Monotonic Reads

Ensures a user does not see older data after having seen newer data, avoiding "time travel" effects due to replication lag
Provides a weaker guarantee than strong consistency but is stronger than eventual consistency
How to make sure the above two things happen?

Ensure the user always reads from the same replica

Consistent Prefix Reads

Ensures the correct order of write operations when there is causality in data (e.g., messages in a chat app)
Prevents anomalies like broken message sequences caused by reading from different replicas
Common Issue: Occurs more frequently in sharded databases
Solutions:
- Write causally related data to the same shard.
- Use specialised algorithms to maintain the correct order

Solving Replication Lag

Replication lag is challenging to manage in applications; do not assume replication is synchronous when it is actually asynchronous
Use Transactions: Transactions help provide stronger consistency and reduce complexity in application code

Multi-Leader Replication

Concept:
- Unlike leader-based replication (single leader for all writes), multi-leader replication allows multiple nodes to accept writes simultaneously. Each leader also acts as a follower to other leaders.
Use Cases:
- Rarely useful within a single datacenter.
- Multi-datacenter operation: Each datacenter has its own leader; leaders replicate changes asynchronously to other data centers.
  - Benefits:
    - Performance: Local writes reduce latency; network delays are hidden from users.
    - Datacenter outage tolerance: Data centers can operate independently.
    - Network problem tolerance: More resilient to network issues with asynchronous replication.
Implementation:
- Tools: Tungsten Replicator (MySQL), BDR (PostgreSQL), GoldenGate (Oracle).
- Challenges: Configuration issues (e.g., auto incrementing keys, triggers), potential data conflicts.
Specific Use Cases:
- Offline Clients: Applications (like a calendar) that work offline use local databases and synchronise later.
- Collaborative Editing: Real-time editing (e.g., Google Docs) involves local changes replicated asynchronously to the server and other users.

Handling Write Conflicts

Key Challenge:
- Conflict resolution is more complex in multi-leader replication compared to single-leader replication.
Conflict Detection:
- Synchronous: Conflicts detected immediately (single-leader).
- Asynchronous: Conflicts detected later (multi-leader).
Strategies for Conflict Handling:
- Conflict Avoidance: Route all writes for a particular record to the same leader.
- Convergent State: Ensure all replicas eventually reach the same final value.
Conflict Resolution Methods:
- Last Write Wins (LWW): Keep the write with the highest ID.
- Replica Precedence: Writes from a higher-numbered replica take precedence.
- Merge Values: Combine values in some way.
- Custom Logic: Use application code to resolve conflicts on write or read.

Multi-Leader Replication Topologies

Types of Topologies:
- All-to-All: Every leader sends writes to every other leader.
- Circular: Nodes forward writes in a circular manner.
- Star: One node forwards writes to all other nodes.
Challenges:
- Fault Tolerance: All-to-all is more fault-tolerant but may have ordering issues.
- Replication Loops: Use unique identifiers and version vectors to prevent loops.

Leaderless Replication

Concept:
- Any replica can accept writes; no leader. Writes are sent to all replicas in parallel.
Read and Write Mechanisms:
- Clients send write/read requests to multiple nodes. Version numbers determine the most recent data.
Mechanisms for Data Synchronisation:
- Read Repair: Stale replicas are updated during reads.
- Anti-Entropy Process: Background process reconciles differences between replicas.
Quorums:
- Writes must be confirmed by a minimum number (w) of nodes, and reads must query a minimum (r) number of nodes.
- Ensures up-to-date values as long as w + r > n (total replicas).

Limitations and Challenges

Sloppy Quorum: Increases write availability by allowing writes to any reachable node, even outside the designated ones.
Concurrent Write Detection: Requires knowledge of the database's conflict handling mechanism.
Version Vectors: Helps determine concurrent writes and ensures data consistency.

Key Takeaways

Multi-leader and leaderless replications offer flexibility and performance benefits but introduce complexities in conflict resolution.
Appropriate for specific use cases like multi-datacenter operations, offline clients, and collaborative applications.
Requires careful consideration of replication topology, conflict handling, and consistency mechanisms.

DDIA Chapter 5 - Replication - Part 2 - Thoughts and notes