Replication Explained: DDIA Chapter 5 Insights

Why replicate db data?

keep data close to customer (improve latency)
high availability
increased read throughput

Leaders and followers - who are they?

Leaders - well, the name suggests the meaning, all writes to the leader node
Followers - well, the name suggests the meaning, all reads go to the follower node

It is leader node's job to keep the followers up to date with the current data changes

Sync vs Async replication

So the leader can update the followers in a synchronous manner like when a row is updated, it tells the followers to update that row and until all the followers update that row the user will have to wait for the success response.

In async update the leader updates the row but doesn't wait for followers to do the same and sends the success response to the user.

Semi-sync approach is more better here, you have sync updates to 1-2 followers and the rest followers update async, this is how you maintain data durability and also the latency.

Setting up new followers

Setting up a follower can usually be done without downtime. The process looks like this:

Take a snapshot of the leader’s database
Copy the snapshot to the follower node
Follower requests data changes that have happened since the snapshot was taken
Once follower processed the backlog of data changes since snapshot, it has caught up.

Handling Node Outages

Follower Failure: Catchup Recovery
- Follower reconnects to the leader
- Requests all data changes missed while it was disconnected
Leader Failure: Failover
- A follower is promoted to a new leader
- Clients are reconfigured to direct writes to the new leader
- Other followers start consuming data changes from the new leader
Automatic Failover Steps
- Detect Leader Failure: A node is considered dead if it does not respond within a specific period
- Select a New Leader: The best candidate is the replica with the most up-to-date changes from the old leader
- Reconfigure the System: Ensure the old leader becomes a follower and recognises the new leader
Potential Issues that might occur during automatic failover
- Conflicting Writes: Asynchronous replication may result in the new leader receiving conflicting writes.
- Data Coordination: Discarding writes can be risky if other systems need to be synchronised with the database
- Split-Brain Scenario: Two nodes might mistakenly believe they are the leader, causing data loss or corruption
- Timeout Determination: Deciding the right time before declaring a leader dead can be challenging

Implementation of Replication Logs

Statement-Based Replication

The leader logs every SQL statement (e.g., INSERT, UPDATE, DELETE) and sends it to its followers.
Problems with Statement-Based Replication
- Non-deterministic Functions: Functions like NOW() or RAND() produce different values on replicas.
- Order Dependency: Statements relying on existing data (e.g., auto-increments) must execute in the same order on all replicas.
- Side Effects: Statements with side effects may result in different outcomes on each replica.
Solution
- Replace nondeterministic functions with fixed values on the leader.

Write-Ahead Log (WAL) Shipping

An append-only sequence of bytes records all writes to the database. The leader sends this log to its followers.
Used in PostgreSQL and Oracle
Disadvantages of WAL Shipping
- Low-Level Data Description: The log describes data changes at a low level (specific bytes and disk blocks), tying it closely to the storage engine.
- Version Incompatibility: Running different versions of the database on leaders and followers is usually not possible, impacting operations and complicating zero-downtime upgrades.

Logical (Row-Based) Log Replication

A sequence of records describing writes to database tables at the row level:
- Inserted Row: Contains new values for all columns.
- Deleted Row: Identifies the unique information for the deleted row.
- Updated Row: Contains unique row identifier and new values for columns.
Used by MySQL binlog.
Advantages of Logical Log Replication
- Decoupled from Storage Engine: Easier to maintain backward compatibility.
- Ease of Parsing: Beneficial for external applications like data warehouses, custom indexes, and caches (change data capture).

Trigger-Based Replication

Moves replication to the application layer.
A trigger executes custom application code automatically upon data changes, logging changes to a separate table for external processing.
Disadvantages of Trigger-Based Replication
- Overhead and Complexity: Higher overheads and more prone to bugs.
- Flexibility: Can be useful for specific use cases requiring custom logic.

I'll have to break this chapter into 3 parts otherwise the blog will become too long and that will be problematic for the goldfish to read as they have a small attention span.

DDIA - Chapter 5 - Replication - thoughts and notes - Part 1