Replication formats in databases
In this article, we are going to understand the different replication formats.
Problem statement
We have 2 database nodes.
The first one is a source node. This node is directly communicating with the application. All the writes are applied to this node.
The second one is a replica node. This node is only used for reading. All the changes to the source node are replicated to this node. The application can read data from this node for scalability.
How to replicate the data from the source to the replica?
Replication using logs
One straightforward way can be to take a dump of the source and load it into the replica. This approach would make sense if we are replicating the data just once.
However, if we want to keep the replica up-to-date with all the changes in the source, this approach can be very limiting. It’s not practical to take a dump of the source every time a change is made to it and then load it in the replica.
A more practical approach would be a system that allows us to just apply the changes made to the source over the replica. This can be achieved using the binary logs.
The binary logs contain information about all the changes made to the database such as the DDL and DML operations. It does not contain information about queries that do not modify the data such as SELECT
queries. This information can be found in the General query log.
Enabling binary logs has a minor performance impact on the database but its benefits outweigh its costs.
Since the binary logs contain the information for the different changes made to the database, we can use this information for various purposes -
Apply the incremental changes to the replicas.
Restore the database in case of failure. If the latest backup does not include the up-to-date changes, we can use the binary logs to apply the missing transactions.
Binary log formats
There are 3 formats in which binary logs can be written - statement based logging, row based logging and mixed logging.
Let us take a look at these formats in detail.
Statement based logging
Statement based logging is used for Statement based replication (SBR).
In this type of logging, each SQL statement is output into the logs. This SQL statement could be modifying a single row (such as UPDATE users SET active = false WHERE id = 1;
) or it could be modifying multiple rows (such as UPDATE users SET active = false WHERE users.last_logged_in < DATE_SUB(NOW(), INTERVAL 1 DAY);
).
Advantages -
Less logging - Since the SQL statements are logged, even if a statement is modifying 1 million rows, only 1 statement is logged.
Audit - Log files contain all the statements that made any changes. So they can be used to audit the database.
Disadvantages -
- Non-deterministic statements - Statements that can produce different results when run on different nodes, even when run with the same inputs, cannot be reliably replicated using SBR. Such statements include the statements making use of functions such as
UUID()
,NOW()
,RAND()
, etc. These functions can return different values in different replicas.
I recommend checking this documentation for more details on SBR.
Row based logging
Row based logging is used for Row based replication (RBR).
In this type of logging, the changes made to every row are logged separately even if such changes have been introduced by 1 SQL statement.
Advantages -
Safe - This is the safest form of replication. The non-deterministic statements in SBR can be replicated in a deterministic manner using RBR. This is because when non-deterministic functions are used, instead of logging the statement with those functions, the changes to the rows are logged.
Less locks - SBR can acquire more locks and for longer durations. RBR acquires lesser locks and for shorter durations. e.g. when a statement makes changes to multiple rows, in SBR while replicating such as statement, all such rows will be locked. Also if more rows are read but not modified, they will also be locked. But in RBR, the replication is done on a row-by-row basis. So only the row, that is being updated, is locked at a time.
Disadvantages -
More logs - The main disadvantage is that it writes a lot of logs. For instance, if this statement (
UPDATE users SET active = false WHERE users.last_logged_in < DATE_SUB(NOW(), INTERVAL 1 DAY);
) is modifying 1 million rows, 1 million log statements will be added as opposed to just 1 statement in SBR.Delays - This can cause delays during both replication / restoration as well as during logging as the it would take longer to read and write all these logs. This can, in turn, affect concurrency.
I recommend reading this documentation for more details on RBR.
Mixed logging
In this type of logging, the database uses statement based logging by default and switches to row based logging when unsafe statements are encountered.
References
Subscribe to my newsletter
Read articles from Shreyansh Gupta directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by