Continuing from learnings from previous blog on tackling the thundering herd problem, which discussed the viral incident where 13 million people attempted to book Coldplay tickets, it’s clear that such high-demand events shows the need for distributed systems to manage shared resources.

In situations where numerous users compete for a limited number of tickets, distributed locks are essential. They ensure that only one transaction can be processed at a time, preventing conflicts and double bookings.

In this article, we’ll explore distributed locks, how they differ from traditional locks used in multithreaded systems, and the role of fencing tokens in addressing these challenges.

What are Distributed Locks ?

A distributed lock is a way to control access to a shared resource in a system where multiple servers or processes are working together. Imagine you have a shared resource that two people want to edit at the same time. If both try to make changes simultaneously, they might mess up the document.

A distributed lock ensures that only one person can edit the document at a time. When one server wants to write to the resource, it asks for a "lock." If it gets the lock, it can proceed to make changes. If another server requests the lock while the first one has it, it has to wait until the lock is released. This way, the system prevents conflicts and keeps the data safe and consistent.

Properties of a Distributed Lock

Mutual Exclusion: Only one process can hold the lock at a time. This means only one process can access the shared resource, preventing conflicts.
Fault Tolerance: If a process crashes while holding the lock, the system should automatically release the lock so other processes can use it.
Deadlock Freedom: The system should avoid situations where two processes wait forever for each other to release locks.
Lease or Expiration: Locks should have a time limit. If a process doesn't release the lock in time, it will expire, allowing others to use it.
Consistency: ensures that all processes have the same view of the lock state. If one process acquires the lock, other processes should immediately see that the lock is held

Before we dive deep, some important terms, Leases and Tokens!

Leases and Tokens

What is a Lease?

A lease in the context of distributed locking refers to a time-limited permission granted to a process or node to hold a lock on a resource. When a node acquires a lease, it is given exclusive access to the resource for a specified period. If the node fails to release the lock within this timeframe, the lease automatically expires, allowing other nodes to access the resource.

Key Characteristics of Leases:

Time-Bound: Leases are typically associated with a duration, after which the lock is automatically released if not renewed.
Automatic Release: If the node holding the lease crashes or becomes unresponsive, the lease expiration prevents indefinite blocking of access to the resource.
Renewable: Nodes can renew their leases before expiration if they still require access to the resource.

What are Tokens?

Tokens are unique identifiers issued by the lock service when a lock is acquired. In the context of fencing, tokens ensure that operations on a shared resource are valid and up-to-date. Each time a node acquires a lock, it receives a token that indicates the current state of the lock.

Key Characteristics of Tokens:

Monotonically Increasing: Tokens are typically designed to be strictly increasing, meaning that each new token is higher than the previous ones. This property helps to maintain a clear order of operations.
Validation Mechanism: Tokens are used to verify the legitimacy of a lock holder before allowing it to perform operations. If a node attempts to operate using an outdated token, the operation is rejected.
Uniqueness: Each token is unique to a particular lock acquisition, providing a clear reference point for the state of the lock.

Now that we know some basics, lets understand whether a lock in single node multithreaded application safe.

Is the Lock entirely Safe?

It’s important to remember that a lock in a distributed system is not like in a multi-threaded application. It's more complex because different nodes and the network can fail independently in various ways.

Consider a simple example where are we are trying to lock a resource:

For example, consider an application where clients need to update files in shared storage systems (like S3).

The process works as follows:

Each client must first obtain a lock,
Proceed to read the file, make their necessary modifications, save the updated file back to storage
Finally release the lock.

This locking mechanism is crucial as it prevents multiple clients from executing this read-modify-write sequence simultaneously, which would otherwise lead to lost updates.

public void writeDataToStorage(String filename, String data) {
    var lock = lockService.acquireLock(filename);
    if (lock == null) {
        throw new RuntimeException("Failed to acquire lock");
    }

    try {
        var fileContent = storage.readFile(filename);
        var updatedContent = updateContents(fileContent, data);
        storage.writeFile(filename, updatedContent);
    } finally {
        lock.release();
    }
}

Well even with perfect lock, the code above is broken.

Lets look at below diagram, this is snippet for Martin Kleppmann’s blog ( We all know him as author of DDIA book )

Here's the sequence of events:

Node 1 initiates by trying to get the lock from the Lock Service with a lease period
Lock Service grants the lock to Node 1 with an "OK" response
Node 1 enters a Garbage Collector Pause state ( Stop the world ), which temporarily freezes its operations. During Node 1's GC pause, the lock expires (shown by the "expired" arrow)
In the same during Node 1's GC pause, Node 2 attempts to acquire the lock from the Lock Service
Node 2 successfully gets the lock ("OK" response) since Node 1's lock expired
Node 2 proceeds to write data to Storage.
Storage confirms the write operation with an "OK" response
Now GC Pause has completed, however Node 1 still assumes that it has lock and proceeds to write data to storage.

This is particularly dangerous because:

Node 1 has no way to know that its lock expired during the GC pause.
Node 1 might proceed with operations believing it has exclusive access.
Multiple nodes could end up making concurrent changes to the same resource.
This breaks the mutual exclusion guarantee that locks are supposed to provide.

Fencing Token to the Rescue!

A fencing token is a monotonically increasing number issued by the lock service whenever it grants a lock, which must be included with any operation on the resource. In simple terms:

It is a unique identifier (often a numeric value or timestamp) associated with each lock acquisition.
When a client acquires a lock, it also receives a fencing token.
This token is used to verify whether the lock holder is still valid before performing operations.

Lets look at how fencing token solves the problem mentioned above:

Node 1 tries to get the lock from the Lock Service
Lock Service grants the lock to Node 1 with Response: "OK (token: 98)"
Node 1 enters a Garbage Collector Pause state. Node 1 is now "frozen" but still holds token 98. During Node 1's GC pause, the lock expires. The "time expired" event invalidates Node 1's lock.
Node 2 requests the lock from Lock Service
Lock Service responds with "OK" and a higher token: 99. Note: Token 99 > Token 98 (monotonically increasing)
Node 2 attempts to write data to Storage with token 99
Storage accepts Node 2's write because token 99 is valid (highest seen)
After GC pause, Node 1 attempts to write data with its old token 98, since it is still under assumption that lock is still held
Storage rejects this write because token 98 < 99.

Note that this means the storage server needs to check the tokens and refuse any writes with tokens that are older than the latest one. As long as the lock service gives out tokens that always increase, the lock remains safe.

There are multiple services which provides a way to implement the same.

Apache Zookeeper: Zookeeper is a distributed coordination service that can be used to implement fencing through ephemeral nodes and versioning, helping to ensure that only one process can access a resource at a time.
etcd: A distributed key-value store that can be used for configuration management and service discovery. It supports lease-based locking, which can serve as a fencing mechanism.
Consul: HashiCorp's Consul provides service discovery and health checking, with support for distributed locking via session-based mechanisms that can act as fencing tokens.

Conclusion

In distributed systems, effective management of shared resource access is vital for maintaining data integrity. While distributed locks help ensure mutual exclusion, challenges arise with node failures and delays. Fencing tokens enhance this mechanism by validating the lock holder before allowing operations, preventing outdated actions from causing conflicts. By implementing these strategies, systems can achieve robust and reliable data management.

References:

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

Mastering Distributed Locks: Consistency using Fencing Tokens