Cluster Singletons using ShedLock
Overview
Imagine you have a service deployed as multiple instances across the globe that needs to run scheduled tasks exclusively, one at a time, as cluster singletons.
How would you ensure that such a task, deployed on multiple nodes, runs on only one node at any given time in a distributed environment?
This is a fairly common requirement for batch ingest, sending user notifications, performing cleanups and other data housekeeping operations.
It suggests using some kind of distributed locking mechanism to help coordinate actions across a network of machines (nodes). Unfortunately, distributed locks are an unsafe concept in a system of independent processors interacting over an asynchronous network. The problem is that you can't tell the difference between a slow processor and a failed one, a difficulty closely related to the Two Generals' Problem.
There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton
Cache invalidation, naming things, exactly-once delivery and distributed locks are ultimately about the same tradeoffs, which are conceptualized by the CAP theorem.
A distributed lock service needs to provide both safety and liveness. A safety property asserts that nothing bad happens during execution. A liveness property asserts that something good eventually happens. Mapped to lock semantics, a safety violation would be handing out the same lock to multiple clients at the same time. Then it wouldn't really be a lock anymore, in violation of its contract.
A lack of liveness could be exemplified by having to wait indefinitely for a claimed lock to be released. For example, a client can hold a lock for an arbitrarily long period of time and prevent other clients from acquiring the lock and making forward progress. That's the expected behavior. However, what if that client has just gone fishing or is stuck doing nothing? Then nobody can make any progress until the lock is eventually released.
Adding a lock timeout or putting a TTL (auto-timeout) on the lock doesn't fix the problem since (again) we can't tell the difference between a client that has gone fishing and one that is just taking a very long time doing its task.
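To make the timeout problem concrete, here is a minimal in-memory sketch of a lease (TTL) lock. The LeaseLock class and its methods are hypothetical and exist only for illustration; time is passed in explicitly so that the outcome is deterministic.

```java
public class LeaseLockDemo {

    /** Hypothetical in-memory lease lock; not part of any library. */
    static final class LeaseLock {
        private final long ttlMillis;
        private String holder;   // null until the lock is first acquired
        private long expiresAt;  // time at which the current lease ends

        LeaseLock(long ttlMillis) {
            this.ttlMillis = ttlMillis;
        }

        /** Grants the lock if it is free or the current lease has expired. */
        synchronized boolean tryAcquire(String client, long nowMillis) {
            if (holder != null && nowMillis < expiresAt) {
                return false; // lease still valid, refuse
            }
            holder = client;
            expiresAt = nowMillis + ttlMillis;
            return true;
        }

        synchronized String holder() {
            return holder;
        }
    }

    public static void main(String[] args) {
        LeaseLock lock = new LeaseLock(5_000);

        boolean a = lock.tryAcquire("A", 0);          // A acquires at t=0
        boolean bEarly = lock.tryAcquire("B", 1_000); // refused, lease still valid
        // A stalls (GC pause, slow network) past t=5000 without finishing its task
        boolean bLate = lock.tryAcquire("B", 6_000);  // granted, lease expired

        System.out.println(a + " " + bEarly + " " + bLate); // true false true
        System.out.println(lock.holder());                  // B
    }
}
```

When A wakes up it still believes it holds the lock, so both A and B carry on with the task. The lock service did nothing wrong by its own rules; it simply had no way of knowing whether A was dead or slow.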
To preserve safety and deliver a fault-tolerant locking service, one option is to rely on a storage system with transactional guarantees and a monotonically increasing fencing token. A fencing token is simply a number that increases every time a lock is acquired. This token can then be used in a CAS (compare-and-swap) operation to reject expired tokens that may come back later and haunt the lock service.
This also requires the component that the lock protects to become a stateful observer and part of the protocol, responsible for rejecting old tokens. It is the type of approach you find in Google's Chubby, etcd and more recently Hazelcast, with slightly different terminology.
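The fencing idea can be sketched in a few lines. This is an illustrative model only, not how Chubby, etcd or Hazelcast implement it: LockService and FencedResource are hypothetical names, and a synchronized compare-and-set stands in for the CAS operation of a real storage system.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FencingDemo {

    /** Hypothetical lock service: every grant comes with a strictly increasing token. */
    static final class LockService {
        private final AtomicLong counter = new AtomicLong();

        long acquire() {
            return counter.incrementAndGet();
        }
    }

    /** Hypothetical protected resource: remembers the highest token it has seen. */
    static final class FencedResource {
        private long highestToken = 0;
        private String value = "";

        /** Accepts the write only if the token is not older than the newest one seen. */
        synchronized boolean write(long token, String newValue) {
            if (token < highestToken) {
                return false; // stale lock holder, reject
            }
            highestToken = token;
            value = newValue;
            return true;
        }

        synchronized String value() {
            return value;
        }
    }

    public static void main(String[] args) {
        LockService locks = new LockService();
        FencedResource resource = new FencedResource();

        long tokenA = locks.acquire(); // client A gets token 1, then stalls
        long tokenB = locks.acquire(); // A's lock expired, client B gets token 2

        System.out.println(resource.write(tokenB, "written by B")); // true
        System.out.println(resource.write(tokenA, "written by A")); // false
        System.out.println(resource.value());                       // written by B
    }
}
```

The important design point is that the check lives in the resource, not in the lock service: a stalled client can wake up at any time, so only the component receiving the write can reliably turn it away.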
In summary, if you need scheduled tasks to run exclusively as singletons rather than concurrently, then a cluster singleton lock that spans more than a single processor is your friend.
For the rest of this article we are going to look at ShedLock with CockroachDB, but without fencing tokens since ShedLock doesn't support them.
A geo-distributed database such as CockroachDB has the necessary mechanisms to manage consensus across a cluster, and that's what we are going to leverage for task scheduling locks.
Introduction to ShedLock
In this example we are going to use a small Java library called ShedLock along with Spring's built-in scheduling mechanism. Spring scheduling only provides scheduling, not locking of any kind. For that we will use ShedLock, which provides the simple locking mechanism needed to ensure cluster-singleton task execution. ShedLock, on the other hand, does not do any scheduling, only locking.
ShedLock provides an API for acquiring and releasing locks as well as connectors to a wide variety of lock providers such as Etcd, Zookeeper, Consul, Cassandra, Redis, Hazelcast and virtually any SQL database through JDBC.
As a side note, there are also plenty of distributed schedulers, such as Quartz and Chronos on Mesos, that achieve similar goals.
These mechanisms combined, along with CockroachDB as a distributed SQL database, provide a simple and easy-to-use cluster singleton task framework. CockroachDB's strong consistency, fault tolerance and distributed coordination provide a solid foundation for the underlying mutex mechanism. A CockroachDB cluster can span multiple regions and it provides the same transactional guarantees and strong consistency properties regardless of the deployment topology.
Limitations
One caveat with ShedLock is that it relies on lock timeouts. If a method holding a lock exceeds the set time limit and the lock expires, then it can be claimed by another thread or process.
Although this is documented behavior, it violates the safety property: the same lock can be handed out to multiple threads.
It means that a paused or busy method that exceeds the timeout can come back to life and cause duplicate side effects, since it's no longer holding the lock.
There is a KeepAliveLockProvider that can extend the lock period if needed, but it doesn't solve the underlying problem; it only postpones the expiry. Lock timeouts are a problem that can be solved by using fencing tokens, as described earlier.
Rule of thumb for ShedLock:
You have to set lockAtMostFor to a value which is much longer than normal execution time. If the task takes longer than lockAtMostFor the resulting behavior may be unpredictable (more than one process will effectively hold the lock).
So be aware of this limitation before using ShedLock for anything serious in a production environment.
Source Code
The source code for this example is available on GitHub.
Configuration
To use ShedLock with Spring Boot, first add the Maven dependencies:
<dependency>
    <groupId>net.javacrumbs.shedlock</groupId>
    <artifactId>shedlock-spring</artifactId>
    <version>4.42.0</version>
</dependency>
<dependency>
    <groupId>net.javacrumbs.shedlock</groupId>
    <artifactId>shedlock-provider-jdbc-template</artifactId>
    <version>4.42.0</version>
</dependency>
Next, create a database table that will store information about the locks:
create table shedlock
(
    name       varchar(64)  not null,
    lock_until timestamp    not null,
    locked_at  timestamp    not null,
    locked_by  varchar(255) not null,
    primary key (name)
);
Next, define the lock provider which will be using JDBC:
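import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;

import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
public class LockingConfiguration {
    @Bean
    public LockProvider lockProvider(DataSource dataSource) {
        return new JdbcTemplateLockProvider(JdbcTemplateLockProvider.Configuration.builder()
                .withJdbcTemplate(new JdbcTemplate(dataSource))
                .usingDbTime() // use the database clock rather than each node's local clock
                .build()
        );
    }
}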
In the configuration above, notice @EnableScheduling and @EnableSchedulerLock, the annotations that enable scheduling and scheduler locking. The defaultLockAtMostFor attribute sets the default upper bound on how long a lock is held, 10 minutes in this example.
Usage
Now we are ready to annotate the methods we would like to run following a schedule, and execute as cluster singletons. These methods will run exclusively one at a time (per method) no matter how many application instances or threads per instance we use.
For this purpose we are using the @Scheduled and @SchedulerLock annotations.
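import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;

@Service
public class ProductNewsFeed {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    // Application service from the demo project
    @Autowired
    private ProductService productService;

    // Triggered every 30 seconds on every instance; only the lock holder runs the body
    @Scheduled(cron = "0/30 * * * * ?")
    @SchedulerLock(lockAtLeastFor = "1m", lockAtMostFor = "5m", name = "publishNews_clusterSingleton")
    public void publishNews() {
        logger.info(">> Entered cluster singleton method");
        // gone fishing
        logger.info("<< Exiting cluster singleton method");
    }
}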
The @Scheduled annotation makes Spring create an underlying scheduled task that executes this method at the specified interval. It uses a cron expression, which in this case means every 30 seconds.
The @SchedulerLock annotation is used by ShedLock to acquire a lock before the method is invoked. The name must be unique since it's used as the key in the lock table. The parameter lockAtLeastFor is optional and means the lock is held for at least 1 minute, even if the method finishes sooner. The parameter lockAtMostFor is also optional (it falls back to defaultLockAtMostFor) and means the lock is held for at most 5 minutes.
If the method execution goes beyond this time, the lock is up for grabs by another instance or thread. Otherwise the lock is released when the method returns, subject to lockAtLeastFor.
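The interplay between the two parameters can be illustrated by computing the value that ends up in the lock_until column. This is a simplified model based on my understanding of ShedLock's behavior, not its actual code; the class and method names are made up for the example.

```java
import java.time.Duration;
import java.time.Instant;

public class LockWindowDemo {

    /** lock_until written when the lock is acquired: the latest possible expiry. */
    static Instant lockUntilOnAcquire(Instant lockedAt, Duration lockAtMostFor) {
        return lockedAt.plus(lockAtMostFor);
    }

    /** lock_until written on release: now, or lockedAt + lockAtLeastFor if that is later. */
    static Instant lockUntilOnRelease(Instant lockedAt, Instant now, Duration lockAtLeastFor) {
        Instant earliest = lockedAt.plus(lockAtLeastFor);
        return earliest.isAfter(now) ? earliest : now;
    }

    public static void main(String[] args) {
        Instant lockedAt = Instant.parse("2024-01-01T12:00:00Z");
        Duration atLeast = Duration.ofMinutes(1);
        Duration atMost = Duration.ofMinutes(5);

        // While the task runs, other nodes see the lock as held until 12:05
        System.out.println(lockUntilOnAcquire(lockedAt, atMost));
        // prints 2024-01-01T12:05:00Z

        // Task finishes after 10 seconds: the lock stays held until the 1-minute mark
        System.out.println(lockUntilOnRelease(lockedAt, lockedAt.plusSeconds(10), atLeast));
        // prints 2024-01-01T12:01:00Z

        // Task finishes after 2 minutes: the lock is released right away
        System.out.println(lockUntilOnRelease(lockedAt, lockedAt.plusSeconds(120), atLeast));
        // prints 2024-01-01T12:02:00Z
    }
}
```

Under this model, a 30-second cron schedule combined with lockAtLeastFor = "1m" means that every other trigger finds the lock still held, so the task effectively runs at most once per minute across the cluster.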
Demo
First build the demo app:
git clone git@github.com:kai-niemi/roach-spring-boot.git
cd roach-spring-boot
./mvnw clean install
cd spring-boot-locking/target
Start first instance:
nohup java -jar spring-boot-locking.jar --server.port=8090 > locking-1-stdout.log 2>&1 &
Start second instance:
nohup java -jar spring-boot-locking.jar --server.port=8091 > locking-2-stdout.log 2>&1 &
Start third instance:
nohup java -jar spring-boot-locking.jar --server.port=8092 > locking-3-stdout.log 2>&1 &
You should then see the task's log messages (such as ">> Entered cluster singleton method") in only one of the logs at any given time. You can tailor the timeout so that the task exceeds the time limit, at which point the lock guarantee goes out the window.
Conclusion
In this article, we learned how to create and synchronize cluster singleton scheduled tasks using ShedLock and CockroachDB. ShedLock does come with the caveat that it relies on lock timeouts, which are an unsafe construct.