Building a Distributed Job Scheduler from scratch (Part 1)

Snehasish Roy
3 min read

Distributed job schedulers are essential because they let us schedule callbacks without worrying about scalability and reliability. You could approximate what a distributed job scheduler does with a ScheduledThreadPoolExecutor, but that won't guarantee callbacks because it offers no durability: scheduled tasks live only in memory, so if the underlying machine crashes, the thread pool - and every pending callback - dies with it.
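To make the limitation concrete, here is a minimal sketch (class and message names are mine) that schedules a callback with a plain ScheduledExecutorService; the pending task exists only in the JVM's heap, so a crash before it fires loses it silently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class InMemoryScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // The pending task lives only in this JVM's memory.
        scheduler.schedule(() -> System.out.println("Invoking callback..."), 30, TimeUnit.MINUTES);
        // If the process crashes before the 30 minutes elapse, the task is gone:
        // nothing about it was ever written to durable storage.
    }
}
```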

In this multi-part series, we're rolling up our sleeves to build a robust distributed job scheduler from scratch. Get ready to dive into the world of distributed systems!


Understanding the requirements

Before we jump into the technical details, let's establish a clear understanding of what our distributed job scheduler needs to achieve.

Job Types

Our platform should support three types of jobs (a possible data model is sketched after this list):

  • Once - These jobs need to be scheduled only once at a specified date and time, such as scheduling a job for August 1, 2023, at 23:00.

  • Repeated - Repeated jobs occur within a defined date range, with a specified time interval between each occurrence. For example, scheduling a job to run every 30 minutes between August 1 and August 31, 2023.

  • Recurring - Recurring jobs are scheduled for specific dates and times, e.g., on August 1, 2023, at 16:00, and on August 5, 2023, at 12:30.
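One possible way to model these three job types is sketched below; the type and field names are illustrative assumptions, not part of any fixed API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical data model for the three job types; names and fields are illustrative.
sealed interface Schedule permits OnceSchedule, RepeatedSchedule, RecurringSchedule {}

// Once: fire a single time, e.g. 2023-08-01T23:00.
record OnceSchedule(Instant fireAt) implements Schedule {}

// Repeated: fire every `interval` within [start, end],
// e.g. every 30 minutes between Aug 1 and Aug 31, 2023.
record RepeatedSchedule(Instant start, Instant end, Duration interval) implements Schedule {}

// Recurring: fire at each explicitly listed instant,
// e.g. 2023-08-01T16:00 and 2023-08-05T12:30.
record RecurringSchedule(List<Instant> fireTimes) implements Schedule {}
```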

How to configure callbacks?

To offer flexibility, our system must allow clients to configure various aspects of their callbacks (a sample configuration object is sketched after the list):

  • Retry strategies - Defines what happens if a callback fails. Should it be retried, and if so, what should the retry strategy entail?

  • Auth token - Provide an authentication token for client-side verification during callbacks.

  • Callback Path/URL - The actual HTTP URL where the callback will be made.

  • Headers to pass - Any custom application headers to pass during the callback.

  • Success status codes - How should we decide whether a callback succeeded? Relying solely on HTTP 200 won't suffice for all client use cases.

  • Relevancy window - Defines the maximum acceptable delay for callback execution. For example, if a callback is expected at 13:00 but the job only gets picked up at 13:30, is it still valid? The client answers this by configuring a relevancy window: if the delay is within the window (say, 30 minutes or less), the callback is performed; otherwise, it is skipped.
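Putting these knobs together, a callback configuration might look roughly like the sketch below. The field names, the retry-policy shape, and the relevancy-window check are assumptions for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;

// Illustrative callback configuration; this is not a final schema.
record CallbackConfig(
        String callbackUrl,               // HTTP URL where the callback is made
        String authToken,                 // token the client uses to verify the caller
        Map<String, String> headers,      // custom application headers to pass along
        Set<Integer> successStatusCodes,  // e.g. {200, 202}; anything else counts as a failure
        RetryPolicy retryPolicy,          // what to do when a callback attempt fails
        Duration relevancyWindow          // maximum tolerated delay before the callback is skipped
) {
    // Relevancy check: expected at 13:00, picked up at 13:30, window of 30 minutes -> still valid.
    boolean isStillRelevant(Instant expectedAt, Instant pickedUpAt) {
        return Duration.between(expectedAt, pickedUpAt).compareTo(relevancyWindow) <= 0;
    }
}

// A simple retry policy: bounded attempts with exponential backoff.
record RetryPolicy(int maxAttempts, Duration initialBackoff, double backoffMultiplier) {}
```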

With these functional requirements in mind, let's move on to considering some non-functional aspects that will shape our system.


Durability

If a client has received a successful acknowledgment that a job was accepted by our platform, the job details must already be persisted in durable storage - an acknowledged job must survive machine crashes.
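In practice this means the acknowledgment is sent only after the durable write has succeeded. A rough sketch of that submission path, with hypothetical JobStore and JobRequest abstractions, follows:

```java
import java.util.UUID;

// Hypothetical submission flow: acknowledge only after the durable write has succeeded.
public final class JobSubmissionService {
    private final JobStore jobStore; // backed by durable storage (the topic of part 2)

    public JobSubmissionService(JobStore jobStore) {
        this.jobStore = jobStore;
    }

    public String submit(JobRequest request) {
        String jobId = UUID.randomUUID().toString();
        // 1. Persist the job details; if this throws, the client receives an error, not an ack.
        jobStore.save(jobId, request);
        // 2. Only now return the acknowledgment - an acked job can survive a crash.
        return jobId;
    }
}

// Placeholder abstractions so the sketch is self-contained.
interface JobStore { void save(String jobId, JobRequest request); }
record JobRequest(String payload) {}
```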

Scalability

Design the various components of the system with scalability in mind. Keep them loosely coupled so that each can be scaled independently of the others.

Callback Guarantees

In a distributed system it's difficult to make strong delivery guarantees - exactly-once in particular - so we'll settle for at-least-once delivery, i.e., ensure at least one callback to the client for every scheduled job. Clients might receive duplicate callbacks, and it's up to them to detect duplicates and decide whether or not to process them.
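On the client side, at-least-once delivery typically translates into an idempotency check keyed by something like a job-execution ID. A minimal sketch, assuming such an ID is passed with every callback:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of client-side deduplication for at-least-once callbacks.
// A real client would persist processed IDs (e.g. in its own database) rather than in memory.
public final class CallbackHandler {
    private final Set<String> processedExecutionIds = ConcurrentHashMap.newKeySet();

    public void onCallback(String executionId, String payload) {
        // add() returns false if the ID was already present, i.e. this is a duplicate delivery.
        if (!processedExecutionIds.add(executionId)) {
            return; // treat the duplicate as a no-op
        }
        process(payload);
    }

    private void process(String payload) {
        // business logic for the callback goes here
    }
}
```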


Conclusion

Congratulations on making it this far! In this first part of our tutorial series on building a distributed job scheduler, we've outlined the essential functional and non-functional requirements. Let's pause and think about how we are going to implement a job scheduler based on these requirements. I won't jump straight into drawing boxes and assigning responsibilities to them - instead, we will first figure out what kind of work we actually have to do, and then decide whether we need a dedicated component for each kind of task.

In the next installment, we will start by constructing a durable storage system that can persist job details efficiently and allow for quick lookups based on job IDs.

Link to part 2: https://snehasishroy.com/building-a-distributed-job-scheduler-from-scratch-part-2

