Why use HashiCorp's Nomad?

Gregor Soutar

I’ve recently started to work on a new (to me) project that will make use of the European Southern Observatory’s (ESO’s) new Extremely Large Telescope (ELT) instrument control software framework (IFW). I’m currently trying to understand various parts of this epic software infrastructure. A core part seems to be the use of HashiCorp’s Nomad. My aim here is to arrive at a basic understanding of what Nomad does and make sense of why Nomad was chosen.

What is Nomad?

Nomad is a flexible scheduler and workload orchestrator that enables you to deploy and manage any application across on-premise and cloud infrastructure at scale.

That’s the official definition. To put it more simply, Nomad is like a logistics manager for software. As a developer, you define what your application needs and how it should run, and Nomad (the logistics expert) decides how and where to deploy it across your available infrastructure.

Advertised Features

Efficient resource usage
Nomad makes smart use of your available infrastructure by packing tasks together on machines in a way that avoids waste. This approach, known as bin packing, helps ensure resources like CPU and memory are fully used without overloading any one system.
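
To give a rough sense of what the scheduler packs against, every task declares its resource needs in a `resources` block along these lines (the numbers here are just placeholder values):

```hcl
# Per-task resource request; Nomad bin-packs tasks onto clients
# based on these declared figures.
resources {
  cpu    = 500 # MHz
  memory = 256 # MB
}
```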

Self-healing
Nomad keeps an eye on all running tasks. If something crashes or stops responding, it automatically restarts or reschedules it elsewhere, helping to keep your services available and resilient.
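
How aggressively this happens is configurable. As a rough sketch, a group's `restart` block (which controls local retries before Nomad gives up) might look like this, with illustrative values:

```hcl
# Retry a failed task up to 3 times within 10 minutes, waiting 15s
# between attempts; after that, mark it failed so it can be
# rescheduled onto another client.
restart {
  attempts = 3
  interval = "10m"
  delay    = "15s"
  mode     = "fail"
}
```

A separate `reschedule` block then controls how Nomad places the failed work on a different client.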

Zero downtime deployments
Nomad supports safe deployment strategies that prevent user disruption. Rolling updates replace old versions with new ones gradually. Blue/green deployments run the new version alongside the old one until it is ready to fully take over. Canary deployments roll out changes to a small portion of users first, allowing issues to be caught early before affecting everyone.
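
These strategies are driven by the job's `update` block. A minimal sketch of a rolling update with a single canary (the values are illustrative, not recommendations):

```hcl
# Replace one allocation at a time, starting with a single canary.
# An allocation must be healthy for 30s to count as updated, and a
# failed deployment is rolled back automatically.
update {
  max_parallel     = 1
  canary           = 1
  min_healthy_time = "30s"
  healthy_deadline = "5m"
  auto_revert      = true
  auto_promote     = false
}
```

With `auto_promote = false`, the canary is promoted manually (via `nomad deployment promote`) once you are happy with it.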

Supports many workload types
Nomad is highly flexible and can run a wide range of workloads: containers via Docker, Java applications packaged as JAR files, virtual machines via QEMU, and even raw scripts and system commands.
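
In practice only the task's driver and its config change; the rest of the job stays the same. Two illustrative task fragments (the image name and script path are placeholders):

```hcl
# A containerised task...
task "web" {
  driver = "docker"

  config {
    image = "nginx:1.25"
  }
}

# ...and a plain executable run through the exec driver.
task "cleanup" {
  driver = "exec"

  config {
    command = "/usr/local/bin/cleanup.sh"
  }
}
```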

Cross-platform and portable
Nomad runs as a single lightweight binary and works across Linux, Windows, and macOS. You can use it to manage workloads running in your data centre, in the cloud, or at the edge – all from one system.

Simple, consistent job setup
You describe each application using a clear job file, where you define what the app is, how it should run, where it should run, and how it connects to other services. This means whether you're running a container, a VM, or a script, the process for setting it up in Nomad is always the same.
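
To make that concrete, here is a minimal sketch of such a job file (jobspecs are written in HCL; the names, image, and numbers are just placeholders):

```hcl
job "webapp" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2   # run two copies of this group

    network {
      port "http" {
        to = 80   # map the allocated port to port 80 in the container
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"
        ports = ["http"]
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```

Swap the driver and its config block and the same file structure describes a VM, a JAR, or a shell script instead of a container.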

Key Terms

The following image was taken from an introductory Nomad tutorial, and I think it gives a good overview of the core elements of the Nomad infrastructure.

Diagram illustrating the Nomad cluster terms

Setup Terms

Agent
An agent is a Nomad process that runs either as a server or a client. It forms the core of any Nomad deployment.

Client
A Nomad client runs the actual tasks. It registers itself with the servers and waits for work to be assigned. Clients are often called nodes, especially when discussing the infrastructure.

Server
Nomad servers handle all job scheduling and manage the clients. They decide where tasks should run and monitor the overall health of the system.
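
Since it is the same binary in every case, the role an agent plays comes down to its configuration file. A rough sketch of the two cases (paths and addresses are placeholders):

```hcl
# server.hcl - this agent schedules work and manages the cluster
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3   # how many servers to wait for before electing a leader
}
```

```hcl
# client.hcl - this agent registers with the servers and runs tasks
data_dir = "/opt/nomad/data"

client {
  enabled = true
  servers = ["10.0.0.10:4647"]   # address of a Nomad server
}
```

Either file is then passed to the same binary with `nomad agent -config=<file>`.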

Development agent
A development agent is a special configuration used for local testing or learning. It runs as both a server and a client on the same machine and does not save any data to disk. This means it always starts in a clean, predictable state, making it ideal for quick experiments or demos.
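
If you want to try this yourself, `nomad agent -dev` starts exactly this kind of agent on your own machine, and its web UI is then available at http://localhost:4646.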

Operational Terms

Task
A task is the smallest unit of work in Nomad. It runs through a task driver such as Docker or Exec, which allows Nomad to support different types of workloads. Each task defines which driver it needs, along with its configuration, constraints, and required resources.
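
Pulling those pieces together, a single task stanza might look like this (the task name, image, and values are made up for illustration):

```hcl
task "archiver" {
  driver = "docker"

  config {
    image = "registry.example.org/archiver:1.0"
  }

  # A constraint restricting where this task may be placed.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  resources {
    cpu    = 200 # MHz
    memory = 128 # MB
  }
}
```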

Group
A group is a collection of tasks that are run together on the same client. Tasks in a group share resources and are scheduled onto the same machine.

Job
A job is the main way to define and manage an application in Nomad. It includes everything Nomad needs to know to run one or more tasks, such as configurations, constraints, and deployment rules.

Job specification
Also called a ‘jobspec’, this is the complete definition of a job. It includes details such as the type of job, its tasks and resource needs, where it can run, and how it should behave. It’s written in a clear, structured format (HashiCorp Configuration Language, HCL) so Nomad can understand and act on it.
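
In practice a jobspec lives in a file such as `example.nomad.hcl`; `nomad job plan example.nomad.hcl` shows what the scheduler would do without changing anything, and `nomad job run example.nomad.hcl` actually submits it.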

Allocation
An allocation is the result of Nomad placing a job on a specific client. It represents the link between a task group and the machine where it runs. When a job is launched, Nomad selects a suitable client and reserves resources on it for the job's tasks.
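
Allocations are easy to inspect from the command line: `nomad job status <job>` lists a job's allocations, and `nomad alloc status <alloc-id>` or `nomad alloc logs <alloc-id>` drills into an individual one.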

Why Use Nomad?

Having worked on MOONS instrument control software, I am acutely aware of the number of (to use Nomad jargon) ‘tasks’ that will be required of the system, and that they may be distributed across several instrument or detector control workstations. Immediately, the use-case for Nomad becomes clear. As a method to define, deploy, manage, and oversee these hundreds of tasks, it seems like a wise choice. The idea that Nomad can automatically handle task distribution, restart failed processes, and ensure smooth deployments across multiple machines makes a lot of sense, especially for something as complex as the ELT system. Given all these capabilities, Nomad feels like it could play a crucial role in ensuring everything runs smoothly across these workstations on future Very Large Telescope (VLT) and ELT instruments.

Written by

Gregor Soutar

Software engineer at the UK Astronomy Technology Centre, currently developing instrument control software for MOONS, a next-generation spectrograph for the Very Large Telescope (VLT).