NATS Cluster Architectures: Regional Clusters - Building Reliable Messaging Foundations

Joshua Steward
8 min read

Why does NATS topology matter?

NATS topology refers to the way your NATS servers are arranged and connected. Unlike many messaging systems with more rigid deployment models, NATS offers remarkable flexibility. You're not shoehorned into a one-size-fits-all approach; instead, you become the architect of your messaging fabric.

Your choice of topology, whether a single server, a resilient cluster, leaf nodes, or a supercluster, will directly shape fault tolerance, message delivery guarantees, network segmentation, and overall system complexity. This means you can tailor your NATS deployment precisely to your needs, but it requires careful consideration.

This miniseries of posts will explore a variety of NATS server topologies, increasing in complexity. Here, we’ll be covering:

  1. The single instance deployment

  2. A resilient three-node cluster

And for each:

  • Key characteristics of the topology

  • Tradeoffs in consistency, latency, and availability

  • Potential use cases

  • Impact on application integration, particularly within C#/.NET clients


The Singularity: Single NATS Server

A single NATS server is the most fundamental setup: one instance of the NATS server process. It's the easiest to grasp and get started with, making it a natural entry point for development and simple use cases. This is the same deployment we used in “Getting Started with NATS in C#”.

In this topology, a single NATS server process handles all client connections, message routing, and any configured persistence, i.e. JetStream, if enabled. All publishers and subscribers within our system connect directly to this single endpoint.

Characteristics at a Glance

  • Simplicity: Straightforward to configure and manage

  • Low Overhead: Requires the least amount of infrastructure resources, e.g. CPU, memory, network

  • Single Point of Failure: If the server becomes unavailable due to hardware failure, network issues, or software problems, the entire messaging system halts

  • Limited Scalability: The performance and throughput of the system are bounded by the capacity of the single server instance

    • Note: Even so, a single NATS server can still be quite performant, to the tune of millions of messages/sec depending on resources

  • No Redundancy: Availability is limited to the single node, with no automatic failover or backup

  • No JetStream Replication: Of course, with a single server, no JetStream resources can be replicated

Ideal Use Cases - Dev/Local/Non-Prod

  • Local Dev and Testing: Perfect for a local dev NATS environment and local integration testing

  • Demos, POCs: Proving new messaging patterns among microservices, demoing new features, etc.

  • Learning and Exploration: Low barrier to entry to understand the basic concepts of NATS without the complexities of clustering

NATS Server Config

Couldn’t be simpler! This specifies that clients should connect on the default port 4222, with the server listening on 0.0.0.0:4222.

port: 4222

Mounting a file for just this might be overkill; many options can be specified via the command line as well. While 4222 is already the default, just as an example with Docker, this is equivalent to the server.conf above:

docker run --rm --name my-nats -p "4222:4222" nats -p 4222
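And if you’d rather mount the config file, the official nats image can take it as well. A minimal sketch, assuming the image’s default config path of /etc/nats/nats-server.conf (worth verifying against the image docs for your version):

docker run --rm --name my-nats -p "4222:4222" \
  -v "$(pwd)/server.conf:/etc/nats/nats-server.conf" nats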

NATS.Net Integration

Connecting to a single NATS server uses a connection string pointing to its address, just as we saw in “Getting Started with NATS in C#”. For completeness, the NATS.NET client library provides the tools for this. Importantly, in this topology, our app becomes directly reliant on the availability of this single server. As a result, several factors will impact our app design.

  • Connection Management: The NATS.Client.Core library does provide robust connection management, but with a single server, it's primarily focused on the initial connection and handling disconnects

  • Initial Connection: By default, the client will lazily connect to the server, and fail fast on the initial connection attempt, i.e. the initial attempt is not retried and a NatsException is thrown

  • Reconnects: After an initial successful connection, any drop will result in a reconnect attempt up to a configured limit via NatsOpts.*Reconnect* properties

  • No Automatic Failover: Reconnection attempts have no other route except the same single address

var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<Program>();

var opts = new NatsOpts
{
    Url = "nats://localhost:4222", // single server, single point of failure
    LoggerFactory = loggerFactory
};

await using var connection = new NatsConnection(opts);

// Round-trip to the server; also triggers the lazy initial connection
var rtt = await connection.PingAsync(CancellationToken.None);

logger.LogInformation("Ping successful - {RTT}ms to {@ServerInfo}",
    rtt.TotalMilliseconds,
    connection.ServerInfo);
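Given the fail-fast and reconnect behavior above, it's worth being deliberate with the reconnect options even against a single server. A minimal sketch, assuming the NatsOpts reconnect properties found in recent NATS.Net versions (verify the names against your client version):

var opts = new NatsOpts
{
    Url = "nats://localhost:4222",
    MaxReconnectRetry = 10,                      // cap attempts; -1 retries forever
    ReconnectWaitMin = TimeSpan.FromSeconds(1),  // initial backoff between attempts
    ReconnectWaitMax = TimeSpan.FromSeconds(30), // backoff ceiling
    LoggerFactory = loggerFactory
};

await using var connection = new NatsConnection(opts);

// Connect eagerly so a bad address fails at startup rather than on first use
await connection.ConnectAsync();

No matter how generous the retry policy though, every attempt lands on the same single address; there's no failover target until we add more servers.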

Resilience Through Redundancy: Three Node Cluster

Clustering in NATS provides fault tolerance and increased capacity by connecting multiple NATS servers. A cluster of three nodes is a common starting point for production deployments and offers a balance between resilience and complexity.

In this topology, three NATS servers are interconnected via routes, forming a full mesh. Each server maintains bidirectional connections to the others, allowing them to fully propagate the Subject interest graph. Clients can connect to any server in the cluster, and the cluster will route messages appropriately, even if a server becomes unavailable.

Characteristics at a Glance

  • Fault Tolerance & Increased Availability: The cluster can still serve Core NATS traffic even in the face of a two-node failure, i.e. it can operate with only a single node

  • Improved Scalability: Distributes the load across multiple servers, increasing the overall capacity of the system.

  • Automatic Discovery: NATS servers in a cluster will automatically discover each other after reaching any ‘seed’ server; cluster gossip ensures a full mesh is maintained

  • Message Routing: Servers route messages to the appropriate destinations within the cluster; the cluster is transparent to publishers/subscribers beyond the connection string

  • Increased Complexity: Although still minimal, configuring and managing a cluster does involve more complexity than a single server

JetStream Considerations

Clustering is crucial for JetStream's fault tolerance and data durability. JetStream leverages the Raft consensus algorithm to replicate data across multiple servers within a cluster. This ensures, to the limit of consensus, that the data remains available and consistent.

  • Quorum: JetStream requires a quorum, \(\lfloor n/2 \rfloor + 1\), to be available in order to handle requests, e.g. reads, writes, acks, etc.

  • Storage: Each server in the cluster needs its own storage for JetStream data

  • Replication Latency: Replication adds some network overhead; low latency connections between servers are critical
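To make the quorum math concrete: with \(n = 3\), quorum is \(\lfloor 3/2 \rfloor + 1 = 2\), so JetStream keeps handling requests with one node down. Lose a second node and JetStream halts, even though Core NATS traffic can continue on the lone survivor.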

Replication and Consistency

  • Replication Factor: With three nodes, 3x replication of JetStream Stream and Consumer data/metadata can be achieved

  • Consistency: Raft ensures strong consistency, meaning that all replicas agree on the order of the data

    • Note: Replication is set at the Stream level and either inherited or optionally set at the Consumer level; “In Sync Replicas” will always equal Replicas, i.e. acks = all

  • Fault Tolerance: A replication factor of 3 allows the cluster to tolerate the failure of one server without losing any data while continuing to service requests

  • Increased Durability: Storing multiple copies of the data significantly reduces the risk of data loss due to hardware failures or other issues

Ideal Use Cases - Single Region Production

  • Single Region Production Environments: Essential for any production system that requires high availability and fault tolerance

  • Durable, Highly Available Apps: Applications where data loss or downtime is unacceptable

  • Scalable Systems: Systems that need to handle a high volume of messages and clients

NATS Server Config

A basic example of a server.conf. You could create three separate configuration files (server1.conf, server2.conf, and server3.conf), each with the appropriate settings for that particular server, although the config can often be reused depending on your networking environment. Note that routes pointing to “self” are smartly ignored, making config sharing a bit easier.

listen: 0.0.0.0:4222
cluster {
  name: my-cluster
  listen: 0.0.0.0:6222
  routes: [
    "nats://server1:6222",
    "nats://server2:6222",
    "nats://server3:6222"
  ]
}
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 1GB
  max_file_store: 10GB
}
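One caveat if you do share a single file: servers in a JetStream-enabled cluster should each carry a unique server_name (treat the exact requirement as version-dependent). A per-server override, shown here for a hypothetical server1, keeps the rest of the file shareable:

server_name: server1
# ...remainder identical to the server.conf above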

Key Differences

  • cluster: Section defines the details of our cluster

  • cluster.name: Uniquely identifies the cluster

  • cluster.listen: The host and port to listen on for incoming clustering connections

  • cluster.routes: A list of dedicated Routes to other servers

  • jetstream: Configures JetStream, including storage target and limits

NATS.Net Integration

Connecting to a NATS cluster with the NATS.Client.Core and NATS.Client.JetStream libraries is straightforward. This time, you can provide a list of server URLs, and the client will automatically connect to an available server. If that server becomes unavailable, the client will attempt to connect to another server in the list.

var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<Program>();

var opts = new NatsOpts
{
    Url = "nats://server1:4222,nats://server2:4222,nats://server3:4222",
    LoggerFactory = loggerFactory
};

await using var connection = await GetConnectionAsync(logger, opts);
var stream = await GetStreamAsync(logger, connection);

async Task<INatsConnection> GetConnectionAsync(ILogger logger, NatsOpts opts)
{
    // Don't dispose here; the caller owns the connection's lifetime
    var connection = new NatsConnection(opts);

    var rtt = await connection.PingAsync(CancellationToken.None);

    logger.LogInformation("Ping successful - {RTT}ms to {@ServerInfo}",
        rtt.TotalMilliseconds,
        connection.ServerInfo);
    return connection;
}

async Task<INatsJSStream> GetStreamAsync(ILogger logger, INatsConnection connection)
{
    var context = new NatsJSContext(connection);
    var config = new StreamConfig("my-first-stream", ["some.subjects.>"])
    {
        NumReplicas = 3
    };

    var stream = await context.CreateOrUpdateStreamAsync(config, CancellationToken.None);
    logger.LogInformation("Stream created/updated - {@StreamInfo}", stream.Info);
    return stream;
}

Key Differences

  • NatsOpts.Url: Set to a comma separated list of server URLs

  • NATS.Client.Core: Handles connecting across multiple servers, choosing an available server initially and reconnecting to another if the connection is lost

  • NatsJSContext: Building on top of Core NATS, the JetStream Context accepts an existing NatsConnection

  • my-first-stream: Configured with a replication factor of 3, ensuring that data is replicated across all three servers in the cluster
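To round things out, a quick usage sketch against the replicated stream: publish a message and confirm the cluster acknowledged it, then attach a durable consumer. The subject, payload, and consumer name are made up for illustration, and the calls assume recent NATS.Client.JetStream versions:

var context = new NatsJSContext(connection);

// Publish through JetStream so the write is acknowledged by the quorum
var ack = await context.PublishAsync("some.subjects.hello", "hello cluster");
ack.EnsureSuccess(); // throws if the cluster did not persist the message

logger.LogInformation("Persisted to {Stream} at seq {Seq}", ack.Stream, ack.Seq);

// Consumers inherit the stream's replication factor unless overridden via NumReplicas
var consumer = await stream.CreateOrUpdateConsumerAsync(
    new ConsumerConfig("my-durable-consumer"),
    CancellationToken.None);

With NumReplicas = 3, that publish is only acked once a majority of the replicas have persisted it; that's precisely the durability this topology buys.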


Wrap Up

In this first part exploring NATS topologies, from a single server to a three-node cluster, we've laid the groundwork for building robust and scalable messaging systems with NATS.

Next Up - Multiregional Clusters!

Moving beyond single region clusters, we're now ready to take our NATS deployments to the next level. In Part 2: Multiregional Clusters, we'll dive into how Superclusters and Leaf Nodes enable you to connect NATS deployments across geographical distances, building resilient and scalable messaging solutions that span the globe. Get ready to unleash the full potential of NATS for distributed, mission-critical applications!

Have a specific question about NATS? Want a specific topic covered? Drop it in the comments!


Written by

Joshua Steward

Engineering leader specializing in distributed event driven architectures, with a proven track record of building and mentoring high performing teams. My core expertise lies in dotnet/C#, modern messaging platforms, and Microsoft Azure, where I've architected and implemented highly scalable and available solutions. Explore my insights and deep dives into event driven architecture, patterns, and practices on my platform https://concurrentflows.com/. Having led engineering teams in collaborative and remote-first environments, I prioritize mentorship, clear communication, and aligning technical roadmaps with stakeholder needs. My leadership foundation was strengthened through experience as a Sergeant in the U.S. Marine Corps, where I honed skills in team building and operational excellence.