How Message Queues Help Make Distributed Systems More Reliable
Reliable systems consistently perform their intended functions under various conditions while minimizing downtime and failures.
As internet users, we tend to take for granted that the systems that we use daily will operate reliably. In this article, we’ll explore how message queues enhance flexibility and fault tolerance. We’ll also discuss some challenges that we may face while using them.
After reading through, you’ll know how to implement reliable systems and what key performance factors to keep in mind.
Prerequisites
Before diving into this article, you should have a foundational understanding of cloud computing. Here are the key concepts:
Basic principles of Cloud Computing
Availability in Distributed Systems
The CAP theorem
What Does Reliability Mean in the Context of Distributed Systems?
Reliability, according to the OED, is “the quality of being trustworthy or of performing consistently well”. We can translate this definition to the following in the context of distributed systems:
The ability of a technological system, device, or component to consistently and dependably perform its intended functions under various conditions over time. For instance, in the context of online banking, reliability refers to the consistent and secure processing of transactions. Users expect to complete transfers and access their accounts without errors or outages.
The system being resilient to unexpected or erroneous interactions by users / other systems interacting with it. For instance, if a user tries to access a deleted file on a cloud storage system, the system can gracefully notify them and suggest alternatives, rather than crashing.
The system performs satisfactorily under its expected conditions of operation, as well as in the case of unexpected load and/or disruptions. An example of this is a video streaming service during a major sporting event. The system is designed to perform well under normal traffic but must also handle sudden spikes in users when a popular game starts.
This is quite a general view of what reliability is, and the definition itself evolves over time as systems and technology change.
What Makes Software Reliable?
There are several key techniques used industry-wide to make large-scale distributed software reliable.
Data Replication
Data replication is a fundamental concept in system design where data is intentionally duplicated and stored in multiple locations or servers.
This redundancy serves several critical purposes, including enhancing data availability, improving fault tolerance, and enabling load balancing.
By replicating data across different nodes or data centers, we may be able to ensure that, in the event of a hardware failure or network issue, the data remains accessible. This reduces downtime and enhances system reliability.
It's essential to implement replication strategies carefully, considering factors like consistency, synchronization, and conflict resolution to maintain data integrity and reliability in distributed systems.
Let’s look at a concrete example. With a primary-secondary database model, such as one commonly used for e-commerce websites, we may have the following:
Replication: The primary database handles all the write operations, whereas the secondary database(s) handles all the reads. This ensures that reads are spread out across multiple databases, enhancing performance and lowering the probability of a crash.
Consistency: The system may use eventual consistency to maintain integrity, ensuring that all replicas eventually reflect the same data. But during high-traffic periods, the website may temporarily allow for slight inconsistencies, such as showing outdated inventory levels.
Conflict Resolution: If two users attempt to buy a single available item at the same time, a conflict resolution strategy may be used. For instance, the system could use timestamps to determine the customer who gets assigned the product, and this may dictate database updates eventually.
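The ideas above can be sketched in a few lines of code. This is a minimal, illustrative model (the class and method names are my own, not any real database client): writes go to a primary, reads are spread round-robin across secondaries, replication is eventual, and conflicts are resolved by timestamp, with the earlier request winning the item.

```python
import time

class ReplicatedStore:
    """A toy primary-secondary store with timestamp conflict resolution."""

    def __init__(self, num_replicas=2):
        self.primary = {}                                  # handles all writes
        self.replicas = [{} for _ in range(num_replicas)]  # handle all reads
        self._read_cursor = 0

    def write(self, key, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        current = self.primary.get(key)
        # Conflict resolution: the earlier timestamp wins the item
        if current is None or ts < current[1]:
            self.primary[key] = (value, ts)

    def replicate(self):
        # Eventual consistency: secondaries catch up only when this runs,
        # so reads in between may see stale data
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)

    def read(self, key):
        # Round-robin reads across secondaries spread the load
        replica = self.replicas[self._read_cursor % len(self.replicas)]
        self._read_cursor += 1
        entry = replica.get(key)
        return entry[0] if entry else None

store = ReplicatedStore()
store.write('item-1', 'buyer-A', timestamp=100)
store.write('item-1', 'buyer-B', timestamp=99)  # earlier click wins the item
store.replicate()
print(store.read('item-1'))  # buyer-B
```

A real system would replicate asynchronously in the background and use more robust ordering than wall-clock timestamps, but the shape of the trade-offs is the same.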
Load Distribution Across Machines
Load distribution involves distributing computational tasks and network traffic across multiple servers or resources to optimize performance and ensure system scalability.
By intelligently spreading workloads, load distribution prevents any single server from becoming overwhelmed, reducing the risk of bottlenecks and downtime.
Some very commonly used load distribution mechanisms are:
Using Load Balancers: A load balancer can evenly distribute incoming traffic across multiple servers, preventing any single server from becoming a bottleneck.
Dynamic Scaling: Dynamic or auto-scaling can be used to automatically adjust the number of active servers based on current demand, adding more resources during peak times and scaling down during low traffic.
Caching: Caching layers can be used to store frequently accessed data, reducing the load on backend servers by serving requests directly from the cache.
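Two of these mechanisms can be combined in a short sketch: a round-robin load balancer in front of a pool of servers, with a cache in front that short-circuits repeated requests. The server names and response format here are purely illustrative.

```python
from itertools import cycle

class LoadBalancer:
    """Toy load balancer: round-robin dispatch plus a response cache."""

    def __init__(self, servers):
        self._pool = cycle(servers)
        self._cache = {}

    def handle(self, request):
        # Caching: serve repeated requests without touching a server
        if request in self._cache:
            return self._cache[request]
        # Round-robin: each new request goes to the next server in turn
        server = next(self._pool)
        response = f"{request} served by {server}"
        self._cache[request] = response
        return response

lb = LoadBalancer(["server-1", "server-2"])
print(lb.handle("/home"))   # dispatched to server-1
print(lb.handle("/about"))  # dispatched to server-2
print(lb.handle("/home"))   # answered from the cache
```

Production load balancers also track server health and use smarter policies (least-connections, consistent hashing), but round-robin is the simplest starting point.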
Capacity Planning
Capacity planning entails analyzing factors such as expected user growth, data storage requirements, and processing capabilities to ensure that the system can handle increased loads without performance degradation or downtime.
By accurately forecasting resource needs and scaling infrastructure accordingly, such planning helps optimize costs, maintain reliability, and provide a seamless user experience. Being proactive can help ensure a system is well-prepared to adapt to changing requirements and remains robust and efficient throughout its lifecycle.
A lot of modern systems can scale automatically with projected loads. When traffic or processing requirements increase, such auto scaling automatically provisions additional resources to handle the load. Conversely, when demand decreases, it scales down resources to optimize cost efficiency.
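The core of such an auto-scaling decision is a simple capacity calculation. The sketch below assumes an illustrative per-server capacity and a minimum fleet size kept for redundancy; real providers layer cooldowns and smoothing on top of this.

```python
REQUESTS_PER_SERVER = 1000   # assumed capacity of one server (illustrative)
MIN_SERVERS = 2              # keep a baseline fleet for redundancy

def desired_server_count(current_load):
    """Scale the fleet to the projected load, never below the baseline."""
    needed = -(-current_load // REQUESTS_PER_SERVER)  # ceiling division
    return max(needed, MIN_SERVERS)

print(desired_server_count(4500))  # peak traffic: 5 servers
print(desired_server_count(500))   # quiet period: scales down to the baseline of 2
```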
Metrics and Automated Alerting
Metrics involve collecting and analyzing data points that provide insights into various aspects of system behavior, such as resource utilization, response times, error rates, and more.
Automated alerting complements metrics by enabling proactive monitoring. This involves setting predefined thresholds or conditions based on metrics. When a metric crosses or exceeds these thresholds, automated alerts get triggered. These alerts can notify system administrators or operators, allowing them to take immediate action to address potential issues before they impact the system or users.
When used together, metrics and automated alerting create a robust monitoring and troubleshooting system, helping ensure that anomalies or problems are quickly detected and resolved.
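Here's a minimal sketch of that pairing: a monitor collects error-rate samples over a sliding window, and an alert fires when the windowed average crosses a predefined threshold. The threshold and window size are assumptions for illustration.

```python
from collections import deque

class ErrorRateMonitor:
    """Toy metrics-plus-alerting: alert when the average error rate
    over the last few samples exceeds a threshold."""

    def __init__(self, threshold=0.05, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the latest samples

    def record(self, errors, requests):
        self.samples.append(errors / requests)

    def should_alert(self):
        if not self.samples:
            return False
        # Averaging over a window avoids paging on a single noisy sample
        return sum(self.samples) / len(self.samples) > self.threshold

monitor = ErrorRateMonitor(threshold=0.05)
monitor.record(errors=1, requests=100)   # 1% error rate: healthy
monitor.record(errors=20, requests=100)  # 20% spike
print(monitor.should_alert())  # True
```

In practice the alert would page an on-call engineer or trigger automated remediation rather than just return a boolean.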
Now that you know a bit about what reliability means in the context of Distributed Systems, we can move on to Message Queues.
What is a Message Queue?
A message queue is a communication mechanism used in distributed systems to enable asynchronous communication between different components or services. It acts as an intermediary that allows one component to send a message to another without the need for direct, synchronous communication.
In a typical setup, multiple nodes (called producers) create messages that are sent to a message queue. These messages are processed by a node called the consumer, which may perform a series of actions (for instance, database reads or writes) as part of processing each message.
Now let’s look at an actual example where a message queue may be useful. Let’s assume we have an e-commerce website that allows millions of orders to be processed.
Processing an order may take place in the following steps:
A user creates an order. This sets off a request to a web server, that in turn creates a message that is placed in the orders queue.
A consumer reads the message and, while processing it, calls different services (for instance, the inventory service, the payment service, and the shipping service).
Once all processing steps have completed, the consumer removes the message from the queue.
Note that in case there are parts of the system that fail, the message can be left in the queue to be re-processed.
Even in cases where there is a total outage on the processing side of things, messages can simply pile up in the queue and be consumed once services are functional again. This is an example of a queue being useful in multiple failure scenarios.
Let’s look at some code for this scenario using AWS SQS, which is a popular message queue service that allows users to create queues, send messages to the queue, and also consume messages from queues for processing.
The example below uses Boto3, the AWS SDK for Python.
First, we’ll place an order, assuming we already have an SQS queue called OrderQueue in place.
import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Let's assume the queue is called OrderQueue
# This is the queue in which orders are placed
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to send an order message
# This places an order in the queue, which can at any time be
# picked up by a consumer and then processed
def send_order(order_details):
    message_body = json.dumps(order_details)
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(f'Order sent with ID: {response["MessageId"]}')

# Using the queue to place an order
# Defining a sample order
order = {
    'order_id': '12345',
    'customer_id': '67890',
    'items': [
        {'product_id': 'abc123', 'quantity': 2},
        {'product_id': 'xyz456', 'quantity': 1}
    ],
    'total_price': 59.99
}

# Sending the order to the queue, which is expected to be picked up
# by a consumer and processed eventually.
send_order(order)
Then once the order has been placed, here’s some code that illustrates how it’ll be picked up for processing:
import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Processing orders from the same queue defined above
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to receive and process orders
# Picking up a maximum of 10 messages at a time to process
def receive_orders():
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Up to 10 messages
        WaitTimeSeconds=10
    )
    messages = response.get('Messages', [])
    for message in messages:
        order_details = json.loads(message['Body'])
        print(f'Processing order: {order_details}')
        # Process the order: charge the payment, update
        # inventory levels, kick off shipping, and so on.

        # Delete the message after processing.
        # This is important since we don't want an
        # order to be processed multiple times.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle']
        )

# Receive a batch of orders
receive_orders()
What is an Intermediary in a Distributed System?
In the context of what we’re discussing here, a message queue is an intermediary. Quoting Amazon AWS’ definition of a message queue:
“Amazon Simple Queue Service (Amazon SQS) lets you send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.”
This is a wonderfully succinct and accurate description of why a message queue (an intermediary) is important.
In a message queue, messages are placed in a queue data structure, which you can think of as a temporary storage area. The producer places messages in the queue, and the consumer retrieves and processes them at its own pace. This decoupling of producers and consumers allows for greater flexibility, scalability, and fault tolerance in distributed systems.
How Message Queues Help Make Distributed Systems More Reliable
Now let's discuss how Message Queues help make Distributed Systems more reliable.
1. Message Queues Provide Flexibility
Message queues allow for asynchronous communication between components. This means that producers can send messages to the queue without waiting for immediate processing by consumers, so components can work independently and at their own pace. This makes designs flexible and keeps components as self-contained as possible.
2. Message Queues Make Systems Scalable
Message queues are often the bread and butter of scalable distributed systems for the following reasons:
Multiple producers can add messages to a message queue. This raises the throughput ceiling and lets us horizontally scale the producing side of an application with ease.
Multiple consumers can read from a message queue. This again allows us to easily scale throughput if needed in a lot of scenarios.
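Both points above can be demonstrated with the standard library alone. This sketch (no real broker involved) has several producer threads fanning messages into one queue while several consumer threads drain it concurrently; each message is consumed exactly once.

```python
import queue
import threading

q = queue.Queue()
processed = []
lock = threading.Lock()

def producer(producer_id):
    # Each producer independently adds its messages to the shared queue
    for i in range(3):
        q.put(f"msg-{producer_id}-{i}")

def consumer():
    # Each consumer pulls whatever is available, at its own pace
    while True:
        try:
            message = q.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained, this consumer exits
        with lock:
            processed.append(message)
        q.task_done()

producers = [threading.Thread(target=producer, args=(p,)) for p in range(2)]
consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in producers + consumers:
    t.start()
for t in producers + consumers:
    t.join()

print(len(processed))  # all 6 messages, consumed exactly once
```

Adding a producer or a consumer is just another thread here; with a managed queue like SQS, it's just another process or machine.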
3. Message Queues Make Systems Fault Tolerant
What happens if a distributed system is overwhelmed? We sometimes need to have the ability to cut the cord in order to get the system back to a working state. We’d ideally want the ability to process requests that weren’t processed when the system was down.
This is exactly what a message queue can help us with. We may have hundreds of thousands of requests that weren’t processed, but are still in the queue. These can be processed once our system is back online.
Challenges with Message Queues
As with most things, using message queues in distributed systems isn't a silver bullet for scaling problems.
That said, here are some situations where message queues tend to work well:
Asynchronous Processing: Message queues are generally an excellent choice in infrastructure wherever asynchronous processing is required. In workflows such as sending confirmation emails or generating reports after an order is placed, message queues can decouple these tasks from the primary application flow.
Load Balancing: As we saw in our example for message queues, in scenarios where traffic spikes occur, message queues can buffer incoming requests, allowing multiple consumers to process messages concurrently. This helps distribute the load evenly across available resources.
Fault Tolerance: In systems where reliability is crucial, message queues provide a mechanism for handling failures. If a service is temporarily down, messages can be retained in the queue until the service is available again, ensuring that no data is lost unless intended.
Here are some situations where message queues may not be useful:
Message queues can be great in scenarios where ordering of messages does not matter. But in situations where order does matter, they can sometimes be slow and more expensive to use.
Designing systems with queues that have multiple consumers isn’t trivial. What happens if a message is processed twice? Is idempotency a requirement? Or does it break our use case? These complexities can often lead us to situations where message queues may not be the best solution.
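One common answer to the duplicate-processing problem is to make the consumer idempotent. This is a minimal sketch of that idea (the class and field names are illustrative): the consumer remembers the order IDs it has already handled, so a redelivered message is skipped rather than charged twice.

```python
class IdempotentOrderProcessor:
    """Toy consumer that tolerates duplicate message deliveries."""

    def __init__(self):
        self.seen_order_ids = set()
        self.charges = 0

    def process(self, message):
        order_id = message["order_id"]
        # Duplicate delivery: skip instead of repeating side effects
        if order_id in self.seen_order_ids:
            return "skipped duplicate"
        self.seen_order_ids.add(order_id)
        self.charges += 1  # the side effect happens at most once per order
        return "processed"

processor = IdempotentOrderProcessor()
print(processor.process({"order_id": "12345"}))  # processed
print(processor.process({"order_id": "12345"}))  # skipped duplicate
print(processor.charges)                         # 1
```

In a real system the set of seen IDs would live in durable shared storage (and eventually be expired), which is exactly the kind of added complexity the point above is warning about.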
Summary
In this article, you learned about reliability in distributed systems, and how message queues can help make such systems more reliable. Here’s a summary of the key takeaways:
Reliability is central to distributed systems and there are a few common ways this is handled across the tech industry. Data replication, load distribution, and capacity planning are some ways that can improve the reliability of a system.
Message Queues are intermediaries that can store messages from producers. They can be picked up by consumers at a rate that's generally independent of the rate of production.
Queues are flexible, letting us pause processing during an unforeseen event and resume it later without losing messages.
Despite the versatility of message queues, they're not a panacea for reliability issues. There are often multiple considerations to be kept in mind while processing messages in a message queue.
Written by Anant Chowdhary