Enhance Scheduler Efficiency with a Touch of Jitter

Amir
3 min read

If you run into CPU spikes or underperforming systems at work, this article is for you!

The Thundering Herd problem

The Thundering Herd problem in software engineering refers to a scenario where a large number of processes or threads are awakened at the same time to handle an event, yet they all contend for a limited resource. The result is unnecessary context switching, CPU usage spikes, and overall system performance degradation.

Take a BI application that lets users schedule an analysis dashboard. If 200 users schedule dashboards for the same time against a database that supports only 20 connections, what happens?

Two hundred threads start at the same moment and all compete for the database connections. The resulting resource contention shows up on a Grafana dashboard as a sudden spike.

Jitter

The time it takes for a packet to travel from a host to your computer is called latency, and it is measured in milliseconds (ms). In an ideal world it would be consistent across all packets, but in practice it isn't: packets may travel different routes, and other factors like network congestion also affect latency.

So it is common for network monitoring tools to measure that deviation in latency, which is called jitter. On the network it is an unwanted effect, hence we need to keep an eye on it.
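To make that concrete, here is a minimal sketch of estimating jitter from a series of latency samples. The function name and the mean-absolute-difference formula are illustrative assumptions, not any particular monitoring tool's implementation:

```python
from statistics import mean

def jitter_ms(latencies_ms):
    """Estimate jitter as the mean absolute difference between
    consecutive latency samples (one common definition)."""
    if len(latencies_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(diffs)

# Latency (ms) measured for five consecutive packets
print(jitter_ms([20.1, 22.4, 19.8, 35.0, 21.2]))  # ≈ 8.5 ms of jitter
```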

How can jitter help?

To overcome the Thundering Herd problem, we can add a deliberate random timing variation (jitter) to each task so that multiple processes do not execute simultaneously. This is usually exposed as a configurable parameter/option in the system that can be adjusted to the specifics of the environment, e.g., a query execution delay in a BI analytics system.
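As a rough sketch of the idea (the parameter max_jitter_seconds and the run_dashboard_query function below are hypothetical, not taken from any specific product), the scheduler simply sleeps a random amount before running each task:

```python
import random
import threading
import time

MAX_JITTER_SECONDS = 30  # hypothetical knob, tuned to the environment

def run_with_jitter(task, max_jitter_seconds=MAX_JITTER_SECONDS):
    """Delay the task by a random amount in [0, max_jitter_seconds]
    so tasks scheduled for the same instant do not all start together."""
    time.sleep(random.uniform(0, max_jitter_seconds))
    task()

def run_dashboard_query():
    ...  # placeholder for the query that hits the shared database

# 200 dashboards all scheduled for the same time T
for _ in range(200):
    threading.Thread(target=run_with_jitter, args=(run_dashboard_query,)).start()
```

Instead of 200 threads hammering the 20 available connections at once, arrivals are smeared across the jitter window.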

The following diagram shows the effect with and without setting the max-jitter-in-seconds parameter:

In the top timeline (Without Jitter):

  • All processes execute at exactly the same moment (time T)

  • This creates a resource usage spike (shown in red)

  • System resources are overwhelmed at a single point in time

In the bottom timeline (With Jitter):

  • Processes are distributed randomly between time T and T+max-jitter

  • Resource usage is spread out evenly over time (shown in green)

  • The system experiences consistent, manageable load instead of spikes
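A quick simulation (a sketch only; the bucket size and task count are arbitrary) shows how the 200 start times spread out between T and T + max-jitter:

```python
import random

T = 0.0            # the common scheduled time (seconds)
MAX_JITTER = 30.0  # max-jitter-in-seconds

start_times = [T + random.uniform(0, MAX_JITTER) for _ in range(200)]

# Count how many tasks start in each 5-second bucket of the jitter window
buckets = [0] * 6
for t in start_times:
    buckets[min(int((t - T) // 5), 5)] += 1
print(buckets)  # roughly 30-35 tasks per bucket instead of 200 at time T
```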

Reduce your carbon footprint

High utilization spikes often require more energy per computation than sustained moderate loads. When servers run at 100% CPU for brief periods, the resulting heat spikes trigger power-hungry cooling systems. Not to mention that the Thundering Herd problem usually means many processes wake up only for most of them to fail and retry.
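Jitter helps with that retry storm as well. A common mitigation, sketched below with hypothetical names, is exponential backoff with "full jitter": each failed process waits a random slice of an exponentially growing window before retrying, so the retries do not line up into another herd.

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base_delay=1.0):
    """Retry an operation, sleeping a random amount within an
    exponentially growing window between attempts ("full jitter")."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Wait somewhere in [0, base_delay * 2**attempt] seconds
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```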

Incorporating jitter into your scheduling systems can significantly enhance efficiency by mitigating the Thundering Herd problem. By introducing random timing variations, you can prevent resource contention and reduce CPU spikes, leading to a more stable and manageable system load. This not only improves performance and scalability but also contributes to a more environmentally friendly operation by reducing energy consumption and carbon footprint. Embracing jitter is a strategic move towards optimizing system performance and sustainability.


Written by

Amir

A software engineer who loves badminton, weight training, long walks, poetry, and end-to-end software development.