A song of SLI, SLO and SLA

"Accountability breeds response-ability."

The quote is from Stephen Covey's book, "The 7 Habits of Highly Effective People." In it, Covey discusses at length how holding an individual or a team accountable builds ownership and commitment at work. But it does not end there: through accountability, people improve their skills, make thoughtful decisions, and take practical actions that prepare them to face any situation. Covey calls this capacity to respond "response-ability." Let us see how this concept applies to our software world, where most of us live surrounded by code and systems.

In today's dynamic software landscape, striking a balance between speed and resilience is one of the toughest challenges any team faces. But here is the real puzzle: how do we quantify how reliable a system is, and how much our work affects it? Is there a magic wand? A speedometer like the one in a car? Or are there specific metrics we should consider when dealing with distributed systems? The answer revolves around a fundamental concept that I like to call "A Song of SLI, SLO and SLA." Yes, you read that right, it is a song. But don't expect any Targaryens riding dragons this time. It is more like a melody for reliability folks, and they love to vibe with it, because it helps them track essential metrics, establish accountability, and make informed decisions based on system requirements.

Let us take a bird's-eye view and understand this through an example. Imagine you are preparing for a university entrance exam. You start by identifying key study areas, such as practicing writing comprehension, solving math problems, memorizing important physics formulas, fixing your sleeping hours, and so on. You create a daily schedule, setting aside specific times for each subject. After a week, you find that you are answering 70% of the questions correctly but exceeding the exam duration of two hours. To improve, you add a revision slot to your timetable. Two months later, your accuracy is up to 75%, and you are finishing within the time limit, but you need at least 80% to compete effectively.

You tweak your schedule, adding more revision slots and focusing on weaker areas. Over six months you refine your plan, eventually reaching 88% accuracy and saving 15 minutes. Finally, on exam day, you score 92% within the time limit. If this process sounds familiar, then congratulations! You unknowingly worked with SLIs, set related SLOs, and accomplished SLAs in your study strategy in one way or another. Now let us look at each of these three terms in depth, one at a time.

Service Level Indicator (SLI)

An SLI is a quantitative measure that helps us understand how a service, or some specific aspect of a system, is performing. For example, imagine students' test scores in different subjects during their first semester: each student's score on a test is an SLI, a measure of how that student is performing. Some commonly used SLIs in software engineering are:

Latency: The proportion of requests processed within a specified time frame.
Error Rate: The proportion of failed requests compared to total requests.
Availability: The percentage of time a service is operational and accessible.
CPU and Disk usage: Resource figures that help in understanding system health and performance over time.
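
To make these indicators concrete, here is a minimal Python sketch of how the first three could be computed from raw data. The `Request` record, its field names, and the sample numbers are hypothetical and exist only for illustration; a real system would usually read these values from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request. Field names are hypothetical."""
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Proportion of requests processed within threshold_ms."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the target
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def error_rate_sli(requests: list[Request]) -> float:
    """Proportion of failed requests compared to total requests."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def availability_sli(up_minutes: float, total_minutes: float) -> float:
    """Fraction of the window in which the service was operational."""
    return up_minutes / total_minutes


# Made-up sample: four requests, one of them slow and failed.
sample = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=340, ok=True),
    Request(latency_ms=2500, ok=False),
    Request(latency_ms=90, ok=True),
]

print(f"latency SLI (<= 2 s): {latency_sli(sample, threshold_ms=2000):.2%}")
print(f"error rate SLI:       {error_rate_sli(sample):.2%}")
print(f"availability SLI:     {availability_sli(43_157, 43_200):.3%}")
```

Each function returns a plain ratio, which keeps the indicator easy to compare against a target later on.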

Most of the time, we have numerous options when choosing SLIs. The most effective way to avoid getting lost in the SLI universe is to focus on what your business needs most and keep things as simple as possible. This reduces overall cost and effort, as there are fewer metrics to track.

Service Level Objective (SLO)

An SLO is a target value or range of values set for a specific SLI over a defined period. It helps ensure that the service meets the desired level of performance and reliability over time. For example, the teacher sets a goal that 90% of students should score at least 70% in the next test. This goal is the SLO: a performance objective for the class based on the students' test scores. Some commonly used SLOs include:

Disk Usage: Maintain disk space utilization below 90% on all critical storage volumes, or ensure average disk I/O wait time remains below 10 milliseconds over a rolling thirty-day period.
CPU Usage: Ensure that CPU utilization does not exceed 85% for more than five minutes within any twenty-four-hour period, or maintain average CPU utilization below 85% over a rolling thirty-day period.
Error Rate: Less than 1% of database writes fail over a period of one week.
Latency: 99% of web pages must load within two seconds over a seven-day period.
Availability: Ensure 99.95% availability for creating an EC2 instance in the Canada Central Region over one year, or ensure the service has 99.99% to 99.999% uptime over a thirty-day period.
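
Since an SLO is simply a target placed on an SLI, checking one boils down to a comparison. The sketch below shows one possible way to express a few of the objectives above in Python. The `SLO` class, the measured values, and the handling of exact boundary values are assumptions made for illustration, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A target on one SLI. This structure is hypothetical."""
    name: str
    target: float            # target expressed as a ratio, e.g. 0.99
    higher_is_better: bool   # True for latency/availability ratios

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        # "Less than 1%" style objectives: strictly below the target.
        return measured < self.target


slos = [
    SLO("latency: pages loaded within 2 s (7 days)", 0.99, True),
    SLO("error rate: failed database writes (1 week)", 0.01, False),
    SLO("availability (30 days)", 0.9995, True),
]

# Made-up measurements for the same windows.
measured = {
    "latency: pages loaded within 2 s (7 days)": 0.993,
    "error rate: failed database writes (1 week)": 0.012,
    "availability (30 days)": 0.9997,
}


def evaluate(slos: list[SLO], measured: dict[str, float]) -> dict[str, bool]:
    """Map each SLO name to whether its measured SLI meets the target."""
    return {slo.name: slo.is_met(measured[slo.name]) for slo in slos}


for name, met in evaluate(slos, measured).items():
    print(f"{'MET   ' if met else 'MISSED'}  {name}")
```

In this made-up data the error rate objective is missed while the other two are met, which is exactly the kind of signal an SLO is meant to surface.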

The Dance of the 9s

When we look at the SLO examples above, it is clear that higher availability, shown by more nines, boosts the reliability of any service or system. The more nines, the more reliable the system seems. This is a reasonable assumption for now. However, the challenge lies in the details. Each additional nine cuts the allowed downtime by a factor of ten, so achieving 99%, 99.9%, or even 99.99% availability requires far more effort and money with every step. Let us call this "The Dance of the 9s."

Availability	No. of Nines	Downtime/year	Downtime/month	Downtime/day
99%	Two	~ 5184 minutes	~ 432 minutes	~ 14.40 minutes
99.9%	Three	~ 518.4 minutes	~ 43.2 minutes	~ 1.44 minutes
99.99%	Four	~ 51.8 minutes	~ 4.3 minutes	~ 0.14 minutes
99.999%	Five	~ 5.2 minutes	~ 0.4 minutes	~ 0.01 minutes
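
The figures in the table follow from simple arithmetic, which the short Python sketch below reproduces. The function name is hypothetical. Note that the table's numbers correspond to a 30-day month and a 360-day year (twelve such months); with a 365-day year, 99% availability would allow roughly 5,256 minutes instead.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    return (1 - availability_pct / 100) * window_days * MINUTES_PER_DAY


# Assumption matching the table: 30-day month, 360-day year.
WINDOWS = {"year": 360, "month": 30, "day": 1}

for pct in (99, 99.9, 99.99, 99.999):
    cells = ", ".join(
        f"{name}: {allowed_downtime_minutes(pct, days):.2f} min"
        for name, days in WINDOWS.items()
    )
    print(f"{pct}% -> {cells}")
```

Changing the values in `WINDOWS` is enough to recompute the table for a different definition of month or year.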

To understand this better, let us use a simple example. Suppose we aim for 99% uptime for a service over one year. It means the service can be down for at most 5,184 minutes in a year (counting the year as twelve 30-day months, as the table does), which is equivalent to 432 minutes per month or 14.4 minutes per day. This allowable downtime is commonly known as the Error Budget, which here amounts to 1% of the total time. Now the question is, 'Can we actually afford this downtime?' Well, there is no straightforward answer because it depends on the context. Yes, you read that right, it really depends.
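
An error budget becomes useful once we track how much of it has been spent. The following Python sketch illustrates the idea for a 99% target over a 30-day window. The function names and the incident durations are invented for this example.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Total downtime the SLO tolerates within the window."""
    return (1 - slo_pct / 100) * window_days * MINUTES_PER_DAY


def budget_status(
    slo_pct: float, window_days: int, incident_minutes: list[float]
) -> dict[str, float]:
    """Summarize how much of the error budget the incidents consumed."""
    budget = error_budget_minutes(slo_pct, window_days)
    spent = sum(incident_minutes)
    return {
        "budget": budget,
        "spent": spent,
        "remaining": budget - spent,  # negative means the SLO is breached
    }


# Made-up outages (in minutes) during one 30-day window.
status = budget_status(slo_pct=99, window_days=30,
                       incident_minutes=[120.0, 45.5, 200.0])

print(f"budget:    {status['budget']:.1f} min")
print(f"spent:     {status['spent']:.1f} min")
print(f"remaining: {status['remaining']:.1f} min")
```

A shrinking remainder tells a team how much room for failure is left in the window before the objective is missed.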

Let us dig deeper. Imagine this service is a recommendation feature on Notflix or a real-time chat in a live sports streaming app with millions watching. Can we afford 14.4 minutes of downtime a day for this feature? Probably yes. Conversely, if recommendations and chat are working but we cannot watch a movie or a live match for 14.4 minutes because streaming is down, users will be unhappy. After all, users did not pay for a subscription to chat or look at movie posters. So they might tolerate some downtime for chat or recommendations in these applications, but not for streaming content.

Consider another example: a payment service in the Bmazon app with a 99% uptime target. Can the company afford 14.4 minutes of downtime a day for payments? No, especially during a festive or holiday season. The financial losses could be huge. In this case, we need more nines. An uptime target of 99.9% (~1.4 minutes of downtime per day), 99.99% (~0.1 minutes per day), or even higher is a much better fit.

Now, to widen our understanding of reliability and SLOs, take the example of a sporting event like the FIFA World Cup Final. Unlike the everyday transactions in the previous examples, such events are short but intense and can attract millions of concurrent viewers worldwide. Imagine the streaming service has capacity for thirty million users with 99.99% availability. Now, imagine Messi sprinting towards the goal in the 88th minute. More fans flock to the platform, user numbers spike, and the system, built for thirty million users, suddenly crashes.

Missing a few seconds of such a game or experiencing any downtime can ruin the viewer's experience. In these moments, even 99.99% reliability is not enough; we need 99.999% or sometimes even higher. The financial stakes are enormous, and these events are unpredictable by nature. Viewer traffic depends on many factors: the event's significance, the teams playing, star players like Ronaldo facing Messi in the finals, key substitutions, or a crucial penalty kick. Notifications about Messi scoring or getting injured can also drive traffic.

Sporting events like the Super Bowl and major cricket matches are also great examples, as they often attract millions of concurrent viewers globally. To manage such high-stakes events, streaming companies, apart from prioritizing efficient testing and capacity planning, also consider how many "nines" of reliability are necessary to ensure a seamless streaming experience throughout the game.

Adding more "nines" to our SLOs can definitely boost service uptime, but it also significantly increases costs and demands more engineering effort. Therefore, when setting SLOs, engineering and SRE teams should collaborate closely with business stakeholders to effectively understand the potential use cases and requirements. This way, we can strike the right balance between availability and practicality. The focus should be on setting flexible, realistic, and achievable SLOs that fit the use case rather than just chasing "The Dance of the 9s."

Additionally, beyond establishing SLOs, it is crucial to maintain and evolve the current SLOs over time wherever feasible. This approach can provide long-term benefits, allowing us to monitor which indicators and SLOs are performing well and remain suitable for our use case.

Service Level Agreement (SLA)

An SLA is a legal promise, or more precisely a formal contract, agreed between a service provider and a customer. It defines the expected level of service and the consequences if the predefined SLOs are not met, typically service credits or some other form of financial penalty. For example, the school (institution) promises parents that at least 90% of the students will pass their second semester exams with a score of at least 70%. If this goal is not met, the school might offer extra tutoring sessions at no cost. This agreement with the parents is the SLA. Some commonly used SLAs are:

Response Time: The service provider guarantees that 99% of web pages will load within 2 seconds over any seven-day period. If this target is not met, the provider will offer a service credit of 10% of the monthly fee for each 1% below the threshold.
Uptime: The service provider guarantees 99.9% uptime over thirty days. If the target is not met, the provider will issue a service credit of 10% of the monthly fee for every 0.1% below the threshold.
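
The credit rules in these two examples are easy to turn into arithmetic. Here is a rough Python sketch of such a calculation. The function name, the decision to count only full steps below the threshold, and the cap at 100% of the monthly fee are all assumptions; an actual contract would spell these details out. `Decimal` is used so that steps such as 0.1% are not distorted by floating-point rounding.

```python
from decimal import Decimal


def service_credit(
    measured_pct: str,
    target_pct: str,
    monthly_fee: str,
    step_pct: str = "0.1",
    credit_per_step: str = "0.10",
) -> Decimal:
    """Credit owed when measured_pct falls below target_pct.

    Assumptions: only full steps below the target count, and the
    credit is capped at the whole monthly fee.
    """
    shortfall = Decimal(target_pct) - Decimal(measured_pct)
    if shortfall <= 0:
        return Decimal("0")  # target met, nothing owed
    full_steps = shortfall // Decimal(step_pct)
    credit_fraction = min(full_steps * Decimal(credit_per_step), Decimal("1"))
    return Decimal(monthly_fee) * credit_fraction


# Uptime example: 99.9% promised, 99.7% delivered, fee of 1000.
print(service_credit("99.7", "99.9", "1000"))

# Response time example: 99% promised, 97% delivered, credit per 1% step.
print(service_credit("97", "99", "500", step_pct="1"))
```

Passing the percentages as strings keeps them exact, since `Decimal("0.1")` represents one tenth precisely while the float `0.1` does not.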

To better understand what kinds of deliverable promises a formal contract may contain, let us consider a real-world scenario. Imagine Company A is negotiating with Company S, which is set to handle all of A's observability and monitoring needs. The two may agree on some critical SLAs to ensure that Company A's services stay reliable most of the time:

Alert Delivery Time: Company S commits to delivering alerts promptly, say within 10 seconds of an incident taking place.
Alert Accuracy: Company S ensures that its alert system is highly accurate, keeping false positives to a minimum. For example, it might guarantee that false positives stay below 1%.
Custom Alerting and Integration: Company S promises to support custom alerts and notifications that integrate seamlessly with other tools such as webhooks, email, PagerDuty, and Slack, and to deliver them soon after an incident is detected, say within 12 seconds.
Performance and Throughput: Company S guarantees that certain performance metrics, like response time, transaction speed, or data throughput, will meet the agreed standards, ensuring that the service runs smoothly and efficiently.
Availability of Alerting and Monitoring Services: Company S pledges to keep its alerting and monitoring services up and running 99.99% of the time, offering near-continuous availability that Company A can count on.
Data Retention and Availability: Company S ensures that monitoring, tracing, and logging data will be available and retrievable for a specified period, such as 30 days, without any loss. Tiered storage, a concept from data warehousing, truly shines in this area.

Similarly, there are several other areas in which a service provider and a customer can reach agreement, with compensation due when the related promises or SLOs are not met. Usually, this compensation takes the form of service credits or additional support hours provided by Company S to Company A.

Closing Thoughts

To conclude, in this ever-evolving software symphony, maintaining reliability is crucial for businesses that want to enhance the customer experience. We should truly appreciate Covey's assertion that accountability breeds response-ability. Embracing the song of SLI, SLO, and SLA won't automatically make our systems more reliable. However, it will empower people within organizations to understand when and where to invest their efforts effectively, which is a real step towards building reliable systems.
