The Art of Site Reliability Engineering - PART 3


In PART 2 we covered the SLI vocabulary in SRE. Now let's discuss the Service Level Objective (SLO).
What is a Service Level Objective (SLO)?
SLO stands for Service Level Objective, and it depends on SLIs. An SLI is the measured ratio of good events to valid events; an SLO is the target for that ratio, so it tells us whether we're achieving the right level, given an indicator.
Organizations define a target over a specific period (per hour, per day, per month…) to determine service quality and availability. The goal must be reasonable and achievable.
SLO Process Overview
To start building your SLOs, consider the following process:
1) List the critical business transactions:
These are the actions on your system that are critical for your business's survival. If one of them fails, it could damage your business and its reputation.
Try to order the critical business transactions by business priority.
For each one, choose a target number based on past performance that you know is acceptable. You can set it as a goal, measure it, and revisit it every 2-3 months.
Base it on business metrics that indicate how happy your customers are with the product.
2) Define Good Indicators:
- Examples of good indicators are response time (speed) and availability.
3) Define Objective:
- Example: for an availability SLI, count 20X responses as good events (the numerator) and all valid events, good and bad alike (for example 20X plus 50X), as the denominator.
SLI = (Good Events / Valid Events) * 100
It is usually expressed as a percentage.
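As a rough illustration of the ratio above, here is a minimal Python sketch. It assumes that 2xx status codes are the good events, that 5xx codes are the bad ones, and that everything else (for example 4xx client errors) is left out of the valid set. That last choice and the function name `availability_sli` are assumptions made for this example, not something the process prescribes.

```python
def availability_sli(status_codes):
    """Return good events / valid events as a percentage, or None if there are no valid events."""
    # Assumption: 2xx responses are good events.
    good = sum(1 for code in status_codes if 200 <= code < 300)
    # Assumption: 5xx responses are the bad events; other codes are not counted as valid.
    bad = sum(1 for code in status_codes if 500 <= code < 600)
    valid = good + bad  # the denominator contains good AND bad events
    if valid == 0:
        return None  # the SLI is undefined without valid events
    return 100 * good / valid


codes = [200, 201, 200, 500, 404, 200, 503, 200, 200, 200, 204]
print(availability_sli(codes))  # 8 good out of 10 valid events
```

Whether 4xx responses belong in the valid set depends on your service, so treat the filter as something to adjust.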
- Example: for a response time SLI, think about what response time is acceptable to users and what is unacceptable, and base your objective on this cutoff point.
It's helpful to have two cutoffs:
One that is a very long cutoff, like 10 seconds. If something takes longer than 10 seconds, we call it an error, even if the response is successful.
And one that is a reasonable response time, like 1 second. That's our desired cutoff time.
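The two cutoffs can be sketched in Python as follows. The 1 second and 10 second values come from the example above; the function name `latency_sli`, the returned dictionary, and the sample durations are hypothetical and only there to make the idea concrete.

```python
def latency_sli(durations_s, target_s=1.0, error_s=10.0):
    """Summarise request durations (in seconds) against a desired and a hard cutoff."""
    total = len(durations_s)
    if total == 0:
        return None  # nothing to measure
    # Requests at or under the desired cutoff count as good for the SLI.
    within_target = sum(1 for d in durations_s if d <= target_s)
    # Requests over the long cutoff are treated as errors, even if they succeeded.
    too_slow = sum(1 for d in durations_s if d > error_s)
    return {
        "within_target_pct": 100 * within_target / total,
        "counted_as_errors": too_slow,
    }


sample = [0.2, 0.5, 0.9, 1.4, 3.0, 12.0, 0.3, 0.8]
print(latency_sli(sample))
```

In this sample, five of the eight requests meet the 1 second cutoff, and the single 12 second request is counted as an error.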
4) Define Error Budget:
The error budget is, simply put, the complement of the SLO: 100% minus the SLO.
Example: If you have a 99% SLO, then the error budget is 1%. Look at how many requests came in during the period; at most 1% of them may be errors.
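To make the arithmetic concrete, here is a small Python sketch of the 99% example. The request counts are made-up numbers, and `allowed_errors` and `remaining_budget` are hypothetical helper names. `Fraction` is used so that targets such as 99.9 do not pick up floating-point rounding errors.

```python
import math
from fractions import Fraction


def allowed_errors(slo_percent, total_requests):
    """Return the maximum number of failed requests the SLO allows in the period."""
    # Error budget = 100% - SLO, expressed as an exact fraction.
    budget = (100 - Fraction(str(slo_percent))) / 100
    return math.floor(total_requests * budget)


def remaining_budget(slo_percent, total_requests, failed_requests):
    """Return how many more failures fit in the budget (negative means overspent)."""
    return allowed_errors(slo_percent, total_requests) - failed_requests


# Hypothetical period: 1,000,000 requests, 2,500 of them failed.
print(allowed_errors(99, 1_000_000))          # 1% of the requests
print(remaining_budget(99, 1_000_000, 2_500))
```

With a 99% SLO and one million requests, the budget is 10,000 failed requests, so 7,500 are left after 2,500 failures.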
Spending the budget involves a slow burn rate and a fast burn rate.
Example: you get a pool of errors up front, and then you look at how fast you're burning through them.
Slow burn: You introduce something new with a small error rate. It trickles along and, by the end of the period, makes you exceed your budget. The period can be 28 days, for example.
Fast burn: You push a new configuration, and it's just broken.
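One common way to put a number on "how fast you're burning" is the ratio between the observed error rate and the error rate the SLO allows. This definition is an assumption borrowed from general SRE practice rather than something stated above, and the function names and sample values are hypothetical. A burn rate of 1 means the budget lasts exactly to the end of the period.

```python
import math


def burn_rate(errors, requests, slo_percent):
    """Return observed error rate divided by the error rate the SLO allows."""
    allowed_rate = (100 - slo_percent) / 100   # e.g. 0.01 for a 99% SLO
    observed_rate = errors / requests
    return observed_rate / allowed_rate


def days_until_budget_is_gone(rate, period_days=28):
    """Return how long a full budget lasts if the burn rate stays constant."""
    if rate <= 0:
        return math.inf  # no errors, the budget is never used up
    return period_days / rate


slow = burn_rate(errors=20, requests=1_000, slo_percent=99)   # a small trickle of errors
fast = burn_rate(errors=500, requests=1_000, slo_percent=99)  # a broken configuration
print(slow, days_until_budget_is_gone(slow))
print(fast, days_until_budget_is_gone(fast))
```

In this sketch the slow burn uses up a 28-day budget in about two weeks, while the fast burn uses it up in well under a day.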
5) Define Alerts:
The response should not be the same for the two types of burn rate: one should be an alert, the other a ticket.
Fast burn rates usually turn into alerts. The alert should be actionable.
The slow burn is a ticket-level alert. It needs to be looked at by the end of the period: you have 7/14/28 days to fix it before it becomes a problem.
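The split between alerts and tickets could be sketched like this in Python. It assumes the burn rate is expressed as a multiple of the allowed error rate. The thresholds of 10 and 1 are illustrative placeholders, not recommended values, and `route_burn` is a hypothetical name.

```python
def route_burn(burn_rate, alert_threshold=10.0, ticket_threshold=1.0):
    """Decide how to respond to a burn rate (a multiple of the allowed error rate)."""
    # Assumption: thresholds are placeholders; tune them for your own service.
    if burn_rate >= alert_threshold:
        return "alert"   # fast burn: someone has to act now
    if burn_rate >= ticket_threshold:
        return "ticket"  # slow burn: fix it before the period ends
    return "none"        # the budget will last the whole period


for rate in (50.0, 2.0, 0.4):
    print(rate, route_burn(rate))
```

Keeping the thresholds as parameters makes it easy to revisit them together with the SLO itself.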
