What my toddler taught me about fault tolerance


Reliability isn’t perfection—it’s predictability, even in chaos.
- Carl Eubanks, The 22nd Time He Told Me He Wanted Veggie Straws On Tuesday (2025).
Now, before I get into anything, there’s some important context I need to provide. What are veggie straws? Well, a quick Google search may give you an answer like: “Veggie straws are a type of snack food made from a combination of…”. That’s a lot of unnecessary information, so I’ll save you the trip. They’re a bag of processed, dehydrated vegetables and potato starch/flour shaped like rectangular straws.
The image illustrates an average night in the Eubanks household where I am strapped to the ceiling in order to eat veggie straws without someone taking them.
The veggie straw principle
Every night, right before bath time, my 3.5-year-old son hits me with a familiar request:
We want veggie straws.
Not sometimes. Not occasionally. Every. Single. Time. I can rely on it—just like I can rely on the sun rising, a demo going wrong that was “working 10 minutes ago”, or my son saying “we” when he most definitely does not mean “we”. That’s reliability in its most human form: given the same input, you get the same output. It’s not about whether it’s the “right” output. It’s about consistency.
Reliability in systems
The image depicts a father who, for the 9th time today (it’s only 11:00am), has mentioned something that he read in the book Designing Data-Intensive Applications, as well as a starfish onlooker staring at an alpaca and a llama who just spoke.
In chapter 1 of Designing Data-Intensive Applications, reliability is defined as:
A system’s ability to function correctly even when things go wrong.
That breaks down into a few things, but most notably: correctness, fault tolerance, and predictable behavior. First off, we must think of these things as they pertain to the system we are observing or using. With that in mind, correctness refers to the right result being returned when it works. What is this right result? Hint: make sure not to think too hard on that. It’s the predictable behavior. Under the same conditions, you know what will happen. And the system continues to give you this favorable—or unfavorable (can’t count out the amazing UX you’ve encountered that may or may not have driven you into a brief state of fight or flight)—outcome even when parts of it fail.
In other words, a reliable system doesn’t surprise you—unless it surprises you in a consistent way. My toddler, like everyone in the world who stays up rather than creating a healthy bedtime routine, dislikes going to bed. Although it should not, it still surprises me how he never fails to show the utmost resistance when I say the word “Alright” after I have read (or sang) the last of a book or two or five before bed.
On the flip side, there are times when something is unreliable. This is why, in my perspective, you must think of reliability as a state of being for a system, with different factors at play. Even if a system is fault-tolerant, it cannot tolerate every fault. Under certain load parameters, a system can be reliable, while with others, it becomes less and less reliable until it can be considered unreliable… under those conditions. So the question becomes: when can I rely on you?
Faults and failures
As mentioned above, reliability is built on fault tolerance, not fault avoidance. We define a fault as something that goes wrong internally (server crash, packet drop, hardware glitch), and a failure as the fault becoming visible to the end user because the system couldn’t handle it.
A fault-tolerant system:
detects the fault (observability, graceful and intentional error handling)
contains the damage (isolation, dead-letter queues, retries with backoff and jitter, fallbacks and circuit breakers)
recovers automatically (self-healing, failover)
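To make the “contains the damage” step concrete, here’s a minimal sketch of retries with exponential backoff and jitter. Everything in it (the flaky dependency, the delays, the snack-themed return value) is invented for illustration, not taken from any particular library:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky operation with exponential backoff plus jitter,
    so a crowd of clients doesn't all retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: the fault becomes a failure
            # full jitter: sleep somewhere in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Simulated flaky dependency: fails twice, then succeeds.
attempts = {"count": 0}
def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("packet drop")
    return "veggie straws"

print(call_with_retries(flaky))  # the caller never sees the two faults
```

The jitter matters: without it, every client that saw the same fault retries at the same instant and hammers the recovering service all over again.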
Why averages lie: the tail latency problem
Averages can make unreliable systems look great on paper (bear with me as I give meaningless numbers with zero context):
“Average latency: 300ms”
…but 1% of users are waiting 9 seconds.
That’s tail latency, and it’s where the real pain lives. Whether we choose to care about that tail is up to the business (FYI: most, in fact, do not). Thus, it’s imperative that you keep those tail experiences separate from the lived experiences of the commoner, the unremarkable, the quotidian, the 99%.
Percentiles tell the truth:
P50 (median): half of the requests are faster than this.
P95: 95% are faster than this—5% are slower.
P99: only 1% are slower than this.
A system that’s “fast most of the time” but occasionally glacial can still be functionally unreliable.
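Here’s a toy sketch of the average-versus-percentile gap, with invented numbers and a simple nearest-rank percentile (real monitoring tools may interpolate differently):

```python
# Hypothetical sample: most requests are quick, a few wait seconds.
latencies_ms = [100] * 95 + [300] * 4 + [9000]

def percentile(values, p):
    """Nearest-rank percentile: the value that p% of requests beat."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))  # clamp to at least rank 1
    return ordered[rank - 1]

average = sum(latencies_ms) / len(latencies_ms)
print(f"average: {average:.0f}ms")                 # looks respectable
print(f"p50: {percentile(latencies_ms, 50)}ms")    # the median experience
print(f"p99: {percentile(latencies_ms, 99)}ms")
print(f"p100: {percentile(latencies_ms, 100)}ms")  # the 9-second tail
```

The average comes out under 200ms while one user in a hundred waits 9 seconds, which is exactly how an unreliable system looks great on paper.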
Building for fault tolerance
If veggie straws are inevitable, plan around them. If faults are inevitable, design around them.
Redundancy: keep spares (replication, standby nodes, a 2nd box of veggie straws from Costco).
The image depicts a father purchasing multiple boxes of veggie straws in an ideal world where his sons will not change their mind the next day on what snacks they “like”. He, without regard for the money in his pocket and the unhinged, extremely malleable brains of his little ones, plans for the “inevitable” case of them eating more, and not less. He is unbothered by the inaccurate sign placed around the alpaca, nor the smiling starfish with shorts who had given him a box of Not Again’s.
Isolation: don’t let one failure take down the whole system.
Graceful degradation: better partial service than total outage.
Monitoring: faults you don’t see are just failures waiting to happen.
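Graceful degradation in particular is easy to sketch. Below is a hypothetical fallback: if the live service is unreachable, serve a cached default rather than an error page. All the names here are made up for the example:

```python
def get_recommendations(user_id, fetch_live, cached_defaults):
    """Graceful degradation: if the personalization service is down,
    serve a stale-but-sensible default instead of an error page."""
    try:
        return fetch_live(user_id)
    except ConnectionError:
        # Partial service beats a total outage: the user still sees
        # *something*, just not the personalized version.
        return cached_defaults

popular_snacks = ["veggie straws", "goldfish crackers"]

def service_is_down(_user_id):
    raise ConnectionError("recommendation service unreachable")

print(get_recommendations(42, service_is_down, popular_snacks))
```

The fault (an unreachable dependency) still happened; it just never became a failure from the user’s point of view.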
The image depicts a father who has just set up a new, state-of-the-art monitoring system strictly for veggie straw inventory, which is stored in containers reinforced with glass-clad polycarbonate.
Real-world example: Google’s “whoops” moment
One famous outage happened when there was a faulty automated quota update to the API management system (Google Cloud’s Service Control component).
Fault: An update that contained corrupted data, including blank fields, which were then distributed globally.
Failure: The blast radius was extreme. Binaries responsible for validating API traffic entered a crash loop, rejecting all API requests, and the lack of proper redundancy meant the failure cascaded throughout the system.
Lesson: Faults are normal. Failures are preventable.
The reliability mindset
A reliable system doesn’t promise nothing will go wrong. Rather, it promises that we’ve thought about what will go wrong, we’ve contained it, and we’ve kept it from ruining the whole experience.
To close, below is a thought experiment that you can take to many situations, not just technical ones:
For the people who stay up late, go to bed at different times each night, and regret it in the morning (hey, some people don’t regret it): what fault(s) did you encounter, and what is the failure that occurred? How can your “system” become fault tolerant in this scenario?
Although there are no wrong answers, there are, most certainly, better ones.
- Carl Eubanks, Looking Both Ways on a One Way Street (2022)
Next in the Series: Maintainability — Choose your hard