⚠️ From Outage to Insight | Managing Significant Havoc in Technology 🛠️


In the world of technology, there’s a familiar, if crude, expression: SHIT happens. But in IT, it’s not just an offhand remark—it’s an inevitable reality. Significant Havoc in Technology (SHIT) is always lurking, waiting to disrupt operations, affect customers, and tarnish reputations. Despite our best efforts to design, automate, test, and defend against failure, complex systems fail. Not occasionally. Inevitably. 😵💫⚠️🛠️
This article outlines why outages and incidents are bound to occur, drawing on the foundational work of Dr. Richard Cook’s “How Complex Systems Fail” to show that the real challenge lies not in pretending SHIT won’t happen—but in preparing for it when it does. 📚💡📊
1. Complexity Is Inherently Hazardous 🕸️🧨🔍
Dr. Richard Cook’s research reveals several uncomfortable truths about complex systems, all of which apply perfectly to modern IT environments:
Systems are intrinsically hazardous: They operate with a combination of interdependent parts, each capable of introducing failure.
Defences exist but are incomplete: While we build layers of redundancy and monitoring, outages typically occur due to multiple, compounding failures.
Latent risks hide in the system: Unused code paths, hidden misconfigurations, and outdated dependencies are all time bombs waiting for a trigger.
Degraded operation is the norm: Systems often run on workarounds, manual interventions, and silent compromises. We just don’t always see it.
2. The Root Cause Myth & the Human Factor 🧠🔄🧑💻
There’s a dangerous comfort in pointing to a root cause, especially when it’s labelled as human error. But in truth:
Root cause thinking is flawed: Outages are almost always the result of multiple contributory factors.
Hindsight bias distorts the analysis: Once we know what failed, we too easily blame people for not preventing it.
Humans are not the problem—they are the solution: Operators are both the creators and defenders of system integrity. They resolve ambiguity, adapt in real time, and create safety dynamically.
As a colleague often says: We play the bell curve. Networks are as bound by game theory as they are by physics. All actions in operations are calculated gambles, and the best ones are based on experience, not rigid rules. 🎲📈📘
3. SHIT is Always Just Around the Corner 🛑🧯🌩️
The concept of “crisis always being near” is a reality in IT. Technology environments grow more complex by the day, with each new service, integration, and update introducing fresh failure modes.
Every change introduces new forms of outages
Failures are not isolated events—they are systemic consequences
The absence of outages doesn’t imply stability
Murphy’s Law—"Anything that can go wrong, will"—isn’t a joke in IT. It’s a risk posture. 🙃📉📍
4. Learning from the Misses | Incident Pyramids & Precursor Events 🧱📄🔎
Heinrich’s Incident Pyramid, a model from industrial safety, proposes that for every significant incident there are many more minor incidents and near misses (classically expressed as a 1:29:300 ratio of major injuries to minor injuries to no-injury events). While the exact ratio in IT may be elusive, the principle remains powerful:
Significant outages are preceded by many lesser incidents
Documenting and analysing these “misses” builds resilience
Failure-free operations require experience with failure
Just because an incident didn’t result in downtime doesn’t mean it can be ignored. Misses matter. 🫣📊🧩
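To make the principle concrete, here is a minimal Python sketch, purely illustrative, of the habit the pyramid argues for: tally every logged event by severity so the ratio of misses to major outages becomes visible. The log entries, severity labels, and numbers below are hypothetical assumptions, not a prescribed schema.

```python
# Illustrative only: tallying a hypothetical incident log to expose the "pyramid" shape.
from collections import Counter

incident_log = [
    {"id": "INC-101", "severity": "major",     "summary": "Core switch failover loop"},
    {"id": "INC-102", "severity": "minor",     "summary": "Single node out of memory, auto-restarted"},
    {"id": "INC-103", "severity": "near_miss", "summary": "Certificate expiry caught two days out"},
    {"id": "INC-104", "severity": "near_miss", "summary": "Backup job silently skipped a volume"},
    {"id": "INC-105", "severity": "minor",     "summary": "Degraded DNS resolution for five minutes"},
]

counts = Counter(entry["severity"] for entry in incident_log)
majors = counts.get("major", 0) or 1  # avoid division by zero in a quiet period

print("Incident pyramid snapshot:")
for severity in ("major", "minor", "near_miss"):
    print(f"  {severity:>9}: {counts.get(severity, 0)}")

misses = counts.get("minor", 0) + counts.get("near_miss", 0)
print(f"  misses recorded per major outage: {misses / majors:.1f}")
```

The exact figures matter less than the habit: if near misses never make it into the log, the base of the pyramid stays invisible and the next major outage looks like it came from nowhere.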
5. IT’s Broken View of Crisis Management 🧯🏚️📉
Look at today’s job listings: crisis management in IT is often viewed as a low-tier function. Major incident managers are relegated to secondary support roles. In contrast, other industries—aviation, nuclear, military—treat crisis leadership as a specialised and respected discipline.
IT lacks maturity in crisis management
Specialist roles for incident commanders and safety engineers are rare
The incident process is often ad hoc, not well rehearsed
This must change. IT must embrace structured operations and learn from data centre protocols to instil crisis discipline. 📚🎓🔐
6. Building a Proactive Approach to SHIT 🛠️🧠🏗️
SHIT will happen—but how you deal with it defines your operational maturity. A well-designed Major Incident process should:
Be pre-defined, not created on the fly
Provide a central point of truth for decision making
Feed back into preventive actions and architectural improvements
Align with other governance and operational processes
When well executed, it improves agility, resilience, and leadership confidence. 💪📘🧭
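As a sketch of what “pre-defined” and “central point of truth” can look like in practice, the snippet below models a major incident record in Python. The class, field names, and workflow are assumptions made for illustration; they are not a reference implementation of any particular framework.

```python
# Illustrative sketch: a minimal, pre-defined structure acting as the single point of truth
# during a major incident. Names and fields are assumed for the example.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MajorIncident:
    title: str
    commander: str                                   # role assigned up front, not invented mid-crisis
    timeline: list = field(default_factory=list)     # timestamped decisions and observations
    follow_ups: list = field(default_factory=list)   # feeds preventive actions and architecture work

    def log(self, entry: str) -> None:
        """Append a timestamped entry so everyone works from one shared record."""
        self.timeline.append((datetime.now(timezone.utc).isoformat(), entry))

    def add_follow_up(self, action: str) -> None:
        """Capture improvements while context is fresh, for review after the incident."""
        self.follow_ups.append(action)

incident = MajorIncident(title="Payment API latency spike", commander="on-call incident commander")
incident.log("Declared major incident; paged database and network teams")
incident.log("Decision: fail traffic over to the secondary region")
incident.add_follow_up("Add alerting on connection pool saturation")

for timestamp, entry in incident.timeline:
    print(timestamp, "|", entry)
```

The value is less in the code than in the discipline it encodes: roles decided in advance, one shared timeline, and follow-ups that flow back into prevention.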
7. Data Centre Assessments, Documentation, & Dealing with Murphy 🧾📋👨🔧
Conducting formal assessments, maintaining an up-to-date incident log, and performing root cause analyses (plural, not singular!) are key. And let’s not forget the most important job:
Show Murphy who’s boss. 😈🛡️📏
Design to detect degraded states (a short check sketch follows this list)
Validate assumptions regularly
Invest in engineering to preempt real failures
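As promised above, here is a minimal Python sketch of validating assumptions and detecting degraded states before they become outages. The metric names, thresholds, and observed values are hypothetical; in practice they would come from your monitoring stack.

```python
# Illustrative sketch: comparing explicit operating assumptions against observed metrics
# to flag degraded (but not yet down) states. All names and numbers are hypothetical.
ASSUMPTIONS = {
    # metric: (kind, limit), i.e. the condition we assume normally holds
    "replication_lag_seconds": ("max", 5.0),    # replicas stay within 5 seconds
    "error_rate_percent":      ("max", 1.0),    # fewer than 1% of requests fail
    "spare_capacity_percent":  ("min", 30.0),   # at least 30% headroom remains
}

observed = {
    "replication_lag_seconds": 42.0,            # lagging badly, yet nothing is "down"
    "error_rate_percent": 0.4,
    "spare_capacity_percent": 12.0,             # headroom quietly eroded
}

def degraded_states(assumptions: dict, metrics: dict) -> list:
    """Return the assumptions that current measurements violate or cannot confirm."""
    findings = []
    for name, (kind, limit) in assumptions.items():
        value = metrics.get(name)
        if value is None:
            findings.append(f"{name}: no data, assumption unvalidated")
        elif kind == "max" and value > limit:
            findings.append(f"{name}: {value} exceeds assumed maximum of {limit}")
        elif kind == "min" and value < limit:
            findings.append(f"{name}: {value} below assumed minimum of {limit}")
    return findings

for finding in degraded_states(ASSUMPTIONS, observed):
    print("DEGRADED:", finding)
```

A check like this never declares the system healthy; it only reports which assumptions no longer hold, which is exactly the degraded-but-running state that otherwise goes unnoticed.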
Wrap | The Future of Stability Requires a Change in Mindset 🧠🔄🚦
Technology powers everything—from healthcare and banking to traffic lights and toasters. When it fails, the consequences ripple far and wide. It’s no longer acceptable to wait for SHIT to hit the fan and scramble for duct tape. 💥🩹🚫
Just because outages aren’t recorded doesn’t mean they aren’t happening.
Just because there hasn’t been a crisis yet doesn’t mean one isn’t imminent.
If you want IT to be reliable, you must build processes that accept failure, study failure, and adapt in the face of failure. 🔁📚🧠
In Forrest Gump, the title character is jokingly credited with inspiring the phrase “shit happens”. In reality, the sentiment has been part of life far longer. In tech, it stands for something even more real:
Significant Havoc in Technology.
And it happens. Always. Be ready for it. 🧯🧨🚧
Written by Ronald Bartels | Driving SD-WAN Adoption in South Africa