🚂 Engineering the Truth | Methodologies for Analysing & Solving Outages in Network Operations


Outages are never just technical glitches; they are symptoms of deeper issues: poor design choices, unvalidated assumptions, or breakdowns in operational processes. Every service interruption is both an opportunity and a responsibility: to investigate not only what broke, but why it broke, and how it can be prevented in the future.
This article lays out a methodology network operators and engineers can adopt for systematically analysing and resolving the causes behind major incidents, classifying them accurately, and using the insights gained to fortify the network architecture and operational response.
Major Incidents & Crises | The Canary in the Coalmine
The causes of major incidents are rarely isolated. A failed network device might point to an overloaded system, a flawed change process, or even a broken monitoring threshold. Equally telling is how an incident was handled—delays in detection, misclassification, misrouted support tickets, or poor communication often reveal weaknesses in operational readiness.
A structured methodology is essential, starting with classification and culminating in a “lessons learnt” review that feeds a live knowledge base.
Classification | Your First Defence Against Chaos
Effective classification at the moment of triage is the first step in establishing control and accountability. Classification ensures:
Incidents are routed to the correct team without delay.
Workarounds and solutions from previous incidents can be quickly retrieved.
Diagnostics are streamlined with the right supporting data.
Historical patterns are recognised and added to a knowledge base.
The incident is handled with the appropriate urgency and visibility.
Operational Outage Classification Matrix
| Priority | Response Model |
| --- | --- |
| Critical | All-hands, vendor escalation, leadership informed. |
| High | Technicians diverted from other work; immediate triage. |
| Medium | Standard processes within working hours. |
| Low | Background effort, scheduled as resources permit. |
Each priority is driven by urgency, scope, and business impact.
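To make the matrix concrete, here is a minimal Python sketch of priority-based routing at triage time. The team names, notification targets, and actions are hypothetical placeholders, not a prescribed standard.

```python
# A minimal sketch of priority-based routing at triage time.
# Priority labels mirror the matrix above; teams and actions are hypothetical.

RESPONSE_MODEL = {
    "critical": {"all_hands": True,  "escalate_vendor": True,  "notify": ["noc", "leadership"]},
    "high":     {"all_hands": False, "escalate_vendor": False, "notify": ["noc"]},
    "medium":   {"all_hands": False, "escalate_vendor": False, "notify": ["service-desk"]},
    "low":      {"all_hands": False, "escalate_vendor": False, "notify": []},
}

def route_incident(priority: str) -> dict:
    """Return the response actions for a triaged priority."""
    try:
        return RESPONSE_MODEL[priority.lower()]
    except KeyError:
        raise ValueError(f"Unknown priority: {priority!r}")

# Example: a critical outage triggers vendor escalation and leadership visibility.
print(route_incident("Critical"))
```

The point is not the code itself but the discipline it encodes: the decision about who gets pulled in is made once, up front, instead of being renegotiated during every outage.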
Defining the Impact | Service Period & Consequence
Categorising an outage requires a clear understanding of two dimensions:
1. Service Period
Measured in downtime and degradation:
Critical – Link or service down > 4 hours; business impacted > 1 month.
Major – Down > 1 hour or degraded > 4 hours; impact > 1 week.
Moderate – Down > 30 min or degraded > 1 hour; impact > 1 day.
Minor/Low – Impact measured in minutes or hours; minimal business effect.
2. Service Consequence
Measured in financial, reputational, or legal terms:
Critical – R100m+ loss, litigation, death/disability.
Major – R10m+ loss, sanctions, serious brand damage.
Moderate – R1m+ loss, embarrassment, hospitalisation.
Minor/Low – R50k–R100k loss, process breaches, irritation.
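One practical way to turn these two dimensions into a single triage rating is to score each dimension separately and take the worse of the two. The sketch below illustrates that approach using the thresholds above; the combination rule itself is an assumption, not a mandated formula.

```python
# A minimal sketch: rate service period and service consequence separately,
# then take the worse of the two. Thresholds follow the figures above;
# the "worse of the two" rule is an assumption for illustration.

SEVERITIES = ["minor", "moderate", "major", "critical"]  # ascending order

def period_severity(hours_down: float) -> str:
    if hours_down > 4:
        return "critical"
    if hours_down > 1:
        return "major"
    if hours_down > 0.5:
        return "moderate"
    return "minor"

def consequence_severity(loss_rand: float) -> str:
    if loss_rand >= 100_000_000:
        return "critical"
    if loss_rand >= 10_000_000:
        return "major"
    if loss_rand >= 1_000_000:
        return "moderate"
    return "minor"

def overall_severity(hours_down: float, loss_rand: float) -> str:
    """Classify on the worse of the two dimensions."""
    return max(period_severity(hours_down), consequence_severity(loss_rand),
               key=SEVERITIES.index)

# Example: a 2-hour outage with an estimated R12m loss classifies as major.
print(overall_severity(hours_down=2, loss_rand=12_000_000))
```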
Root Cause Analysis | Solving the Problem Behind the Problem
When diagnosing a major incident, resist the urge to leap to a fix. Instead, ask:
Was the incident detected early enough?
Were the diagnostics and tools sufficient?
Was there a delay in resolution?
Were workarounds applied, and could they be improved?
Break the problem down:
Understand each sub-component.
Consider external triggers like changes or unplanned maintenance.
Avoid solving the symptom instead of the root issue.
💡 Pro tip: The initial description of a problem often reflects a preconceived solution. Challenge it.
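One way to answer the detection and delay questions with numbers rather than impressions is to reconstruct the incident timeline. The sketch below uses hypothetical timestamps and field names to compute time-to-detect and time-to-resolve.

```python
# A minimal sketch (hypothetical timestamps and field names) that reconstructs
# an incident timeline to quantify detection and resolution delays.

from datetime import datetime

incident = {
    "fault_started":  datetime(2024, 3, 1, 8, 0),    # when the link actually went down
    "fault_detected": datetime(2024, 3, 1, 8, 40),   # first alert or ticket raised
    "fault_resolved": datetime(2024, 3, 1, 12, 15),  # service restored
}

time_to_detect = incident["fault_detected"] - incident["fault_started"]
time_to_resolve = incident["fault_resolved"] - incident["fault_detected"]

print(f"Time to detect:  {time_to_detect}")   # 0:40:00 - a monitoring gap?
print(f"Time to resolve: {time_to_resolve}")  # 3:35:00 - a diagnostics or escalation delay?
```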
Institutional Memory | Lessons Learnt, Not Repeated
Learning from incidents requires more than postmortems; it demands institutionalising those lessons.
The After Action Review (AAR)
A concept born out of military necessity (first used by the U.S. Marines post-Iwo Jima), the AAR is a focused discussion structured around four questions:
What was supposed to happen?
What actually happened?
Why was there a difference?
What can we learn from this?
This methodology also underpins how NASA dissected the Apollo 13 crisis—by examining teamwork, communication, decision-making, and innovation under pressure.
Creating a Living Knowledge Base
Every incident solved is a future incident prevented, provided the resolution is documented properly. A central, searchable knowledge base is not optional; it is an operational amplifier:
Saves time by reusing known fixes.
Reduces dependency on institutional memory stuck in individual heads.
Enhances cross-team collaboration.
Improves MTTR (mean time to repair) across the board.
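As an illustration, a knowledge base entry can be as simple as a structured record with searchable tags. The fields below are hypothetical, and a real implementation would live in a ticketing or wiki system rather than in memory; the sketch only shows how documented fixes become retrievable and how MTTR can be tracked from them.

```python
# A minimal sketch of searchable knowledge base entries with hypothetical fields.

from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    title: str
    symptoms: str
    root_cause: str
    fix: str
    tags: list = field(default_factory=list)
    repair_minutes: int = 0  # time from detection to restoration

kb = [
    KnowledgeEntry("BGP flap on core router", "intermittent packet loss",
                   "MTU mismatch after change", "align MTU, reopen session",
                   tags=["bgp", "core"], repair_minutes=95),
    KnowledgeEntry("DNS resolution failures", "slow page loads",
                   "expired forwarder config", "update forwarders",
                   tags=["dns"], repair_minutes=40),
]

def search(tag: str):
    """Retrieve previously documented fixes by tag."""
    return [e for e in kb if tag in e.tags]

# Reusing a known fix instead of rediscovering it is what moves MTTR.
mttr = sum(e.repair_minutes for e in kb) / len(kb)
print([e.title for e in search("bgp")], f"MTTR: {mttr:.0f} min")
```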
Tips for Structured Problem Solving
Examine the problem before seeking a solution.
Break the issue into bite-sized parts.
Formulate the right questions—they drive the quality of answers.
Avoid the trap of solving a familiar but irrelevant problem.
Generate multiple solution options before committing.
Ensure your solution is technically sound and socially acceptable.
Don’t delay—procrastination gives problems space to grow.
Denial is not a mitigation strategy.
Wrap | Outages Are Inevitable—Stupidity Isn’t
No operation is perfect. But that doesn’t mean we shouldn’t strive for operational excellence. Classify smartly. Solve deeply. Document diligently. And above all—learn.
Because a failure is only a failure if you repeat it.