🚂 Engineering the Truth | Methodologies for Analysing & Solving Outages in Network Operations


Outages are never just technical glitches; they are symptoms of deeper issues: poor design choices, unvalidated assumptions, or breakdowns in operational processes. Every service interruption is both an opportunity and a responsibility: to investigate not only what broke, but why it broke, and how it can be prevented in the future.
This article lays out a methodology network operators and engineers can adopt for systematically analysing and resolving the causes behind major incidents, classifying them accurately, and using the insights gained to fortify the network architecture and operational response.
Major Incidents & Crises | The Canary in the Coalmine
The causes of major incidents are rarely isolated. A failed network device might point to an overloaded system, a flawed change process, or even a broken monitoring threshold. Equally telling is how an incident was handled—delays in detection, misclassification, misrouted support tickets, or poor communication often reveal weaknesses in operational readiness.
A structured methodology is essential, starting with classification and culminating in a “lessons learnt” review that feeds a live knowledge base.
Classification | Your First Defence Against Chaos
Effective classification at the moment of triage is the first step in establishing control and accountability. Classification ensures:
Incidents are routed to the correct team without delay.
Workarounds and solutions from previous incidents can be quickly retrieved.
Diagnostics are streamlined with the right supporting data.
Historical patterns are recognised and added to a knowledge base.
The incident is handled with the appropriate urgency and visibility.
Operational Outage Classification Matrix
| Priority | Response Model |
| --- | --- |
| Critical | All-hands, vendor escalation, leadership informed. |
| High | Technicians diverted from other work; immediate triage. |
| Medium | Standard processes within working hours. |
| Low | Background effort, scheduled as resources permit. |
Each priority is driven by urgency, scope, and business impact.
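To make the matrix concrete, here is a minimal Python sketch of priority-based routing at triage time. The team names, notification targets, and actions are hypothetical placeholders, not a prescribed standard.

```python
# A minimal sketch of priority-based routing at triage time.
# Priority labels mirror the matrix above; teams and actions are hypothetical.

RESPONSE_MODEL = {
    "critical": {"all_hands": True,  "escalate_vendor": True,  "notify": ["noc", "leadership"]},
    "high":     {"all_hands": False, "escalate_vendor": False, "notify": ["noc"]},
    "medium":   {"all_hands": False, "escalate_vendor": False, "notify": ["service-desk"]},
    "low":      {"all_hands": False, "escalate_vendor": False, "notify": []},
}

def route_incident(priority: str) -> dict:
    """Return the response actions for a triaged priority."""
    try:
        return RESPONSE_MODEL[priority.lower()]
    except KeyError:
        raise ValueError(f"Unknown priority: {priority!r}")

# Example: a critical outage triggers vendor escalation and leadership visibility.
print(route_incident("Critical"))
```

The point is not the code itself but the discipline it encodes: the decision about who gets pulled in is made once, up front, instead of being renegotiated during every outage.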
Defining the Impact | Service Period & Consequence
Categorising an outage requires a clear understanding of two dimensions:
1. Service Period
Measured in downtime and degradation:
Critical – Link or service down > 4 hours; business impacted > 1 month.
Major – Down > 1 hour or degraded > 4 hours; impact > 1 week.
Moderate – Down > 30 min or degraded > 1 hour; impact > 1 day.
Minor/Low – Impact measured in minutes or hours; minimal business effect.
2. Service Consequence
Measured in financial, reputational, or legal terms:
Critical – R100m+ loss, litigation, death/disability.
Major – R10m+ loss, sanctions, serious brand damage.
Moderate – R1m+ loss, embarrassment, hospitalisation.
Minor/Low – R50k–R100k loss, process breaches, irritation.
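One practical way to turn these two dimensions into a single triage rating is to score each dimension separately and take the worse of the two. The sketch below illustrates that approach using the thresholds above; the combination rule itself is an assumption, not a mandated formula.

```python
# A minimal sketch: rate service period and service consequence separately,
# then take the worse of the two. Thresholds follow the figures above;
# the "worse of the two" rule is an assumption for illustration.

SEVERITIES = ["minor", "moderate", "major", "critical"]  # ascending order

def period_severity(hours_down: float) -> str:
    if hours_down > 4:
        return "critical"
    if hours_down > 1:
        return "major"
    if hours_down > 0.5:
        return "moderate"
    return "minor"

def consequence_severity(loss_rand: float) -> str:
    if loss_rand >= 100_000_000:
        return "critical"
    if loss_rand >= 10_000_000:
        return "major"
    if loss_rand >= 1_000_000:
        return "moderate"
    return "minor"

def overall_severity(hours_down: float, loss_rand: float) -> str:
    """Classify on the worse of the two dimensions."""
    return max(period_severity(hours_down), consequence_severity(loss_rand),
               key=SEVERITIES.index)

# Example: a 2-hour outage with an estimated R12m loss classifies as major.
print(overall_severity(hours_down=2, loss_rand=12_000_000))
```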
Root Cause Analysis | Solving the Problem Behind the Problem
When diagnosing a major incident, resist the urge to leap to a fix. Instead, ask:
Was the incident detected early enough?
Were the diagnostics and tools sufficient?
Was there a delay in resolution?
Were workarounds applied, and could they be improved?
Break the problem down:
Understand each sub-component.
Consider external triggers like changes or unplanned maintenance.
Avoid solving the symptom instead of the root issue.
💡 Pro tip: The initial description of a problem often reflects a preconceived solution. Challenge it.
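One way to answer the detection and delay questions with numbers rather than impressions is to reconstruct the incident timeline. The sketch below uses hypothetical timestamps and field names to compute time-to-detect and time-to-resolve.

```python
# A minimal sketch (hypothetical timestamps and field names) that reconstructs
# an incident timeline to quantify detection and resolution delays.

from datetime import datetime

incident = {
    "fault_started":  datetime(2024, 3, 1, 8, 0),    # when the link actually went down
    "fault_detected": datetime(2024, 3, 1, 8, 40),   # first alert or ticket raised
    "fault_resolved": datetime(2024, 3, 1, 12, 15),  # service restored
}

time_to_detect = incident["fault_detected"] - incident["fault_started"]
time_to_resolve = incident["fault_resolved"] - incident["fault_detected"]

print(f"Time to detect:  {time_to_detect}")   # 0:40:00 - a monitoring gap?
print(f"Time to resolve: {time_to_resolve}")  # 3:35:00 - a diagnostics or escalation delay?
```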
Institutional Memory | Lessons Learnt, Not Repeated
Learning from incidents requires more than postmortems; it demands institutionalising those lessons.
The After Action Review (AAR)
A concept born out of military necessity (first used by the U.S. Marines post-Iwo Jima), the AAR is a focused discussion structured around four questions:
What was supposed to happen?
What actually happened?
Why was there a difference?
What can we learn from this?
This methodology also underpins how NASA dissected the Apollo 13 crisis—by examining teamwork, communication, decision-making, and innovation under pressure.
Creating a Living Knowledge Base
Every incident solved is a future incident prevented, provided the resolution is documented properly. A central, searchable knowledge base is not optional; it is an operational amplifier:
Saves time by reusing known fixes.
Reduces dependency on institutional memory stuck in individual heads.
Enhances cross-team collaboration.
Improves MTTR (mean time to repair) across the board.
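As an illustration, a knowledge base entry can be as simple as a structured record with searchable tags. The fields below are hypothetical, and a real implementation would live in a ticketing or wiki system rather than in memory; the sketch only shows how documented fixes become retrievable and how MTTR can be tracked from them.

```python
# A minimal sketch of searchable knowledge base entries with hypothetical fields.

from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    title: str
    symptoms: str
    root_cause: str
    fix: str
    tags: list = field(default_factory=list)
    repair_minutes: int = 0  # time from detection to restoration

kb = [
    KnowledgeEntry("BGP flap on core router", "intermittent packet loss",
                   "MTU mismatch after change", "align MTU, reopen session",
                   tags=["bgp", "core"], repair_minutes=95),
    KnowledgeEntry("DNS resolution failures", "slow page loads",
                   "expired forwarder config", "update forwarders",
                   tags=["dns"], repair_minutes=40),
]

def search(tag: str):
    """Retrieve previously documented fixes by tag."""
    return [e for e in kb if tag in e.tags]

# Reusing a known fix instead of rediscovering it is what moves MTTR.
mttr = sum(e.repair_minutes for e in kb) / len(kb)
print([e.title for e in search("bgp")], f"MTTR: {mttr:.0f} min")
```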
Tips for Structured Problem Solving
Examine the problem before seeking a solution.
Break the issue into bite-sized parts.
Formulate the right questions—they drive the quality of answers.
Avoid the trap of solving a familiar but irrelevant problem.
Generate multiple solution options before committing.
Ensure your solution is technically sound and socially acceptable.
Don’t delay—procrastination gives problems space to grow.
Denial is not a mitigation strategy.
Wrap | Outages Are Inevitable—Stupidity Isn’t
No operation is perfect. But that doesn’t mean we shouldn’t strive for operational excellence. Classify smartly. Solve deeply. Document diligently. And above all—learn.
Because a failure is only a failure if you repeat it.