Redundancy in networking is a cornerstone of resilience. From switch port bonds and RAID arrays to dual PSUs and backup links, these systems are designed to keep the lights on when things go wrong. However, the efficacy of redundancy depends not just on having a backup, but on actively monitoring its readiness. Unfortunately, many operators—particularly in South Africa—fail to maintain or monitor their redundant systems effectively, leaving their networks one failure away from catastrophe.

The Illusion of Redundancy

The phrase "We have redundancy; we're all good!" can be misleading. Too often, redundancy is treated as a set-it-and-forget-it solution. Here's how this mindset manifests in real-world scenarios:

Dual Power Failures in Data Centres: The first power supply fails but goes unnoticed. When the second PSU inevitably fails, the entire system goes down.
Backup Links That Don’t Back Up: Redundant links might exist, but they lack the capacity or configuration to handle traffic during a primary link failure.
Fibre Breaks: A secondary fibre link meant to provide failover protection remains broken for weeks or months because no one notices the initial fault until the primary link also fails.
Switch Port Bonds: A failed port in a bond reduces throughput capacity, but unless monitored, the degradation goes unnoticed until critical performance thresholds are breached.

Vendor Solutions vs Reality

Many vendor redundancy solutions assume a higher level of operational discipline than operators often have in place. Vendor-designed systems typically require:

Proactive Monitoring: Continuous checks on all redundant components to ensure functionality.
Regular Maintenance: Swift action on alarms to restore full redundancy.
Operational Maturity: Structured processes for failure response and repair.

However, in environments where operational processes are less mature, as is the case at some South African operators, these assumptions fall apart. As a result, failures in redundant systems can persist unnoticed, eroding the very resilience they’re supposed to ensure.

The Consequences of Unmonitored Redundancy

Failing to monitor redundancies leads to a cascade of problems:

Increased Downtime: Once the first failure goes unnoticed, the system is effectively running without a backup, so the next failure becomes a full outage.
Customer Impact: Outages affect service levels and erode trust, particularly for mission-critical applications.
Operational Inefficiency: Repairing two failures at once, during an outage, is far harder than addressing a single fault while the service is still running.

Best Practices for Redundancy Monitoring

To prevent redundancy from becoming a false sense of security, operators must go beyond basic element monitoring. A proactive approach involves pairing redundant components in the network monitoring system and treating redundancy failures as distinct, high-priority alerts.

1. Monitor Redundant Pairs

Pair redundant elements in the network management system (NMS) and track their status together.
Configure the NMS to trigger a distinct "redundancy failure" alert when one component of the pair goes offline, even if the system remains functional.
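
As a rough illustration of the pairing idea, the sketch below models a redundant pair as a single object with its own state. The names `RedundantPair`, `PairState` and `redundancy_alert` are hypothetical and do not come from any particular NMS; a real system would feed the up/down flags from its own polling or telemetry.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PairState(Enum):
    FULL = "full redundancy"
    DEGRADED = "redundancy degraded"
    FAILED = "critical failure"


@dataclass(frozen=True)
class RedundantPair:
    """Two components that back each other up, tracked as one unit."""
    name: str
    a_up: bool
    b_up: bool

    def state(self) -> PairState:
        if self.a_up and self.b_up:
            return PairState.FULL
        if self.a_up or self.b_up:
            # Service still works, but the next failure is an outage.
            return PairState.DEGRADED
        return PairState.FAILED


def redundancy_alert(pair: RedundantPair) -> Optional[str]:
    """Return an alert message for any pair that is not fully redundant."""
    state = pair.state()
    if state is PairState.FULL:
        return None
    return f"{pair.name}: {state.value}"
```

The point of the design is that a degraded pair produces its own alert, whichever side failed, instead of being buried among ordinary element-down events.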

2. Conduct Regular Health Checks

Automate periodic testing of backup systems to ensure they are operational.
For backup links, simulate traffic failover scenarios to verify capacity and performance.
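
Before running a live failover test, a simple arithmetic check can show whether the backup link could carry the load at all. The sketch below is only an illustration: the function name, the Mbps inputs and the 80% utilisation ceiling are assumptions, and the traffic figures would come from your own measurements.

```python
def backup_can_carry(primary_peak_mbps: float,
                     backup_capacity_mbps: float,
                     backup_current_mbps: float = 0.0,
                     max_utilisation: float = 0.8) -> bool:
    """Return True if the backup link can absorb the primary's peak load.

    max_utilisation is an assumed planning ceiling (80% here), not a standard.
    """
    # After failover the backup carries its own traffic plus the primary's.
    load_after_failover = primary_peak_mbps + backup_current_mbps
    return load_after_failover <= backup_capacity_mbps * max_utilisation
```

For example, a 1 Gbps backup that already carries 200 Mbps cannot take over a primary that peaks at 700 Mbps under this ceiling, even though the link itself is up.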

3. Prioritise Redundancy Alerts

Redundancy failure alerts should be categorised as high priority, even if the system appears to be operating normally.
Use colour coding in dashboards to highlight degraded redundancy states, making them visually prominent.

4. Track Time-to-Repair (TTR)

Monitor how long redundant systems remain in a degraded state.
Establish and enforce strict TTR metrics for restoring redundancy, even if there is no immediate impact on service.
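
One way to make the TTR metric concrete is to record when a pair became degraded and compare the elapsed time against a target. In this minimal sketch the 24-hour target and the function names are assumptions chosen for illustration; the timestamps would come from your alert history.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed target: restore redundancy within 24 hours. Pick your own value.
TTR_TARGET = timedelta(hours=24)


def degraded_duration(degraded_since: datetime,
                      restored_at: Optional[datetime],
                      now: datetime) -> timedelta:
    """How long the pair has been (or was) running without its backup."""
    end = restored_at if restored_at is not None else now
    return end - degraded_since


def ttr_breached(degraded_since: datetime,
                 restored_at: Optional[datetime],
                 now: datetime,
                 target: timedelta = TTR_TARGET) -> bool:
    """True if redundancy was not restored within the target."""
    return degraded_duration(degraded_since, restored_at, now) > target
```

Passing `restored_at=None` treats the fault as still open, so the same check works for live dashboards and for after-the-fact reporting.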

5. Implement Cross-Monitoring for Dependencies

Ensure that redundancy monitoring extends to related systems, such as:
- PSUs: Both power supplies in dual systems should be monitored independently.
- Network Links: Track both the primary and secondary paths.
- Switch Ports: Monitor individual links in port-channel configurations for failure or degradation.
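
For port-channels in particular, the useful signal is the state of each member link, not just the state of the bundle. The following sketch assumes a hypothetical `bond_health` helper that receives per-member up/down flags and a uniform member speed; how those flags are collected (SNMP, streaming telemetry, CLI scraping) is left open.

```python
from typing import Dict


def bond_health(members: Dict[str, bool], member_speed_gbps: float) -> dict:
    """Summarise a port bond from the up/down state of each member link."""
    up = [name for name, is_up in members.items() if is_up]
    failed = [name for name, is_up in members.items() if not is_up]
    return {
        "configured_gbps": len(members) * member_speed_gbps,
        "available_gbps": len(up) * member_speed_gbps,
        "failed_members": failed,
        # Degraded: the bond still passes traffic but has lost capacity.
        "degraded": bool(failed) and bool(up),
        # Down: no member is left to carry traffic.
        "down": not up,
    }
```

A four-member bond with one failed port still looks "up" to a basic check, but this summary would report it as degraded with a quarter of its capacity missing.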

Technical Implementation Example

To implement redundancy monitoring effectively:

Define Pair Relationships:
- In your NMS or telemetry system, define relationships between primary and redundant components.
- Example: Pair fibre paths A and B as "redundant links."
Set Up Redundancy Alerts:
- Create an alert condition that triggers when one component of a pair is offline. For instance:
```
  IF (Component_A == Down) XOR (Component_B == Down)
  THEN Raise "Redundancy Degraded" Alert
```
Visualise Redundancy State:
- Use dashboards to display redundancy health clearly. For example, mark "full redundancy" in green, "degraded redundancy" in yellow, and "critical failure" in red.
Test Regularly:
- Automate failover testing for redundant systems and include results in routine reporting.
Integrate with Escalation Policies:
- Ensure redundancy failures escalate to higher operational tiers if unresolved within a predefined timeframe.
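
The visualisation and escalation steps can be sketched together as two small lookups. The colour mapping follows the convention described above; the escalation ladder, its tier names and its time thresholds are purely hypothetical and would need to match your own operational structure.

```python
from datetime import timedelta

# Dashboard colours for each redundancy state, as described above.
STATE_COLOURS = {"full": "green", "degraded": "yellow", "failed": "red"}

# Hypothetical escalation ladder: (unresolved for at least, tier to notify).
# Entries must be listed in ascending order of threshold.
ESCALATION_STEPS = [
    (timedelta(hours=0), "NOC"),
    (timedelta(hours=4), "Tier 2 engineering"),
    (timedelta(hours=24), "Operations manager"),
]


def escalation_tier(unresolved_for: timedelta) -> str:
    """Return the highest tier whose threshold has been reached."""
    tier = ESCALATION_STEPS[0][1]
    for threshold, name in ESCALATION_STEPS:
        if unresolved_for >= threshold:
            tier = name
    return tier
```

Keeping the ladder as data rather than code makes it easy to review and adjust the thresholds without touching the alerting logic.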

Wrap-Up

Redundancy is not a safety net unless actively monitored and maintained. Operators must shift from a reactive to a proactive mindset, treating redundancy failures as early warning signs rather than inconsequential issues. By pairing redundant components in monitoring systems, prioritising redundancy alerts, and enforcing strict repair timelines, network operators can ensure their redundant systems remain reliable, robust, and ready for action when the inevitable happens.

In a world where uptime is paramount, redundancy monitoring is not optional—it’s essential.

⚠️Redundancy Monitoring | Why One Failure is a Warning & Two is a Disaster🚨