🦈Network Operations Centres | The Mission Control for Business Continuity🎏

Ronald Bartels

A Network Operations Centre (NOC) is akin to the Mission Control rooms made famous by space programs like NASA's Apollo missions. Much like how Mission Control in Houston orchestrated every aspect of a space flight, a NOC is the heartbeat of an organization's IT operations, ensuring everything runs smoothly and, when necessary, responding swiftly to incidents that could disrupt business continuity.

The Evolution of NOCs | From POTS to Modern IT

NOCs have been around since the days of Plain Old Telephone Service (POTS), gradually evolving into the sophisticated command centres we see today. These centres are no longer just about keeping the lights on—they are integral to ensuring uninterrupted business operations. In today's complex IT environments, a NOC is more critical than ever, acting as the first line of defense against disruptions that can impact a company’s bottom line.

NOCs have become more sophisticated over the years.

Business Continuity | More Than Just a Disaster Recovery Plan

Traditionally, business continuity and disaster recovery (BC/DR) have been treated as separate processes. However, this separation can lead to inefficiencies. In reality, business continuity is a specialized response to major incidents—a crisis where a full workaround is required. The trigger for activating a disaster recovery plan always stems from a major incident. If this trigger comes from a reactive system like a service desk, it indicates a failure in proactive monitoring.

This is where the NOC comes into play. The NOC, not the service desk, should be the trigger point for major incidents. By monitoring the network in real-time, the NOC can detect potential issues before they escalate into full-blown crises. It ensures that business continuity plans are activated proactively, not reactively.
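As a rough illustration of the NOC acting as the trigger point, the sketch below flags a major incident once a monitored metric stays above a threshold for several samples in a row, before anyone has to phone the service desk. The metric, the threshold, the sample count and the function name are all assumptions made for this example, not a prescribed method.

```python
from collections import deque


def should_trigger_major_incident(samples, threshold, consecutive=3):
    """Return True once `consecutive` samples in a row exceed `threshold`.

    All names and values here are hypothetical; a real NOC would feed this
    from its own monitoring platform.
    """
    window = deque(maxlen=consecutive)
    for value in samples:
        window.append(value > threshold)
        if len(window) == consecutive and all(window):
            return True
    return False


# Hypothetical packet-loss percentages sampled once a minute on a WAN link.
packet_loss = [0.1, 0.3, 6.0, 7.5, 9.2, 8.8]
print(should_trigger_major_incident(packet_loss, threshold=5.0))
```

Requiring several consecutive breaches is one simple way to avoid escalating on a single spike; the right window depends on the service being monitored.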

The Modern NOC | A Blueprint for Success

Drawing parallels to the Apollo Mission Control, a well-designed NOC should have all disciplines of operations represented within it. The command structure should be integrated directly into the NOC, with the manager positioned at the coalface, similar to the Director of Flight Operations at NASA. This setup allows for quick decision-making and a coordinated response to incidents.

A modern NOC is organized into three functional areas:

  1. Command and Control: The nerve centre, where the NOC manager (equivalent to the Director of Flight Operations) oversees operations.

  2. Tiger Teams: Specialized teams that handle specific aspects of network management, such as security, server management, and data centre operations.

  3. The Trenches: The front line where metrics are monitored and analyzed in real-time. This area is staffed by experienced engineers who rotate through the NOC to ensure knowledge transfer and maintain a high level of operational awareness.

In larger organizations, the NOC might be further divided, with additional rows of consoles dedicated to specific functions like capacity management, performance monitoring, and social media surveillance. The latter is crucial in today’s world, where outages are often first reported on platforms like Twitter.

Most often, business continuity is described as a separate process that caters for a disaster. However, a separate process creates inefficiencies because, in reality, business continuity is a special case of the major incident process (a crisis) in which a full workaround is required. The trigger for business continuity or any disaster recovery initiation will always be an escalation from a major incident.

So if the trigger for implementing the disaster recovery plan comes via the major incident process, where is the trigger for major incidents? It is definitely not the service desk, because if that is the case then the Information Technology (IT) and other processes have failed! If an event with severe negative business consequences becomes known only via a reactive call ticketing system, then it is obvious that there are no proactive measures in place.

The trigger for a major incident should be escalated from the NOC, not the service desk. The NOC should have all disciplines of operations represented within it, and the command structure of the NOC should sit within that area and not outside it. The boss should be there at the coalface and not in a separate room. A great example for the NOC is the Mission Control used during the Apollo programme. The layout is below and shows the Director of Flight Operations (the boss) in the prime spot.

The Apollo Mission Control went live on 3rd June 1965, nearly 60 years ago. In space flight operations the configuration works and over time the vintage consoles and switches have been replaced by modern computers such as laptops and PCs.

Besides the history provided by Mission Control above, the objectives of any NOC should be:

  • Provide a co-ordinated, coherent and effective response to managing a major incident;

  • This is a physical location and is one of the control points for a crisis;

  • Contains the requisite tools and technology to assist in managing major incidents;

  • Manages, monitors or diverts a crisis threatening the organization or its stakeholders;

  • Deals with major incidents or other potential threats;

  • Houses the crisis management teams including the command and control structure; and

  • Communicates during a major incident including being the hub for internal and external notifications and escalations.

The above is a representation of a modern-day NOC. A fishbowl is typically a glass wall. The NOC manager assumes the position of Director of Flight Operations at position (1), and the Service Delivery Manager, who is responsible for customer interactions, assumes position (4). Positions (2) and (3) are taken up by other operations and data centre management, as well as the major incident manager, who also has a dedicated console, typically the furthest location to the right rear at position (3). Position (2) is also often used by shift supervisors in the NOC.

The NOC thus has three functional areas or stages, namely command and control, tiger teams and the trenches. The latter is the team that actually views and analyses the displayed metrics. This might consist of three rows of consoles as a minimum, but each functional area might have more rows in larger organizations, especially in the trenches. This type of NOC is commonly called a three-level NOC. Each section is signposted with a sign suspended from the ceiling, hanging perpendicular to the video wall. Alternatively, the consoles can be labelled using a Brother label printer. Below is a simplistic profile view.

The front row, which is “the Trench”, is where the steely-eyed rocket men sit. The CMOC is the location for all command and control of operations. Some companies have a Security Operations Centre (SOC). The NOC is the single point of contact and control, and a separate location such as a SOC should not exist. Physical, infrastructure and organizational security exist as positions in the trenches.

Besides the InfoSec and PhySec positions mentioned above, the other consoles should be manned by specialists in networks, storage, servers and data centre environmentals (electrical and cooling), including a safety console for fire and water. There are also consoles for faults/outages and capacity/performance. One responsibility in the NOC that is often overlooked is the monitoring of social media networks. Outages and failures in the modern world are often first reported on these platforms. It can be as simple as a search string of your company name in social media! The NOC does not duplicate or handle the functions of the service desk. The position to the front right is taken up by the NOC support resource, who is responsible for the technology used in the NOC.
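The company-name search described above can be sketched in a few lines. This is only an illustration: the posts are made up, the outage vocabulary is a guess, the company name `ExampleNet` is hypothetical, and no real social media API is called. How the posts reach the NOC is left open.

```python
import re

# Hypothetical outage vocabulary; tune it to your own services.
OUTAGE_WORDS = ("down", "outage", "offline", "not working")


def find_outage_mentions(posts, company):
    """Return posts that mention `company` together with an outage word."""
    name = re.compile(r"\b" + re.escape(company) + r"\b", re.IGNORECASE)
    hits = []
    for post in posts:
        text = post.lower()
        if name.search(post) and any(word in text for word in OUTAGE_WORDS):
            hits.append(post)
    return hits


# Made-up posts standing in for whatever feed the NOC has access to.
posts = [
    "Is ExampleNet down for anyone else?",
    "Loving the new ExampleNet pricing",
    "Power outage in my street again",
]
print(find_outage_mentions(posts, "ExampleNet"))
```

The keyword check is a plain substring match, so a word like "download" would also count as a hit. That is acceptable for a first-pass console view but would need refining before it drives any escalation.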

An important consideration is to rotate experienced engineers through the trenches to transfer knowledge. This means that even the highly paid third-level engineers need to spend some scheduled time in the trenches. It might only be a few hours per month, but it is an important part of Genchi Genbutsu!

Aviation is also a benchmark for operations, and many of the command and control functions can be adopted from air traffic control.

The Bare Necessities Dashboard

I seriously doubt any NOC in the world is mature enough to have the bare necessities, which is the information below, available on a continuously rolling slide deck.

Command & Control

SLA Statistics

Ongoing SLA or Contract Violations

Last 10 Maintenance Tasks Completed

Next 10 Maintenance Tasks Scheduled

Planned Continuity Tests Scheduled

Resources Available to the NOC

Resources Unavailable to the NOC

Changes Completed during the Past Week

Changes Scheduled for the Next Week

Emergency Changes Completed & Major Incidents in Progress

Top 10 Ongoing Projects of which the NOC should be Aware

Top 10 Devices with Temperature or Cooling Alerts

Known Problems or Symptoms of Degradation
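
Two of the ideas in the list above lend themselves to a small sketch: rolling through the panels in a loop, and ranking devices for the temperature panel. Everything here is assumed for illustration, including the device names, the readings and both function names; the panel titles are the only part taken from the list.

```python
import heapq
from itertools import cycle, islice

# Panel titles taken from the list above (shortened to three for the example).
PANELS = [
    "SLA Statistics",
    "Ongoing SLA or Contract Violations",
    "Known Problems or Symptoms of Degradation",
]


def rolling_panels(panels, count):
    """Return the next `count` panel titles, wrapping around at the end."""
    return list(islice(cycle(panels), count))


def hottest_devices(readings, limit=10):
    """Return up to `limit` (device, temperature) pairs, hottest first."""
    return heapq.nlargest(limit, readings.items(), key=lambda item: item[1])


# Hypothetical inlet temperatures in degrees Celsius.
readings = {"core-sw-01": 41.5, "edge-rtr-02": 55.0, "san-03": 48.2}
print(rolling_panels(PANELS, 4))
print(hottest_devices(readings, limit=2))
```

A real video wall would pull each panel from the relevant system of record; the point of the sketch is only that the deck keeps rolling and the hottest devices stay at the top.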

