📝 Diagnosing Network Problems | A Comprehensive Checklist ✅

Ronald Bartels

Network issues can bring business operations to a halt, making it vital to have a structured and thorough diagnostic approach. This checklist provides a step-by-step guide to identifying and resolving network issues. Each step is explained in detail, highlighting what to check, how to perform the check, and the potential symptoms or issues it addresses.


Critical Checks

These checks focus on the core of the problem and aim to address the most likely and impactful issues.

1. Description of Incident or Outage

  • What to Do: Document the symptoms observed (e.g., loss of connectivity, degraded services, or intermittent issues).

  • Why It’s Important: Understanding the problem's scope helps prioritize efforts and communicate effectively with stakeholders.

2. SLA Targets Achieved?

  • What to Do: Review whether Service Level Agreement (SLA) targets are being met for downtime or performance metrics.

  • Symptoms: SLA breaches may indicate systemic issues.

  • Fix: Escalate to higher-tier support or re-prioritize resources.

3. Last Known Change on Component Impacted

  • What to Do: Check change logs for recent updates to configurations or hardware.

  • Symptoms: New issues often correlate with recent changes.

  • Fix: Roll back changes if necessary.

4. Suspect Weather Conditions?

  • What to Do: Review local weather conditions for storms or extreme temperatures.

  • Symptoms: Weather can cause signal loss, cable damage, or power outages.

  • Fix: Mitigate environmental effects or schedule repairs.

5. Power Supply and UPS

  • What to Do: Verify power supplies and UPS systems.

  • Symptoms: Unresponsive hardware or intermittent outages.

  • Fix: Address power disruptions and ensure redundancy.

6. Component Checks

  • What to Do: Inspect the hardware for faults or warnings.

  • Symptoms: Alerts from SNMP or CLI interfaces.

  • Fix: Replace faulty components or update firmware.

7. Component Utilization Problems

  • What to Do: Monitor CPU, memory, and bandwidth usage.

  • Symptoms: High utilization causes slow responses or dropped packets.

  • Fix: Optimize load or upgrade resources.
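
As a rough illustration of the bandwidth part of this check, the sketch below turns two readings of an interface octet counter into an average utilization figure. The function name, the sample values, and the assumption that the counter wraps at most once between readings are all illustrative; how the counters are collected (SNMP, CLI, or otherwise) depends on your environment.

```python
def utilization_percent(octets_start, octets_end, interval_s, link_bps, counter_bits=64):
    """Return average link utilization (%) between two octet-counter readings."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # A counter that appears to go backwards is assumed to have wrapped once.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Hypothetical readings taken 60 seconds apart on a 100 Mbit/s port.
print(f"{utilization_percent(1_000_000, 601_000_000, 60, 100_000_000):.1f}%")  # prints 80.0%
```

Sustained values near 100% on either direction of a link are a reasonable trigger for the load optimisation or upgrade mentioned in the fix.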

8. Component Accessibility

  • What to Do: Check CLI and SNMP access to affected devices.

  • Symptoms: Inaccessibility may indicate hardware or network issues.

  • Fix: Restore management access or replace the device.

9. Ethernet Port Configuration

  • What to Do: Verify port speed and duplex settings.

  • Symptoms: A speed or duplex mismatch causes late collisions, CRC errors, and poor throughput.

  • Fix: Align settings to match connected devices.
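
A minimal sketch of this comparison, assuming you have already read the speed, duplex, and auto-negotiation settings from both ends of the link. The `PortSettings` class and `find_mismatches` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSettings:
    speed_mbps: int
    duplex: str      # "full" or "half"
    autoneg: bool    # True if auto-negotiation is enabled


def find_mismatches(local, remote):
    """Return a list of human-readable differences between two link ends."""
    problems = []
    if local.speed_mbps != remote.speed_mbps:
        problems.append(f"speed mismatch: {local.speed_mbps} vs {remote.speed_mbps} Mbit/s")
    if local.duplex != remote.duplex:
        problems.append(f"duplex mismatch: {local.duplex} vs {remote.duplex}")
    if local.autoneg != remote.autoneg:
        problems.append("auto-negotiation is enabled on one side only")
    return problems


# Hypothetical settings read from the two connected devices.
local = PortSettings(speed_mbps=1000, duplex="full", autoneg=True)
remote = PortSettings(speed_mbps=100, duplex="half", autoneg=False)
for problem in find_mismatches(local, remote):
    print(problem)
```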

10. CRC Errors

  • What to Do: Inspect ports for Cyclic Redundancy Check (CRC) errors.

  • Symptoms: Errors lead to packet retransmissions.

  • Fix: Replace faulty cables or clean fibre connectors.

11. Cabling Visual Check

  • What to Do: Physically inspect cables for damage or improper connections.

  • Symptoms: Frayed cables or loose connections impact signal quality.

  • Fix: Replace or secure cables.

12. Correct Rate Limit Applied

  • What to Do: Validate rate limits against the customer order.

  • Symptoms: Misconfigured limits reduce throughput.

  • Fix: Adjust configurations.

13. Congestion on Transmission Paths

  • What to Do: Check traffic patterns and load.

  • Symptoms: High latency or packet loss.

  • Fix: Re-route traffic or upgrade bandwidth.

14. Layer 2 Loop Symptoms

  • What to Do: Look for STP logs or unusual broadcast storms.

  • Symptoms: Network instability and slowdowns.

  • Fix: Correct loop sources or enable spanning tree protocols.


Important Checks

These provide additional context and help isolate complex issues.

1. Impact on Business

  • What to Do: Identify how the issue disrupts operations.

  • Symptoms: Lost revenue, reduced productivity.

  • Fix: Focus remediation on critical business processes.

2. Incident Timeline

  • What to Do: Record the start, detection, and resolution times.

  • Symptoms: Delayed responses exacerbate problems.

  • Fix: Improve monitoring and alerting.

3. Hardware Failures

  • What to Do: Check logs for hardware-related errors.

  • Symptoms: Device malfunctions or overheating.

  • Fix: Replace hardware.

4. Ethernet OAM Tests

  • What to Do: Run Ethernet OAM diagnostics.

  • Symptoms: Link issues on the last mile.

  • Fix: Coordinate with ISPs for repairs.

5. Signal Loss or Fibre Damage

  • What to Do: Test fibre optic links for signal strength.

  • Symptoms: Intermittent or no connectivity.

  • Fix: Clean, replace, or re-splice fibre cables.


Informational Checks

These help gather background information and provide clarity.

1. Photos & Inspections

  • What to Do: Take photos and perform physical inspections.

  • Symptoms: Documentation aids in troubleshooting.

  • Fix: Identify overlooked physical issues.

2. Solution Diagrams & Configurations

  • What to Do: Reference network diagrams and configurations.

  • Symptoms: Misconfigurations can cause unexpected behaviour.

  • Fix: Update diagrams and verify configurations.


Timelines

Recording event timings is critical for SLA analysis and post-mortem reviews:

  • Time of Incident Start: Establish the onset.

  • Time of Detection: Measure responsiveness.

  • Time of Repair and Recovery: Document resolution steps.

  • Downtime Duration: Understand the impact.
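
The timings above lend themselves to simple arithmetic. The sketch below derives time to detect, time to repair, downtime, and availability over a reporting period from three timestamps. The function name, the 30-day period, and the sample times are assumptions; use whatever measurement window your SLA defines.

```python
from datetime import datetime, timedelta


def incident_metrics(started, detected, recovered, period=timedelta(days=30)):
    """Derive basic incident timings from three timestamps."""
    if not (started <= detected <= recovered):
        raise ValueError("expected started <= detected <= recovered")
    downtime = recovered - started
    return {
        "time_to_detect": detected - started,
        "time_to_repair": recovered - detected,
        "downtime": downtime,
        # Dividing two timedeltas yields a plain float ratio.
        "availability_pct": 100.0 * (1 - downtime / period),
    }


# Hypothetical incident: starts 08:00, detected 08:12, recovered 09:30.
metrics = incident_metrics(
    datetime(2024, 1, 10, 8, 0),
    datetime(2024, 1, 10, 8, 12),
    datetime(2024, 1, 10, 9, 30),
)
for name, value in metrics.items():
    print(name, value)
```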


By systematically applying checklists, you can diagnose and resolve network problems efficiently, minimising downtime and ensuring reliable service for your business.

Investigating Proximate Causes in Network Diagnoses & Troubleshooting

When troubleshooting network issues, identifying proximate causes is critical to resolving incidents quickly and effectively. The following detailed checks focus on physical, configuration, and transmission layer elements, which are often the root causes of network disruptions.


Cabling & Physical Connections

  1. Visual Check of Cabling

    • Description: Inspect cables for visible damage, loose connections, or improper routing.

    • Symptoms: Frayed or tangled cables, improper bends, or exposed wires may lead to connectivity loss or degradation.

    • Resolution: Replace damaged cables and secure connections to avoid further disruptions.

  2. Photos of Cabling

    • Description: Take photos for documentation and remote assessment.

    • Symptoms: Visual records help identify overlooked issues.

    • Resolution: Share images with team members or vendors for additional insights.

  3. Patches and Fibre Optic Cables

    • Description: Examine patch cables and fibre optics for damage or improper terminations.

    • Symptoms: Bent, cracked, or improperly connected patches can cause high attenuation.

    • Resolution: Replace or reterminate damaged patches and clean fibre connectors.

  4. SFP/XFP Ports

    • Description: Inspect transceiver modules for physical damage or malfunction.

    • Symptoms: Link flaps or complete link failures may result from damaged ports.

    • Resolution: Replace faulty transceivers.


Link & Fibre Checks

  1. Link Testing

    • Description: Perform diagnostic tests on the link, such as loopback or BER (Bit Error Rate) tests.

    • Symptoms: Errors indicate signal quality or transmission issues.

    • Resolution: Reconfigure or replace faulty transmission paths.

  2. Maximum Link Lengths

    • Description: Verify that cable lengths adhere to standard limits (e.g., a maximum of 100 m per segment for twisted-pair copper Ethernet).

    • Symptoms: Signal degradation and increased packet loss occur when limits are exceeded.

    • Resolution: Replace cables with appropriate lengths or install intermediate equipment.

  3. Fibre Pigtails

    • Description: Ensure RX (Receive) and TX (Transmit) ends are correctly connected.

    • Symptoms: Incorrect connections cause no signal transmission.

    • Resolution: Swap connections as necessary and inspect for half-breaks.

  4. Fibre Attenuation Limits

    • Description: Measure signal attenuation to ensure it is within allowable limits.

    • Symptoms: High attenuation leads to data loss and signal degradation.

    • Resolution: Clean connectors, replace damaged cables, or add optical amplification on long spans.

  5. Link Frequency/Type Compatibility

    • Description: Confirm transceivers and fibres match in type and wavelength.

    • Symptoms: Mismatched components result in no link or poor performance.

    • Resolution: Use compatible equipment.

  6. Cleaning Fibre

    • Description: Clean connectors to remove dust or debris.

    • Symptoms: Dirty connectors lead to signal loss and high error rates.

    • Resolution: Use fibre cleaning kits before reconnection.
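
To make the attenuation check above concrete, here is a minimal loss-budget sketch. It estimates the expected loss of a fibre span and compares it with the optical budget of the transceivers. Every number in it (loss per kilometre, per connector, per splice, transmit power, receiver sensitivity) is an illustrative assumption; take real figures from the cable and transceiver datasheets.

```python
def link_loss_db(length_km, fibre_db_per_km, connectors, splices,
                 connector_db=0.5, splice_db=0.1):
    """Estimate total span loss in dB from fibre, connector, and splice losses."""
    return length_km * fibre_db_per_km + connectors * connector_db + splices * splice_db


def margin_db(tx_power_dbm, rx_sensitivity_dbm, loss_db):
    """Return the remaining margin: optical budget minus estimated loss."""
    return (tx_power_dbm - rx_sensitivity_dbm) - loss_db


# Hypothetical 10 km span with 2 connectors and 4 splices.
loss = link_loss_db(length_km=10, fibre_db_per_km=0.35, connectors=2, splices=4)
margin = margin_db(tx_power_dbm=-3.0, rx_sensitivity_dbm=-20.0, loss_db=loss)
print(f"estimated loss {loss:.1f} dB, margin {margin:.1f} dB")
```

A measured loss well above the estimate points towards dirty connectors, tight bends, or a damaged section rather than a design problem.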


Configuration & Logical Checks

  1. Rate Limits

    • Description: Verify customer-ordered bandwidth limits are correctly applied.

    • Symptoms: Misconfigurations can result in slow or throttled connections.

    • Resolution: Adjust configurations to align with SLAs.

  2. Management VLANs

    • Description: Ensure VLANs are correctly provisioned for management traffic.

    • Symptoms: Incorrect VLANs disrupt device access and monitoring.

    • Resolution: Update VLAN configurations.

  3. Broadcast Traffic

    • Description: Monitor broadcast and unicast traffic ratios for anomalies.

    • Symptoms: Excessive broadcasts cause network congestion.

    • Resolution: Apply broadcast filtering and optimise traffic distribution.

  4. IP/Subnet Configuration

    • Description: Verify the assigned IP, subnet mask, and gateway configurations.

    • Symptoms: Misconfigured addresses prevent connectivity or cause routing issues.

    • Resolution: Correct the network configurations as needed.
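
The IP and subnet check above can be partly automated with Python's standard `ipaddress` module. The sketch below flags a gateway that sits outside the host's subnet and a host that has been given the network or broadcast address. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def check_ip_config(address, prefix_len, gateway):
    """Return a list of problems found in a host's IP settings (empty if none)."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    network = iface.network
    gw = ipaddress.ip_address(gateway)

    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    if gw == iface.ip:
        problems.append("gateway is the same as the host address")
    # On ordinary IPv4 subnets the first and last addresses are reserved.
    if network.version == 4 and network.prefixlen < 31:
        if iface.ip in (network.network_address, network.broadcast_address):
            problems.append(f"{iface.ip} is the network or broadcast address")
    return problems


print(check_ip_config("192.168.10.25", 24, "192.168.10.1"))  # no problems expected
print(check_ip_config("192.168.10.25", 24, "192.168.11.1"))  # gateway in the wrong subnet
```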


Advanced Transmission & Layer-2 Checks

  1. Ping and MTR Tests

    • Description: Conduct pings with varying packet sizes and MTR tests to assess connectivity and path quality.

    • Symptoms: Packet loss or high latency indicates transmission issues.

    • Resolution: Investigate faulty paths or congestion points.

  2. Congestion Issues

    • Description: Check for bottlenecks on primary or backup transmission paths.

    • Symptoms: High latency and packet drops occur under heavy traffic.

    • Resolution: Re-route traffic or upgrade link capacity.

  3. Layer 2 Loops

    • Description: Inspect for spanning tree issues or broadcast storms.

    • Symptoms: Network-wide slowdowns or outages.

    • Resolution: Enable STP or address the source of loops.

  4. Path Protection

    • Description: Check path protection configurations and logs for flapping.

    • Symptoms: Intermittent disruptions and instability in redundancy mechanisms.

    • Resolution: Correct misconfigurations and stabilise paths.
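
For the ping tests above, it helps to reduce a run of probes to a few comparable numbers. The sketch below assumes you already have a list of round-trip times in milliseconds, with `None` for each lost probe, and reports loss, minimum, average, and maximum RTT, plus a simple jitter figure (the mean difference between consecutive replies). The function name and sample data are illustrative, and other definitions of jitter exist.

```python
from statistics import mean


def summarise_rtts(rtts_ms):
    """Summarise ping results; each entry is an RTT in ms, or None if lost."""
    if not rtts_ms:
        raise ValueError("no probes supplied")
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    summary = {"loss_pct": 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)}
    if not replies:
        summary.update(min_ms=None, avg_ms=None, max_ms=None, jitter_ms=None)
        return summary
    # Jitter here is the mean absolute change between consecutive replies.
    diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
    summary.update(
        min_ms=min(replies),
        avg_ms=mean(replies),
        max_ms=max(replies),
        jitter_ms=mean(diffs) if diffs else 0.0,
    )
    return summary


# Hypothetical run of five probes, one of which was lost.
print(summarise_rtts([10.0, 12.0, None, 11.0, 30.0]))
```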


Wireless & Radio-Specific Issues

  1. Radio Interference

    • Description: Evaluate potential sources of interference affecting radio links.

    • Symptoms: Dropped signals or reduced throughput.

    • Resolution: Eliminate self-interference and external interference sources.

  2. Line of Sight (LOS)

    • Description: Inspect for physical obstructions or misalignment.

    • Symptoms: Signal degradation or complete loss of radio links.

    • Resolution: Align antennas and clear obstructions.

  3. Link Synchronisation

    • Description: Ensure links are synchronised for optimal performance.

    • Symptoms: Out-of-sync links cause jitter and packet loss.

    • Resolution: Resynchronise and optimise configuration.
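
When assessing line of sight, two standard radio calculations can support the visual inspection: free-space path loss and the radius of the first Fresnel zone at the midpoint of the link. The sketch below computes both from first principles. The 5 km, 5.8 GHz link is a made-up example, and how much of the Fresnel zone must be clear is a planning decision, not something this code decides.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def fspl_db(distance_km, freq_ghz):
    """Free-space path loss in dB for a given distance and frequency."""
    distance_m = distance_km * 1000.0
    freq_hz = freq_ghz * 1e9
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


def fresnel_radius_mid_m(distance_km, freq_ghz):
    """Radius in metres of the first Fresnel zone at the link midpoint."""
    wavelength_m = SPEED_OF_LIGHT / (freq_ghz * 1e9)
    return 0.5 * math.sqrt(wavelength_m * distance_km * 1000.0)


# Hypothetical 5 km point-to-point link at 5.8 GHz.
print(f"path loss {fspl_db(5, 5.8):.1f} dB")
print(f"first Fresnel zone radius at midpoint {fresnel_radius_mid_m(5, 5.8):.1f} m")
```

An obstruction that does not block the direct line between the antennas can still intrude into this zone and degrade the signal.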


General Path & Performance Considerations

  1. MTU Alignment

    • Description: Verify MTU settings along the transport path.

    • Symptoms: Mismatched MTUs cause fragmentation and performance degradation.

    • Resolution: Align MTUs across devices and paths.

  2. Throughput and Distance

    • Description: Measure throughput and verify it meets expected performance for the link’s distance.

    • Symptoms: Low throughput indicates potential signal loss or hardware limitations.

    • Resolution: Adjust configurations or upgrade components.
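
For the MTU alignment check, a common approach is to ping with the don't-fragment flag set and a payload sized to just fit the expected MTU. The sketch below works out that payload size and builds, without running it, a command line for the Linux iputils `ping` (`-M do` prohibits fragmentation, `-s` sets the payload size, `-c` the probe count). Other platforms use different flags, and the target address shown is a documentation placeholder.

```python
def icmp_payload_for_mtu(mtu, ipv6=False):
    """Largest ICMP echo payload that fits in one packet of the given MTU."""
    ip_header = 40 if ipv6 else 20   # fixed IPv6 header / IPv4 header without options
    icmp_header = 8
    payload = mtu - ip_header - icmp_header
    if payload < 0:
        raise ValueError("MTU is too small to carry an ICMP echo")
    return payload


def df_ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command that tests a path for the given MTU."""
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)), "-c", str(count), host]


# Placeholder target; a 1500-byte MTU leaves room for a 1472-byte payload.
print(" ".join(df_ping_command("192.0.2.1", 1500)))
```

If this size fails while smaller sizes succeed, some hop on the path has a smaller MTU than expected.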


By addressing these detailed checks systematically, network teams can effectively identify and resolve root causes of connectivity or performance issues. Each step provides insights that contribute to faster resolution times and improved network reliability.

Comprehensive Component Checklist for Diagnosing Outages & Faults

When investigating network outages or faults, conducting thorough checks on individual components is essential. The following checklist ensures a systematic approach to identifying and resolving issues, focusing on the physical, operational, and logical aspects of network components.


Component Health and Status

  1. Issues Identified by Component Checks

    • Review diagnostic logs and status indicators for any highlighted issues.
  2. Fan Speed Status

    • Ensure fan speeds are within manufacturer-recommended limits to prevent overheating.
  3. Temperature Status

    • Verify that the component operates within safe temperature ranges; overheating can cause performance degradation or failure.
  4. Power Supply Status

    • Confirm power supplies are functional, with no alarms or voltage irregularities.
  5. Power-On Self-Tests (POSTs)

    • Ensure the component passes POSTs without errors, which could indicate hardware faults.
  6. Unexplained Resets or Violations

    • Investigate logs for unexpected resets, violations, or reboots, which may signal deeper issues.

Resource and Configuration Checks

  1. CPU, Memory, and File System Status

    • Verify that all resource utilization metrics are within acceptable limits to avoid performance bottlenecks.
  2. VLAN Configuration

    • Check that VLANs are correctly provisioned and correspond to the network design.
  3. Firmware Status

    • Ensure the component runs the correct firmware version.

    • Review release notes of the latest firmware for fixes related to the observed problems.

  4. MAC Address Learning

    • Confirm that MAC addresses are learned on the expected ports; unexpected or moving entries can indicate loops or forwarding issues.

Performance and Testing

  1. RFC2544 Test Results

    • Review any available RFC 2544 benchmarking results for throughput, latency, frame loss, and back-to-back frames.
  2. SLA Measurements Using Y.1731

    • Verify SLA compliance by assessing delay, loss, and jitter measurements.
  3. Utilization and Capacity

    • Check if component utilization exceeds thresholds or if offered capacity is being exceeded.
  4. Separate Testing

    • Perform isolated tests on the component to rule out network-wide dependencies.
  5. Wireshark Analysis

    • Examine traffic involving the component to detect anomalies, errors, or performance issues.
  6. Ethernet Port Settings

    • Validate the port speed and duplex settings to ensure compatibility and optimal performance.

Physical and Logical Port Checks

  1. Ethernet Port Statistics

    • Assess port statistics for errors, drops, or anomalies.

    • Check for CRC errors, pause frames, or traffic inconsistencies.

  2. Link LEDs

    • Ensure link LEDs indicate proper cable connection and signal status.
  3. Cable and Radio Connections

    • Verify cable integrity and connections, as well as the status of any connected radio devices.
  4. Traffic Counters

    • Confirm bidirectional traffic incrementing on both facing ports of the component/link.
  5. Port Resets

    • Check if ports have been administratively or physically reset and assess the impact.

Switch Configuration and Documentation

  1. Switch Port Configuration

    • Validate that switch ports are configured correctly, including VLAN assignments and descriptions.
  2. Disabled Ports

    • Identify ports disabled due to faults and investigate underlying causes.
  3. Port and Cable Labelling

    • Ensure all ports and cables are accurately labelled and correspond to network documentation.
  4. Physical Connections

    • Confirm physical connections align with the documented topology.
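
Comparing documentation against reality is essentially a set difference. The sketch below assumes both the documented and the observed connections are available as sets of (device, port, neighbour) tuples, for example from a diagram export and from neighbour discovery data. The function name, data shape, and device names are hypothetical.

```python
def compare_topology(documented, observed):
    """Compare two sets of (device, port, neighbour) tuples."""
    return {
        "missing": sorted(documented - observed),      # documented but not seen
        "unexpected": sorted(observed - documented),   # seen but not documented
    }


# Hypothetical documentation versus what was found on site.
documented = {("sw1", "ge-0/0/1", "rtr1"), ("sw1", "ge-0/0/2", "sw2")}
observed = {("sw1", "ge-0/0/1", "rtr1"), ("sw1", "ge-0/0/3", "sw2")}
print(compare_topology(documented, observed))
```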

Advanced Tests and POE Components

  1. Ethernet OAM on Last Mile Link

    • Perform Ethernet Operations, Administration, and Maintenance (OAM) to assess the health of the last-mile link.
  2. POE Functionality

    • Ensure all Power over Ethernet (POE) components are supplying power and functioning correctly.

By systematically working through this checklist, network teams can gain detailed insights into component-level issues, enabling more precise diagnoses and faster resolutions for outages or faults.

Detailed Checklist for Transport & Network Path Diagnostics

When diagnosing outages and faults in transport and network paths, a systematic approach to verifying physical, configuration, and operational elements can help identify the root cause. The checklist below outlines key checks to ensure all potential factors are examined.


Physical Layer Checks

  1. Cabling Visual Inspection

    • Conduct a visual check of the cabling to ensure there are no physical defects or irregularities.

    • Verify that photos of the cabling are available for documentation and comparison.

  2. Patch Cable Integrity

    • Examine patch cables for visible damage, such as cuts, fraying, or wear and tear.
  3. SFP/XFP Status

    • Inspect the port SFP/XFP modules for physical damage or loose connections.
  4. Fibre Pigtails

    • Ensure fibre pigtails are correctly connected, with RX linked to TX.

    • Check for half breaks or signs of wear in fibre pigtails.

  5. Link Length and Attenuation

    • Verify that the link lengths do not exceed the specified maximums for the cable type.

    • Test fibre attenuation to ensure it falls within allowable limits.

  6. Fibre Cleanliness

    • Confirm that fibre connectors have been recently cleaned to avoid signal loss due to debris.
  7. Signal Loss or Fibre Damage

    • Investigate suspected fibre damage or signal loss through diagnostic tools.

Configuration and Compatibility

  1. Link Testing

    • Perform tests to confirm link integrity and functionality.
  2. Frequency and Type Compatibility

    • Ensure links are connected at compatible frequencies and types.
  3. Rate Limit Compliance

    • Check that the configured rate limit aligns with the customer’s order.
  4. Management VLANs

    • Validate that management VLANs are correctly provisioned and mapped.

Traffic and Path Analysis

  1. Broadcast and Unicast Ratios

    • Assess the ratio of broadcasts to unicasts to detect abnormal traffic patterns.

    • Confirm that a broadcast filter has been configured where necessary.

  2. IP Address and Subnet Configuration

    • Verify the assignment of the correct IP address, subnet mask, and associated gateway.
  3. Ping and MTR Diagnostics

    • Test ping responses and traceroutes (MTRs) for packet loss, latency, and anomalies.

    • Check if pings with varying packet sizes fail, indicating potential MTU issues.

  4. Congestion Issues

    • Investigate congestion on both primary and backup transmission paths.
  5. Layer 2 Loops

    • Check for the presence of Layer 2 loops or related symptoms, such as MAC flapping.
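
MAC flapping, mentioned above as a loop symptom, can be spotted by looking for addresses that move between ports several times in a short period. The sketch below assumes you can extract (timestamp, MAC, port) learning events from switch logs or repeated table snapshots. The function name, the 10-second window, and the threshold of three moves are illustrative assumptions, not vendor defaults.

```python
from collections import defaultdict


def find_flapping_macs(events, window_s=10.0, min_moves=3):
    """Return MACs that changed port at least min_moves times within window_s."""
    move_times = defaultdict(list)   # MAC -> timestamps at which its port changed
    last_port = {}
    for timestamp, mac, port in sorted(events):
        if mac in last_port and last_port[mac] != port:
            move_times[mac].append(timestamp)
        last_port[mac] = port

    flapping = set()
    for mac, times in move_times.items():
        # Slide over the sorted move times looking for a dense cluster.
        for i in range(len(times) - min_moves + 1):
            if times[i + min_moves - 1] - times[i] <= window_s:
                flapping.add(mac)
                break
    return flapping


# Hypothetical events: the first MAC bounces between two ports, the second moves once.
events = [
    (0, "00:11:22:33:44:55", "port1"), (1, "00:11:22:33:44:55", "port2"),
    (2, "00:11:22:33:44:55", "port1"), (3, "00:11:22:33:44:55", "port2"),
    (0, "66:77:88:99:aa:bb", "port3"), (50, "66:77:88:99:aa:bb", "port4"),
]
print(find_flapping_macs(events))
```

A single move is normal when a device is re-patched or roams, which is why the sketch only reports repeated moves inside the window.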

Path Protection and Synchronisation

  1. Path Protection

    • Verify if path protection mechanisms are active and configured correctly.

    • Review logs for any path protection flaps and adjust configurations as needed.

  2. Asymmetrical Traffic

    • Identify and address any signs of asymmetrical traffic flow.
  3. Link Synchronisation

    • Ensure that all transport links are synchronised and stable.

Wireless Path-Specific Checks

  1. Radio Interference

    • Investigate external or self-interference in the radio transmission path.
  2. LOS (Line of Sight) Issues

    • Review photos and physical inspections for any visible LOS obstructions.

    • Ensure the expected throughput over the link is within budget.

  3. Bit Error Rate (BER)

    • Check for reported BER issues on radio links.

MTU and Alignment

  1. MTU Alignment

    • Confirm that all MTU settings along the transport path are aligned to avoid fragmentation.

By following the above checklist, network teams can methodically examine the physical and logical aspects of transport and network paths. This ensures a comprehensive understanding of potential faults, aiding in faster resolution and improved network reliability.
