Postmortem Report: Web Stack Outage
When we get sick, we visit a doctor, who diagnoses the problem and prescribes the medication or advice needed to overcome the illness. If the issue persists, we seek further help, and at that point the doctor may need to review our health and any related history. A postmortem report serves the same purpose for a software system (or really any IT system): it chronicles what happened, from the moment of failure until everything was working again.
Enjoy this read of a somewhat imagined incident report, written as a learner in the ALX_SE curriculum.
Issue Summary:
Duration: 3 hours 30 minutes, from 10:00 AM to 1:30 PM EAT on May 9, 2024.
Impact: The primary service affected was our web application, resulting in a complete outage for 30% of users and significant slowdowns for the remaining 70%.
Root Cause: A misconfiguration in the load balancer settings, leading to an imbalance in traffic distribution among backend servers.
Timeline:
10:00 AM: Issue detected through a sudden surge in error logs and user reports of the application being inaccessible.
10:15 AM: Engineering team alerted via the monitoring system (Datadog).
10:30 AM: Initial investigation focused on backend server health and database connectivity, assuming a potential database bottleneck.
11:00 AM: Misleading path: debugging efforts concentrated on database queries and optimization strategies, but no significant issues were found.
11:45 AM: Escalation to senior engineering team and network specialists due to lack of progress.
12:15 PM: Root cause identified as a misconfigured load balancer after thorough analysis of network traffic patterns.
12:50 PM: Load balancer settings adjusted to evenly distribute traffic among backend servers.
1:30 PM: Service fully restored, and performance stabilized.
Root Cause and Resolution:
Cause: The load balancer was misconfigured to favor specific backend servers, leading to uneven distribution of traffic and eventual overload.
Resolution: Load balancer settings were corrected to ensure equal distribution of incoming requests across all backend servers, restoring normal operation.
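The report doesn't name the load balancer or its balancing algorithm, so the snippet below is only a rough, hypothetical illustration: a toy weighted round-robin model showing how a skewed weight map (the kind of misconfiguration described above) piles most requests onto one backend, while equal weights spread them evenly. The server names and weights are made up.

```python
import itertools
from collections import Counter


def weighted_round_robin(weights: dict[str, int], total_requests: int) -> Counter:
    """Distribute requests across backends in proportion to their weights
    (a simplified model of a weighted round-robin load balancer)."""
    # Expand the weight map into a repeating schedule, e.g. {"a": 2, "b": 1} -> [a, a, b]
    schedule = [server for server, weight in weights.items() for _ in range(weight)]
    rotation = itertools.cycle(schedule)
    return Counter(next(rotation) for _ in range(total_requests))


# Hypothetical misconfiguration: web-01 is heavily favoured and eventually overloads.
skewed = weighted_round_robin({"web-01": 8, "web-02": 1, "web-03": 1}, 10_000)
print("Skewed weights:", dict(skewed))    # roughly 80% of requests hit web-01

# Corrected settings: equal weights restore an even distribution.
balanced = weighted_round_robin({"web-01": 1, "web-02": 1, "web-03": 1}, 10_000)
print("Equal weights: ", dict(balanced))  # roughly one third each
```

In the actual incident the fix was the configuration change itself; the model above only makes the traffic skew visible.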
Corrective and Preventative Measures:
Improvements:
Implement automated load balancer configuration checks to prevent future misconfigurations (a sketch follows this list).
Enhance monitoring systems to provide early warnings of traffic imbalances or unusual patterns.
Establish regular audits of critical infrastructure components to catch configuration discrepancies proactively.
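As a rough sketch of the first two improvements, the check below compares each backend's share of recent requests against an even split and emits a warning when the deviation exceeds a tolerance. The server names, counts, and threshold are hypothetical placeholders; in practice the numbers would come from the monitoring system (e.g. Datadog) or the load balancer's own statistics.

```python
def check_traffic_balance(request_counts: dict[str, int], tolerance: float = 0.15) -> list[str]:
    """Return warnings for backends whose share of traffic deviates from an
    even split by more than `tolerance` (0.15 = 15 percentage points)."""
    total = sum(request_counts.values())
    if total == 0:
        return []
    expected_share = 1 / len(request_counts)
    warnings = []
    for server, count in request_counts.items():
        share = count / total
        if abs(share - expected_share) > tolerance:
            warnings.append(
                f"{server} is handling {share:.0%} of traffic "
                f"(expected ~{expected_share:.0%})"
            )
    return warnings


# Hypothetical request counts pulled from monitoring for the last five minutes.
for alert in check_traffic_balance({"web-01": 8_100, "web-02": 950, "web-03": 950}):
    print("WARNING:", alert)
```

Run on a schedule or wired into an alerting rule, a check like this could have flagged the imbalance before users reported the outage.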
Tasks to Address the Issue:
Schedule a review of load balancer configurations by network specialists within the next week.
Develop and implement automated tests to validate load balancer behavior under various traffic scenarios (see the sketch after this list).
Conduct a comprehensive training session for all engineering teams on load balancer management and best practices.
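The automated-testing task could look something like the pytest sketch below. The `distribute` helper is a toy stand-in for the real load balancer (which this report doesn't identify), so treat it as an illustration of the kind of assertions such tests might make rather than the actual test suite.

```python
import itertools
from collections import Counter

import pytest

BACKENDS = ["web-01", "web-02", "web-03"]  # hypothetical backend pool


def distribute(weights: dict[str, int], requests: int) -> Counter:
    """Toy weighted round-robin used as a stand-in for the real load balancer."""
    schedule = [server for server, weight in weights.items() for _ in range(weight)]
    rotation = itertools.cycle(schedule)
    return Counter(next(rotation) for _ in range(requests))


@pytest.mark.parametrize("requests", [300, 3_000, 30_000])
def test_equal_weights_spread_traffic_evenly(requests):
    counts = distribute({server: 1 for server in BACKENDS}, requests)
    expected = requests / len(BACKENDS)
    for server in BACKENDS:
        # Every backend should receive roughly its fair share of requests.
        assert counts[server] == pytest.approx(expected, rel=0.05)


def test_skewed_weights_overload_one_backend():
    # The kind of weight skew behind this incident should fail a fairness check.
    counts = distribute({"web-01": 8, "web-02": 1, "web-03": 1}, 10_000)
    assert max(counts.values()) / 10_000 > 0.5
```

Running tests like these in CI would catch a weight regression before it reaches production.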
This postmortem outlines the incident's timeline, root cause analysis, resolution steps, and proposed corrective actions to prevent similar outages in the future. By implementing these measures, we aim to bolster the reliability and resilience of our web stack infrastructure.