Web Stack Debugging #3: Outage Incident Report


Issue Summary
From 3:02 AM to 8:05 AM GMT on 20 January 2023, all requests to the Apache server hosting our company’s website returned a 500 Internal Server Error. User-facing impact became significant from about 5:13 AM, and most morning rush-hour requests, which usually arrive between 5:30 AM and 9:30 AM, were affected. The root cause of this outage was a typo in the Apache configuration, introduced during an update pushed at 3:00 AM the same day.
Timeline (all times GMT)
3:00 AM: Weekly configuration and content update push begins.
3:02 AM: Outage begins.
5:15 AM: Customer complaints reporting the outage received via the help desk line.
7:40 AM: First rollback of the configuration and content update fails.
7:50 AM: Configuration and content update rolled back successfully.
8:05 AM: 100% of traffic restored; the website is fully operational.
Root Cause
The outage was caused by a typo in the Apache configuration. During the weekly configuration and content update push at 3:00 AM on 20 January 2023, a file path in the Apache configuration was mistyped. As a result, Apache could not locate the files it needed to serve requests and returned a 500 Internal Server Error for every request, disrupting service through the morning rush hours. The issue was resolved by rolling back the misconfigured update at 7:50 AM, and the website was fully functional again by 8:05 AM.
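For illustration only, the sketch below shows how this failure mode could be confirmed from a shell: request the site and check the status code, then scan the Apache error log for missing-file entries, the typical signature of a mistyped path in the configuration. The URL and log path are assumptions for the example, not details taken from the incident itself.

```python
#!/usr/bin/env python3
"""Diagnostic sketch (illustrative only): confirm the 500 responses and
look for missing-file entries in the Apache error log.

Assumptions, not facts from the incident: the site answers on
http://localhost/ and the error log lives at /var/log/apache2/error.log.
"""
import urllib.error
import urllib.request

SITE_URL = "http://localhost/"            # assumed URL
ERROR_LOG = "/var/log/apache2/error.log"  # assumed Debian/Ubuntu log path


def check_status(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def missing_file_errors(log_path: str, limit: int = 5) -> list:
    """Return the last few error-log lines about files Apache could not
    find -- the usual sign of a mistyped path in the configuration."""
    needles = ("File does not exist", "No such file or directory")
    with open(log_path, errors="replace") as log:
        hits = [line.rstrip() for line in log if any(n in line for n in needles)]
    return hits[-limit:]


if __name__ == "__main__":
    print(f"HTTP status for {SITE_URL}: {check_status(SITE_URL)}")
    for entry in missing_file_errors(ERROR_LOG):
        print(entry)
```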
Resolution and Recovery
1. Identified Root Cause (after 5:15 AM): Following the customer reports received via the help desk line, the IT team investigated the 500 Internal Server Error and identified the typo in the Apache configuration as the root issue.
2. Failed Rollback Attempt (7:40 AM): The initial attempt to roll back the configuration update was unsuccessful, causing a brief delay in resolving the issue.
3. Successful Rollback (7:50 AM): A successful rollback was executed, restoring the Apache configuration to the state prior to the erroneous update. This action successfully eliminated the typo and its impact.
4. Monitoring and Verification (Ongoing): The system was continuously monitored to confirm the restoration of full functionality (a minimal polling sketch is shown after this list). By 8:05 AM, the website had returned to 100% online status.
5. Communication with Customers (from 5:15 AM): Throughout the incident, regular updates were communicated to customers via the help desk line, keeping them informed about the ongoing situation and the steps being taken to resolve it.
6. Post-Incident Analysis (Post-Recovery): A comprehensive post-incident analysis was initiated to understand the factors leading to the typo, assess the impact on users and the business, and identify areas for improvement.
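As a rough illustration of the kind of check described in step 4, the sketch below polls the site until it returns HTTP 200 again and reports when recovery is confirmed. The URL and polling interval are placeholders; this is not the monitoring tooling actually used during the incident.

```python
#!/usr/bin/env python3
"""Recovery-monitoring sketch (illustrative only): poll the site until it
answers with HTTP 200 again and report when that happens.

The URL and polling interval are placeholders, not the monitoring that was
actually used during the incident.
"""
import time
import urllib.error
import urllib.request

SITE_URL = "http://localhost/"  # assumed URL
INTERVAL_SECONDS = 30           # assumed polling interval


def current_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return 0         # connection refused, DNS failure, timeout, etc.


if __name__ == "__main__":
    while True:
        status = current_status(SITE_URL)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp}  {SITE_URL} -> {status}")
        if status == 200:
            print(f"{stamp}  Site is serving 200 OK again; recovery confirmed.")
            break
        time.sleep(INTERVAL_SECONDS)
```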
Corrective and Preventative Measures
1. Regular Configuration Audits: Implement regular audits of configuration settings, particularly after updates, to catch potential typos or errors in the early stages.
2. Automated Configuration Validation: Explore the integration of automated tools or scripts to validate configuration changes before deployment, ensuring syntax correctness and catching common errors such as mistyped file paths (see the validation sketch after this list).
3. Rollback Procedures Review: Regularly review and test rollback procedures to minimize the time taken to revert to a stable configuration in case of issues.
4. Change Management Protocols: Strengthen change management protocols to include peer reviews of configuration changes and require approvals before deployment to reduce the likelihood of human errors.
5. Documentation Updates: Maintain comprehensive and up-to-date documentation for configuration settings. Ensure that any changes made are accurately documented to prevent discrepancies.
6. Training and Awareness: Provide ongoing training and awareness programs for the IT team, emphasizing the importance of meticulous configuration changes and the potential impact of errors.
7. Incident Response Plan Enhancement: Enhance the incident response plan to include specific procedures for handling configuration-related issues, ensuring a swift and effective response.
8. Communication Protocols: Establish clear communication protocols for notifying customers about service disruptions, including regular updates on the progress of issue resolution.
9. Continuous Improvement: Foster a culture of continuous improvement, encouraging the team to learn from incidents and implement changes to prevent similar issues in the future.
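To make measure 2 more concrete, here is a minimal sketch of what pre-deployment validation could look like: it runs Apache's built-in syntax check (apachectl configtest) and then verifies that files referenced by common path directives actually exist, the class of typo a pure syntax check will not catch. The configuration location and the list of directives are assumptions for illustration, not the tool the team adopted.

```python
#!/usr/bin/env python3
"""Pre-deployment configuration check, sketched for illustration (not the
team's actual tooling): run Apache's built-in syntax test, then verify that
files referenced by common path directives exist -- the kind of typo a plain
syntax check will not catch.

Assumptions: apachectl is on PATH and the configs live under /etc/apache2.
"""
import re
import subprocess
import sys
from pathlib import Path

CONFIG_DIR = Path("/etc/apache2")  # assumed configuration root
PATH_DIRECTIVES = ("DocumentRoot", "Include", "SSLCertificateFile", "SSLCertificateKeyFile")


def syntax_ok() -> bool:
    """Return True when `apachectl configtest` reports a clean syntax check."""
    result = subprocess.run(["apachectl", "configtest"],
                            capture_output=True, text=True)
    return result.returncode == 0


def dangling_paths(config_dir: Path) -> list:
    """List directive values that look like absolute paths but do not exist."""
    pattern = re.compile(r'^\s*({})\s+"?(/[^\s"]+)'.format("|".join(PATH_DIRECTIVES)))
    problems = []
    for conf in config_dir.rglob("*.conf"):
        for lineno, line in enumerate(conf.read_text(errors="replace").splitlines(), 1):
            match = pattern.match(line)
            if match and not Path(match.group(2)).exists():
                problems.append(f"{conf}:{lineno}: {match.group(1)} -> missing {match.group(2)}")
    return problems


if __name__ == "__main__":
    issues = dangling_paths(CONFIG_DIR)
    if not syntax_ok() or issues:
        print("Configuration check failed:")
        print("\n".join(issues))
        sys.exit(1)
    print("Configuration looks sane; safe to proceed with the deployment.")
```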
By implementing these measures, the company aims to not only address the immediate incident but also fortify its systems against similar challenges in the future, fostering a more resilient and proactive IT environment.