Analyzing A Server Failure: A Postmortem Case Study

A postmortem is a valuable tool widely used in the tech industry to retrospectively examine failures. It aids in identifying and addressing issues with assets, systems, or technology platforms. Postmortems are not only common in maintenance but also find applications in software development and design.

To illustrate the concept, we'll walk through a case study of a website shutdown and how a postmortem can be effectively conducted.

Issue Summary:

On May 10, 2022, from 2:30 p.m. to 5:00 p.m. (WAT), our e-commerce website experienced a total shutdown. Throughout this timeframe, users could not access the site since it became unresponsive.

Impact:

The outage affected all services provided by the website, including product listings, the shopping cart function, and the checkout process. Approximately 80% of our users encountered error messages or unresponsive pages, highlighting the significant impact of the outage.

Root Cause:

The server failure resulted from a memory leak within our web application. The leak caused the server to become overloaded and unresponsive, ultimately leading to a complete website shutdown.

Incident Timeline:

  • 2:30 p.m.: Monitoring systems alerted the operations team to the issue.

  • 2:35 p.m.: An attempt to restart the web application server failed.

  • 2:40 p.m.: Investigation of a potential server configuration problem began.

  • 3:00 p.m.: Elevated memory usage was observed; a memory leak was suspected.

  • 3:15 p.m.: Examination of the application code for potential causes began.

  • 3:45 p.m.: The memory leak in the code was identified and a fix was initiated.

  • 4:30 p.m.: The fix was deployed and the server was restarted.

  • 4:45 p.m.: Full website functionality was restored.

Misleading Investigation/Debugging Paths: Initially attributing the problem to server configuration delayed identification of the root cause, which lay in the application code.

Incident Escalation:

The incident was initially handled by the operations team and was later escalated to the development team once the application code was identified as the source of the problem.

Resolution:

The memory leak in the web application code was identified and fixed. The fix involved optimizing the code and applying memory management best practices. After the patch was deployed, the web application server was restarted, restoring full website functionality.

Preventative Measures:

In order to prevent future occurrences, the following steps were agreed upon:

  • Regular code reviews to catch potential memory leaks.

  • Implementation of robust testing procedures to detect memory leaks before they reach production.

  • Enhanced monitoring of server performance and resource usage.

  • Improved documentation and training for the operations team to handle similar incidents.

Specific tasks to address the issue include:

  • A comprehensive review of the web application code.

  • Integration of automated tests for memory leak detection.

  • Enhancement of monitoring tools with more granular resource usage data.

  • Additional training workshops for operations team members on troubleshooting web application issues.
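The automated-test task could take the shape of a regression check that fails when repeated calls to a handler keep net-allocating memory. A minimal sketch using only the standard library; `process` is a hypothetical stand-in for the real handler, which a real test would import from the application:

```python
import gc
import tracemalloc

def process(payload):
    # Stand-in for the request handler under test.
    return payload.upper()

def assert_no_leak(func, arg, iterations=1_000, tolerance=64 * 1024):
    """Fail if repeated calls to func keep net-allocating memory."""
    gc.collect()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        func(arg)
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    leaked = after - before
    assert leaked < tolerance, f"leaked {leaked} bytes over {iterations} calls"

assert_no_leak(process, "payload")  # passes: nothing is retained
```

The tolerance absorbs interpreter noise; a handler that retains a reference per call blows past it quickly, so the leak is caught in CI rather than in production.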

By adhering to these principles and taking proactive steps, we can effectively manage and prevent similar server failures in the future.

Written by

Ochagla Samson Adakole

I am a learning-in-progress developer. I love tech, and I am also an aircraft maintenance engineering student. I would love to merge new technologies with the aviation industry.