My first postmortem (Incident Report)
Table of contents
- Issue Summary:
- Root Cause:
- Timeline:
- Root Cause and Resolution:
- Corrective and Preventative Measures:
- Tasks to address the issue:
Any software system will eventually fail, and that failure can stem from a wide range of possible factors: bugs, traffic spikes, security issues, hardware failures, natural disasters, human error… Failing is normal, and failing is actually a great opportunity to learn and improve. Any great software engineer must learn from their mistakes to ensure they won't happen again. Failing is fine, but failing twice because of the same issue is not.
A postmortem is a tool widely used in the tech industry. A project postmortem is a process used to identify the causes of a project failure (or significant business-impairing downtime) and how to prevent them in the future.
After any outage, the team(s) in charge of the system will write a summary that has two main goals:
- To give the rest of the company's employees easy access to information detailing the cause of the outage. Outages can have a significant impact on a company, so managers and executives have to understand what happened and how it will affect their work.
- To ensure that the root cause(s) of the outage have been discovered and that measures are taken to fix them.
Below are the requirements for writing a postmortem.
Requirements:
Issue Summary (this is often what executives will read) must contain:
- the duration of the outage, with start and end times (including timezone)
- the impact (what service was down or slow? What were users experiencing? What percentage of users were affected?)
- the root cause
Timeline (bullet-point format: each entry starts with a time and is kept short, 1 or 2 sentences) must contain:
- when the issue was detected
- how the issue was detected (monitoring alert, an engineer noticed something, a customer complained…)
- actions taken (what parts of the system were investigated, what the assumptions about the root cause were)
- misleading investigation/debugging paths that were taken
- which team/individuals the incident was escalated to
- how the incident was resolved
Root cause and resolution must contain:
- a detailed explanation of what was causing the issue
- a detailed explanation of how the issue was fixed
Corrective and preventative measures must contain:
- what can be improved/fixed, broadly speaking
- a list of tasks to address the issue (be very specific, like a TODO; for example: patch the Nginx server, add monitoring on server memory…)
Be brief and straight to the point: the whole report should be between 400 and 600 words.
With these in mind, we are ready to write a postmortem. Here is how I wrote mine 😏
Issue Summary:
On May 5, 2023, from 2:00 PM to 4:30 PM (EST), the web server of our e-commerce website experienced an outage, resulting in slow response times and error messages for some users. Approximately 30% of our customers were affected by this outage.
Root Cause:
The root cause of the outage was identified as a sudden surge in traffic caused by a marketing campaign that was launched without proper load testing. As a result, the web server was overwhelmed and could not handle the incoming requests, causing it to slow down and eventually crash.
Timeline:
- 2:00 PM - The issue was detected when the monitoring system alerted the operations team of an increase in response times and error rates.
- 2:05 PM - The operations team investigated the web server logs and identified a sudden surge in traffic.
- 2:10 PM - The team assumed that the issue was caused by a DDoS attack and initiated measures to mitigate the attack.
- 2:30 PM - After analyzing the network traffic, the team realized that the surge in traffic was caused by a marketing campaign that was launched without proper load testing.
- 3:00 PM - The team attempted to optimize the server configurations to handle the increased traffic but was unsuccessful.
- 3:30 PM - The team decided to scale up the web server infrastructure by adding additional servers to handle the traffic.
- 4:00 PM - The newly added servers were configured and deployed to handle the traffic.
- 4:30 PM - The web server infrastructure was fully operational, and the issue was resolved.
Root Cause and Resolution:
The root cause of the issue was a sudden surge in traffic caused by a marketing campaign that was launched without proper load testing. To resolve the issue, the team scaled up the web server infrastructure by adding additional servers to handle the traffic.
Corrective and Preventative Measures:
To prevent a similar outage from happening in the future, the following measures will be implemented:
- Load testing will be performed before any new marketing campaign or promotion is launched (a sketch of such a check follows this list).
- The web server infrastructure will be optimized for scalability and elasticity to handle sudden traffic surges.
- The monitoring system will be enhanced to detect traffic surges and alert the operations team promptly.
- The team will establish a standard incident response plan to address similar issues quickly and effectively.
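For illustration, here is a minimal sketch of what such a pre-launch load-test check could look like. Everything in it is hypothetical: the URL, request count, and thresholds are made up, and a real campaign would more likely use a dedicated tool such as Locust or k6.

```python
import concurrent.futures
import time

import requests

TARGET_URL = "https://staging.example.com/"  # hypothetical staging endpoint
TOTAL_REQUESTS = 500     # simulated campaign traffic (made-up number)
CONCURRENCY = 50         # simulated concurrent users (made-up number)
MAX_ERROR_RATE = 0.01    # fail if more than 1% of requests error out
MAX_P95_SECONDS = 0.5    # fail if the 95th-percentile latency is too high


def hit(url: str) -> tuple[bool, float]:
    """Send one GET request and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        # Treat 5xx responses and network errors as failures.
        return resp.status_code < 500, time.monotonic() - start
    except requests.RequestException:
        return False, time.monotonic() - start


def load_test() -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit, [TARGET_URL] * TOTAL_REQUESTS))

    latencies = sorted(elapsed for _, elapsed in results)
    error_rate = sum(1 for ok, _ in results if not ok) / len(results)
    p95 = latencies[int(len(latencies) * 0.95)]

    print(f"error rate: {error_rate:.1%}, p95 latency: {p95:.3f}s")
    if error_rate > MAX_ERROR_RATE or p95 > MAX_P95_SECONDS:
        raise SystemExit("Load test failed: do not launch the campaign yet.")


if __name__ == "__main__":
    load_test()
```

The point is simply to fail loudly before launch when the error rate or tail latency exceeds a budget, instead of discovering the limit in production, as we did.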
Tasks to address the issue:
- Perform load testing before any new marketing campaign or promotion is launched.
- Optimize the web server infrastructure for scalability and elasticity.
- Enhance the monitoring system to detect traffic surges and alert the operations team promptly (see the sketch after this list).
- Establish a standard incident response plan to address similar issues quickly and effectively.
- Train the team on incident response and load testing best practices.
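And as a rough sketch of the monitoring task, here is what a simple traffic-surge check could look like. This is an assumption-heavy illustration: the per-minute request counts would really come from your metrics backend (Prometheus, CloudWatch…), and the surge factor is an arbitrary choice.

```python
import statistics

SURGE_FACTOR = 3.0  # alert when traffic triples the recent baseline (arbitrary)


def detect_traffic_surge(requests_per_minute: list[float]) -> str | None:
    """Given per-minute request counts (oldest first), return an alert
    message when the latest minute far exceeds the recent baseline."""
    if len(requests_per_minute) < 2:
        return None  # not enough history to establish a baseline

    baseline = statistics.median(requests_per_minute[:-1])
    current = requests_per_minute[-1]

    if baseline > 0 and current > SURGE_FACTOR * baseline:
        # In production this branch would page the on-call engineer
        # (PagerDuty, a Slack webhook, ...) instead of returning a string.
        return (f"Traffic surge: {current:.0f} req/min vs a baseline of "
                f"{baseline:.0f} req/min")
    return None


# Example with made-up numbers: a steady ~100 req/min, then a spike.
if __name__ == "__main__":
    history = [100.0] * 59 + [420.0]
    print(detect_traffic_surge(history) or "traffic looks normal")
```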
In conclusion, the outage that occurred on our e-commerce website was caused by a sudden surge in traffic due to a marketing campaign that was launched without proper load testing. The issue was resolved by scaling up the web server infrastructure. To prevent similar outages in the future, we will implement measures such as load testing, optimizing the infrastructure, and enhancing the monitoring system.
I hope this helps you get better at writing a good incident report as a Software Engineer 👍