Server Outage Incident Report

By Egbinola Oluwakemi

On 7 January 2024, we experienced an outage across all of our server infrastructure, which left our clients unable to use our services. We sincerely apologize for the financial loss our clients incurred during this period.

Issue Summary

On 7 January 2024, we experienced a server outage (downtime) across all of our server infrastructure that lasted for 45 minutes. During the outage, our clients received HTTP 500 errors, and the impact on their business was 100%, as they were unable to access our services. The root cause was that the implemented upgrades were not properly tested before being pushed to the production servers.

Time (GMT + 1) Actions

9:45 AM Upgrade implementation begins

10:00 AM Server outage begins

10:00 AM Pagers alert the on-call team

10:10 AM On-call team acknowledges the alert

10:15 AM Rollback initiated

10:20 AM Rollback completed successfully

10:20 AM Server restart initiated

10:32 AM 100% of traffic back online
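The intervals between these events can be worked out directly from the timeline. The short Python sketch below does so, assuming that the upgrade started at 9:45 AM (as the root cause section states) and that every event falls on the same day. The event labels and function name are illustrative only and do not come from any tooling used during the incident.

```python
from datetime import datetime

# Timeline entries copied from the report (GMT + 1, 7 January 2024).
# Assumption: the upgrade started at 9:45 AM, as stated in the root cause section.
TIMELINE = {
    "upgrade_started": "09:45",
    "outage_started": "10:00",
    "on_call_acknowledged": "10:10",
    "rollback_started": "10:15",
    "rollback_finished": "10:20",
    "traffic_restored": "10:32",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_acknowledge = minutes_between(
    TIMELINE["outage_started"], TIMELINE["on_call_acknowledged"]
)
time_to_restore = minutes_between(
    TIMELINE["outage_started"], TIMELINE["traffic_restored"]
)
upgrade_to_restore = minutes_between(
    TIMELINE["upgrade_started"], TIMELINE["traffic_restored"]
)

print(f"Time to acknowledge: {time_to_acknowledge} min")
print(f"Outage start to full recovery: {time_to_restore} min")
print(f"Upgrade start to full recovery: {upgrade_to_restore} min")
```

With these timestamps, the on-call team acknowledged the page 10 minutes after the outage began, and full traffic was restored 32 minutes after the outage began (47 minutes after the upgrade started).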

Root cause

At 9:45 AM (GMT + 1), a server upgrade was initiated across all of our production servers without first being released to our test environments and put through all the necessary unit testing. Part of the upgrade being shipped to the production servers required authentication from a third-party software. This new implementation was not supported by the version installed on our servers at the time, which resulted in the downtime. We were able to resolve this quickly by first rolling the servers back to their previous state and then upgrading the version installed on our servers.
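A version mismatch of this kind could, in principle, be caught by a pre-deployment check that compares the version installed on a server with the minimum version an upgrade needs. The following is a minimal Python sketch of that idea, not the check we actually run. The version numbers and function names are hypothetical, since this report does not name the third-party software or its versions.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '2.4.1' into (2, 4, 1).

    Comparing integer tuples avoids the string-comparison trap
    where '1.10' would sort before '1.9'.
    Assumption: both versions use the same number of numeric components.
    """
    return tuple(int(part) for part in version.split("."))


def is_supported(installed: str, minimum_required: str) -> bool:
    """Return True if the installed version meets the upgrade's minimum."""
    return parse_version(installed) >= parse_version(minimum_required)


# Hypothetical values, for illustration only.
INSTALLED_VERSION = "1.8.0"
MINIMUM_REQUIRED_VERSION = "2.0.0"

if is_supported(INSTALLED_VERSION, MINIMUM_REQUIRED_VERSION):
    print("Compatibility check passed: safe to continue the rollout.")
else:
    print(
        f"Blocked: installed version {INSTALLED_VERSION} is older than "
        f"the required {MINIMUM_REQUIRED_VERSION}. Upgrade it first."
    )
```

Running a check like this in the test environment, before the production rollout, would surface the unsupported version without affecting client traffic.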

Preventive measures

Push all intended changes to our test environments first, before shipping them to the live servers.

Tune the performance metrics alert thresholds so that on-call engineers are alerted in the event of a possible server crash.
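The two measures above can be sketched together as a simple promotion gate: a change reaches the live servers only if it has passed testing in the test environment and the error rate observed there stays below an alert threshold. This is a hedged illustration in Python. The `StageResult` type, the function names, and the 5% threshold are all assumptions made for the example, not values taken from our monitoring setup.

```python
from dataclasses import dataclass

# Hypothetical threshold: alert when 5% or more of requests return HTTP 5xx.
ERROR_RATE_ALERT_THRESHOLD = 0.05


@dataclass
class StageResult:
    """Outcome of deploying a change to one environment (illustrative type)."""

    name: str
    tests_passed: bool
    error_rate: float  # fraction of requests answered with an HTTP 5xx status


def should_alert(error_rate: float,
                 threshold: float = ERROR_RATE_ALERT_THRESHOLD) -> bool:
    """Return True when the error rate is high enough to page the on-call team."""
    return error_rate >= threshold


def can_promote_to_production(staging: StageResult) -> bool:
    """Allow a production rollout only after a clean run in the test environment."""
    return staging.tests_passed and not should_alert(staging.error_rate)


# Example: a change that fails in the test environment is never promoted.
broken_change = StageResult(name="test", tests_passed=False, error_rate=1.0)
healthy_change = StageResult(name="test", tests_passed=True, error_rate=0.0)

print(can_promote_to_production(broken_change))
print(can_promote_to_production(healthy_change))
```

In this sketch the same threshold serves both purposes: it pages the on-call engineers when crossed in production and blocks promotion when crossed in the test environment. Separate thresholds per environment would be an equally reasonable design.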
