Server Outage Incident Report

kash-dev

On 18 March 2023, we experienced an outage across all of our server infrastructure that left our clients unable to use our services. We sincerely apologize for the financial loss our clients incurred during this period.

Summary

On 18 March 2023 at 10:00 AM (GMT+1), we experienced a server outage (downtime) across all of our server infrastructure. The incident spanned 45 minutes from the start of the upgrade at 9:45 AM to full recovery at 10:30 AM, with the outage itself running from 10:00 AM to 10:30 AM. During this time, our clients received HTTP 500 errors and could not access our services, a 100% impact on their business. The root cause was that the implemented upgrades were pushed to production servers without first being properly tested.

Timeline

| Time (GMT+1) | Action Taken                      |
|--------------|-----------------------------------|
| 9:45 AM      | Upgrade implementation begins     |
| 10:00 AM     | Server outage begins              |
| 10:00 AM     | Pagers alerted the on-call team   |
| 10:10 AM     | On-call team acknowledgement      |
| 10:15 AM     | Rollback initiated                |
| 10:20 AM     | Successful rollback               |
| 10:25 AM     | Server restart initiated          |
| 10:30 AM     | 100% of traffic back online       |
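
The response intervals implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the table above, while the event keys and the `minutes_between` helper are names made up for this example.

```python
from datetime import datetime

# Timestamps copied from the timeline (GMT+1, 18 March 2023).
# The dictionary keys are hypothetical labels chosen for this sketch.
TIMELINE = {
    "upgrade_started": "09:45",
    "outage_started": "10:00",
    "pager_alerted": "10:00",
    "acknowledged": "10:10",
    "rollback_started": "10:15",
    "rollback_done": "10:20",
    "restart_started": "10:25",
    "traffic_restored": "10:30",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two same-day timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("outage_started", "pager_alerted")
time_to_acknowledge = minutes_between("pager_alerted", "acknowledged")
time_to_restore = minutes_between("outage_started", "traffic_restored")
incident_span = minutes_between("upgrade_started", "traffic_restored")

print(time_to_detect, time_to_acknowledge, time_to_restore, incident_span)
```

With these timestamps, paging was immediate, acknowledgement took 10 minutes, service was restored 30 minutes after the outage began, and the whole incident spanned 45 minutes from the start of the upgrade.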

Root Cause

At 9:45 AM (GMT+1), a server upgrade was initiated across all of our production servers without first being released to our test environments and put through all the necessary unit testing. Part of the upgrade shipped to the production servers required authentication from third-party software, and this new implementation was not supported by the version installed on our servers at the time, which caused the downtime. We resolved this quickly by first rolling the servers back to their previous state and then upgrading the version installed on our servers.
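
A version mismatch of this kind could, in principle, be caught by a pre-deployment check that refuses to proceed when the installed version is older than what the upgrade needs. The sketch below is a minimal illustration under assumptions: the function names are hypothetical, the version strings are made up (the actual versions involved are not stated here), and it assumes plain dotted numeric versions.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '2.4.1' into (2, 4, 1)."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(installed: str, minimum_required: str) -> bool:
    """True when the installed version is at least the required minimum."""
    have = parse_version(installed)
    need = parse_version(minimum_required)
    # Pad the shorter tuple with zeros so '2.4' and '2.4.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def preflight(installed: str, minimum_required: str) -> None:
    """Stop the rollout before production is touched if the version is too old."""
    if not is_compatible(installed, minimum_required):
        raise RuntimeError(
            f"installed version {installed} is older than required "
            f"{minimum_required}; upgrade it before deploying"
        )


# Hypothetical versions, for illustration only.
preflight("2.4.0", "2.4")  # passes silently
```

Running `preflight("1.9.3", "2.0")` would raise a `RuntimeError`, which a deployment script could treat as a signal to abort.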

Preventive measures

  1. Tune the performance-metric alert thresholds so that on-call engineers are alerted in the event of a possible server crash.

  2. Push all intended changes to our test environments first, before shipping them to the live servers.
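
The two measures above can be sketched roughly as follows. This is a hedged illustration, not a description of our actual tooling: `should_page`, `promote`, the 5% error-rate threshold, and the environment names are all assumptions introduced for the example.

```python
from typing import Callable, Sequence


def should_page(status_codes: Sequence[int], max_error_rate: float = 0.05) -> bool:
    """Page on-call when the share of HTTP 5xx responses exceeds the threshold.

    The 5% default is an assumed value, not a measured one.
    """
    if not status_codes:
        return False
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes) > max_error_rate


def promote(deploy: Callable[[str], None], checks_pass: Callable[[str], bool]) -> str:
    """Deploy to the test environment first; reach production only if checks pass."""
    deploy("test")
    if not checks_pass("test"):
        return "stopped at test"
    deploy("production")
    return "released to production"


# Usage: a failing check keeps the change away from production.
deployed = []
outcome = promote(deployed.append, lambda env: False)
print(outcome, deployed)
```

In this sketch the production deployment is unreachable unless the test-environment checks return true, which is the ordering the second measure asks for.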
