Server Outage Incident Report
On 18th March 2023, we experienced an outage across all of our server infrastructure that left our clients unable to use our services. We sincerely apologize for the financial loss our clients incurred during this period.
Summary
On 18th March 2023 at 10:00 AM (GMT+1), we experienced an outage (downtime) across all of our server infrastructure. According to the timeline in this report, the outage itself lasted 30 minutes, from 10:00 AM until all traffic was restored at 10:30 AM; the full incident window, counted from the start of the upgrade at 9:45 AM, was 45 minutes. During the outage, our clients received HTTP 500 errors and were completely unable to access our services (100% impact on their business). The root cause was that the upgrades were pushed to the production servers without first being properly tested.
Timeline
| Time (GMT+1) | Action Taken |
| --- | --- |
| 9:45 AM | Upgrade rollout begins |
| 10:00 AM | Server outage begins |
| 10:00 AM | Pagers alert the on-call team |
| 10:10 AM | On-call team acknowledges the alert |
| 10:15 AM | Rollback initiated |
| 10:20 AM | Rollback completed successfully |
| 10:25 AM | Server restart initiated |
| 10:30 AM | 100% of traffic back online |
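The intervals implied by the timeline can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys (`upgrade_start`, `outage_start`, and so on) are labels chosen for this example, while the clock times are copied from the table above.

```python
from datetime import datetime

# Clock times copied from the timeline (GMT+1, 18 March 2023).
# The key names are illustrative labels, not terms used by any tooling.
EVENTS = {
    "upgrade_start": "9:45 AM",
    "outage_start": "10:00 AM",
    "acknowledged": "10:10 AM",
    "rollback_done": "10:20 AM",
    "traffic_restored": "10:30 AM",
}


def parse(clock: str) -> datetime:
    """Parse a 12-hour clock time on the day of the incident."""
    return datetime.strptime(f"2023-03-18 {clock}", "%Y-%m-%d %I:%M %p")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two clock times on the same day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


time_to_acknowledge = minutes_between(EVENTS["outage_start"], EVENTS["acknowledged"])
outage_duration = minutes_between(EVENTS["outage_start"], EVENTS["traffic_restored"])
incident_window = minutes_between(EVENTS["upgrade_start"], EVENTS["traffic_restored"])

print(time_to_acknowledge)  # 10
print(outage_duration)      # 30
print(incident_window)      # 45
```

Measured this way, the on-call team acknowledged the page 10 minutes after the outage began, and traffic was fully restored 30 minutes after it began.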
Root Cause
At 9:45 AM (GMT+1), a server upgrade was rolled out to all of our production servers without first being released to our test environments and put through all the necessary testing. Part of the upgrade required authentication through third-party software, and this new implementation is not supported by the version that was installed on our servers at the time, which caused the downtime. We resolved the issue by first rolling the servers back to their previous state and then upgrading the version installed on our servers.
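A version mismatch of this kind can often be caught before deployment with a simple preflight check. The following is a minimal sketch under stated assumptions, not the check we actually run: the server names, the version numbers, and the helpers `parse_version`, `supports_upgrade`, and `preflight_check` are all hypothetical, since this report does not record the actual versions involved.

```python
from __future__ import annotations


def parse_version(text: str, width: int = 3) -> tuple[int, ...]:
    """Turn a dotted version such as '2.4' into a padded tuple like (2, 4, 0)."""
    parts = [int(part) for part in text.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def supports_upgrade(installed: str, minimum_required: str) -> bool:
    """True when the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum_required)


def preflight_check(servers: dict[str, str], minimum_required: str) -> list[str]:
    """Return the names of servers whose installed version is too old."""
    return sorted(
        name
        for name, installed in servers.items()
        if not supports_upgrade(installed, minimum_required)
    )


# Hypothetical inventory: server name -> installed version.
inventory = {"web-1": "1.9.0", "web-2": "2.1.0", "web-3": "2.0"}

blocked = preflight_check(inventory, minimum_required="2.0.0")
print(blocked)  # ['web-1']
```

If the returned list is not empty, the rollout would stop before any production server is touched. The comparison only handles purely numeric dotted versions; suffixes such as `-rc1` would need a proper version parser.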
Preventive Measures
Tighten the alerting thresholds on our performance metrics so that on-call engineers are alerted as soon as a possible server crash is detected.
Push all intended changes to our test environments first, before shipping them to the live servers.
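Both measures can be expressed as small automated checks. The sketch below is one possible shape for them, written purely as an illustration: the `StageResult` type, the `should_page` and `may_promote` functions, and the 1% error-rate threshold are assumptions made for this example and do not describe our actual monitoring or deployment tooling.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: treat more than 1% failed requests as unhealthy.
MAX_ERROR_RATE = 0.01


@dataclass
class StageResult:
    environment: str    # where the change was exercised, e.g. "test"
    tests_passed: bool  # outcome of the test suite in that environment
    error_rate: float   # fraction of requests that returned HTTP 5xx


def should_page(error_rate: float, threshold: float = MAX_ERROR_RATE) -> bool:
    """Alert the on-call engineer once the error rate exceeds the threshold."""
    return error_rate > threshold


def may_promote(test_result: Optional[StageResult]) -> bool:
    """Allow a production release only after a healthy run in the test environment."""
    if test_result is None:
        return False  # the change never went through the test environment
    if test_result.environment != "test":
        return False  # results from any other environment do not count
    return test_result.tests_passed and not should_page(test_result.error_rate)


print(may_promote(None))                                # False
print(may_promote(StageResult("test", True, 0.0)))      # True
print(may_promote(StageResult("test", True, 0.25)))     # False
```

With a gate like this in the release pipeline, a change that skipped the test environment, as happened in this incident, would be refused before reaching the live servers.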
Written by kash-dev