On 18th March 2023, we experienced a server outage on all our server infrastructure which resulted in our clients' inability to use our services and we sincerely apologize for the financial loss our clients have incurred during this period.

Summary

On 18th March 2023 (10 am GMT + 1), we experienced a server outage (downtime) on all of our server infrastructure which lasted for 45 minutes. As a result of this, our clients experienced an HTTP 500 error which had a 100% impact on their business as they were unable to access our services. The root cause was not properly testing out all implemented upgrades before pushing them to production servers.

Timelines

Time (GMT + 1)	Action Taken
9:45 AM	Upgrades implementation begins
10:00 AM	Server Outage begins
10:00 AM	Pagers alerted the on-call team
10:10 AM	On-call team acknowledgement
10:15 AM	Rollback initiation begins
10:20 AM	Successful rollback
10:25 AM	Server restart initiated
10:30 AM	100% of traffic back online

Root Cause

At 9:45 am (GMT + 1) server upgrade was initiated across all our production servers without first releasing on our test environments and performing all necessary unit testing. Part of the upgrade shipped to the production server required authentication from a 3rd party software, this new implementation is not supported on the current version present on our servers which resulted in the downtime experienced. We were able to resolve this quickly by first performing a rollback of the server's previous state and thereafter upgrading the current version on our servers.

Preventive measures

Increase the performance metrics threshold to alert on-call engineers in the event of a possible server crash
Pushing all intended changes 1st to our test environments before shipping to the life server.

Server Outage Incident report

Table of contents

Timelines

Preventive measures

Subscribe to my newsletter

kash-dev

kash-dev