Postmortem Report: E-commerce Website Outage - A Rollercoaster Ride in the Digital Realm

Samuel Ogboye

Introduction:

A postmortem report, also known as a post-mortem analysis or simply a postmortem, is a document that provides a detailed and systematic review of a project, event, process, or situation after it has concluded or failed. It is used to assess what went well, what went wrong, and what lessons can be learned for future improvement. Postmortem reports are commonly employed in various fields, such as project management, software development, healthcare, and business, to help teams and organizations gain insights and make informed decisions for future endeavors.

In our interconnected digital landscape, where the online world keeps our businesses and daily lives running smoothly, imagine the chaos when the very foundation of a critical e-commerce website starts to crumble. It's like a rollercoaster ride of unexpected twists and turns, where the thrill is replaced with anxiety as the site slows down and essential features become erratic. This postmortem report takes you on a wild journey through a real-life ordeal, when our e-commerce website took an unexpected rollercoaster ride due to a database server failure. Buckle up as we explore the highs and lows, the root cause, and the lessons learned from this digital adventure.

Issue Summary:

  • Duration:
    Buckle up for a wild ride: the outage ran from 2023-11-03 14:30 UTC to 2023-11-03 18:45 UTC, a full 4 hours and 15 minutes.

  • Impact:
    Hold on to your hats! Our e-commerce website experienced turbulence, causing slow loading times and reduced functionality. Approximately 30% of our users found themselves on a bumpy road, resulting in a heart-stopping loss of potential revenue.

  • Root Cause:
    What's a rollercoaster without a twist? The culprit was a database server that lost its lunch, thanks to a sudden traffic surge colliding with an unexpected, unreviewed configuration change.
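
The report doesn't name the exact setting that was flipped, but to make the failure mode concrete, here is a minimal, purely hypothetical sketch, assuming a Python service using SQLAlchemy, of how a single connection-pool change can turn an ordinary traffic surge into slow pages and stalled features. None of the values or the connection URL below come from our real configuration.

```python
# Hypothetical illustration only: the report does not name the parameter that
# was actually changed. Assumes a Python service using SQLAlchemy; none of
# these values come from our real configuration.
from sqlalchemy import create_engine

# Before the change: the pool comfortably absorbs normal afternoon traffic.
healthy_engine = create_engine(
    "postgresql://app:secret@db.internal/shop",  # hypothetical URL
    pool_size=50,       # connections kept open for the web tier
    max_overflow=20,    # extra connections allowed during spikes
    pool_timeout=30,    # seconds a request will wait for a free connection
)

# After the unreviewed change: the same surge now queues behind a tiny pool,
# so requests wait, pages load slowly, and some features simply time out.
degraded_engine = create_engine(
    "postgresql://app:secret@db.internal/shop",
    pool_size=5,        # far too small for peak traffic
    max_overflow=0,     # no headroom at all
    pool_timeout=5,     # requests give up quickly and surface errors
)
```

The specific numbers don't matter; the shape of the ride does - one quiet change plus one busy afternoon.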

Timeline:

  • Detected:
    Imagine a sudden drop - the issue plummeted onto our radar at 14:30 UTC when monitoring alerts screamed louder than a rollercoaster enthusiast (a sketch of the kind of latency check that fired appears right after this timeline).

  • Actions Taken:

    1. Our fearless engineer, Sarah, leaped into action, ready for the ride of her life.

    2. First loop: Sarah thought the traffic surge was just part of the ride, so she started scaling up resources.

    3. As the rollercoaster continued its unexpected twists and turns, Sarah pulled in the database team.

    4. The database team discovered that someone had sneakily modified the coaster's configuration, turning our smooth ride into a wild adventure.

  • Misleading Investigation/Debugging Paths:
    At first, we thought we were just on a thrilling ride. We spent valuable time scaling resources when we should have been checking the nuts and bolts of our configuration.

  • Escalation:
    Our rollercoaster ride reached new heights when we called in the database team for an adrenaline-fueled investigation.

  • Resolution:

    1. The database team got to the root of our wild ride - a configuration change that went off the rails.

    2. They hit the emergency brakes and rolled us back to a stable configuration.

    3. After a brief pause, they sent the coaster on its way again, and at 18:45 UTC, we finally stepped off the rollercoaster, feeling a little queasy but wiser.
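
The report doesn't say which monitoring stack screamed at 14:30 UTC, so here is a minimal sketch of the kind of check that could have fired, assuming a plain HTTP latency probe written in Python with the requests library. The storefront URL, sample count, and 2-second threshold are all hypothetical.

```python
# Minimal sketch of a latency probe like the one that flagged the outage.
# Assumptions (not from the report): Python + requests, a hypothetical
# storefront URL, and a 2-second page-load threshold.
import statistics
import requests

STOREFRONT_URL = "https://shop.example.com/"  # hypothetical
LATENCY_THRESHOLD_S = 2.0                     # hypothetical page-load budget
SAMPLES = 10

def probe_latency(url: str, samples: int) -> list[float]:
    """Time a handful of GET requests against the storefront."""
    timings = []
    for _ in range(samples):
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        timings.append(response.elapsed.total_seconds())
    return timings

def check_and_alert() -> None:
    timings = probe_latency(STOREFRONT_URL, SAMPLES)
    p95 = statistics.quantiles(timings, n=20)[18]  # rough 95th percentile
    if p95 > LATENCY_THRESHOLD_S:
        # In production this would page the on-call engineer, Sarah in our case.
        print(f"ALERT: p95 page load {p95:.2f}s exceeds {LATENCY_THRESHOLD_S}s")

if __name__ == "__main__":
    check_and_alert()
```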

Root Cause and Resolution:

  • Root Cause:
    This ride's got a plot twist! It was a mischievous, unauthorized configuration change on the database server, which left it unable to cope with the afternoon traffic surge and turned our pleasant ride into a wild adventure.

  • Resolution:
    The database team yanked out the misbehaving configuration, tightened the bolts, and restarted the coaster. The result? A smooth, enjoyable experience for our users once again.
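
The report doesn't spell out how the emergency brakes were applied, so here is a minimal sketch of the rollback step, assuming the database configuration is tracked in git and the server runs under systemd. The repository path, known-good tag, config file location, and service name are all hypothetical.

```python
# Minimal sketch of the "hit the emergency brakes" rollback described above.
# Assumptions (not from the report): config tracked in git, a hypothetical
# last-known-good tag, and a systemd-managed database service.
import shutil
import subprocess

REPO_DIR = "/opt/db-config"             # hypothetical config repository
KNOWN_GOOD_TAG = "config-2023-11-01"    # hypothetical last-known-good tag
CONFIG_SRC = f"{REPO_DIR}/postgresql.conf"
CONFIG_DST = "/etc/postgresql/15/main/postgresql.conf"  # hypothetical path
DB_SERVICE = "postgresql"               # hypothetical service name

def rollback_configuration() -> None:
    """Restore the last known-good configuration and restart the database."""
    # Check out the known-good configuration from version control.
    subprocess.run(["git", "-C", REPO_DIR, "checkout", KNOWN_GOOD_TAG], check=True)
    # Put the stable file back in place.
    shutil.copyfile(CONFIG_SRC, CONFIG_DST)
    # Restart the database so the rolled-back settings take effect.
    subprocess.run(["systemctl", "restart", DB_SERVICE], check=True)

if __name__ == "__main__":
    rollback_configuration()
```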

Corrective and Preventative Measures:

  • Improvements/Fixes:

    1. We're getting better seat belts - implement stricter change management procedures for a smoother ride.

    2. More cameras on the tracks - enhance our monitoring and alerting systems to spot sharp turns and steep drops.

  • Tasks:

    1. Let's review the rollercoaster manual - I mean, the configuration change - and document the lessons learned from this wild ride.

    2. Upgrade to the deluxe package - more testing and peer review for configuration changes in the future.

    3. Our rollercoaster needs a boost - improve database server resource scaling to handle unexpected thrills.

    4. Map out the ride - enhance our incident response documentation for future adventurers.

    5. Strap in tight - review and update our monitoring and alerting rules to spot configuration anomalies and loop-de-loops.
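
To make task 5 concrete, here is a minimal sketch of a configuration-drift check that compares live database settings against an approved baseline, assuming a PostgreSQL server and the psycopg2 driver; the report doesn't name either, and the baseline values below are hypothetical.

```python
# Minimal sketch of a configuration-drift alert (task 5 above).
# Assumptions (not from the report): PostgreSQL, the psycopg2 driver, and a
# hypothetical baseline of approved settings kept under change management.
import psycopg2

# Hypothetical approved baseline; in practice this would live in the same
# repository that the change management process reviews.
APPROVED_SETTINGS = {
    "max_connections": "200",
    "shared_buffers": "4GB",
    "work_mem": "64MB",
}

def check_config_drift(dsn: str) -> list[str]:
    """Compare live server settings against the approved baseline."""
    drift = []
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for name, expected in APPROVED_SETTINGS.items():
                cur.execute("SELECT current_setting(%s)", (name,))
                actual = cur.fetchone()[0]
                if actual != expected:
                    drift.append(f"{name}: expected {expected}, found {actual}")
    return drift

if __name__ == "__main__":
    # Hypothetical connection string for a read-only monitoring user.
    for problem in check_config_drift("dbname=shop user=monitor host=db.internal"):
        print(f"ALERT: configuration drift detected -> {problem}")
```

Wiring a check like this into the alerting rules from improvement 2 would have turned the sneaky coaster modification into a loud, early warning rather than a four-hour ride.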
