Postmortem Report: The Great Database Connection Crisis 🐢


Issue Summary

Duration:

  • Start: August 17, 2024, 14:00 UTC

  • End: August 17, 2024, 16:30 UTC
    (Yes, it was a long and painful 2 hours and 30 minutes!)

Impact:
Imagine trying to board a bus with a thousand other people, but there’s only one bus with five seats. That’s what our users experienced: 80% of them were stuck waiting in line, unable to log in or perform transactions. Error messages were everywhere, and frustration levels were through the roof.

Root Cause:
A small but mighty misconfiguration in the database connection pool: the maximum number of connections was set so low that even a tortoise could have outpaced our throughput, leaving users stranded without a way in.

Timeline

  • 14:00 UTC:
    "Houston, we have a problem!"—Our monitoring system sent an SOS signal, showing that database queries were taking longer than a toddler’s nap.

  • 14:05 UTC:
    An engineer checked the logs and saw a flood of failed requests. Cue the dramatic background music.

  • 14:10 UTC:
    First suspicion: Network issues. "Is the internet broken?" Nope, everything checked out fine. The plot thickens.

  • 14:30 UTC:
    We started looking into recent changes, including a tweak in the database connection pool. "Nah, that couldn’t be it...could it?"

  • 15:00 UTC:
    After chasing wild geese and red herrings, we circled back to the connection pool. Ding ding ding! We found our culprit: an embarrassingly low max connection setting.

  • 15:15 UTC:
    Time to call in the big guns: the database team. "Help! Our database is slower than dial-up!"

  • 15:30 UTC:
    They swooped in, adjusted the settings, and BOOM—problem solved! The tortoise finally decided to run.

  • 16:00 UTC:
    The system was back up, users could log in, and the sun shone once again.

  • 16:30 UTC:
    We beefed up our monitoring, poured a coffee, and took a much-needed break.

Root Cause and Resolution

Root Cause:
The connection pool was configured with barely more slots than a kiddie pool has room for swimmers. Every user was fighting for the same few connections, causing a massive bottleneck: the database was overbooked, and we were left with a digital traffic jam.
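
To make that concrete: the post doesn't name our stack, so here is a purely hypothetical sketch of what the offending setting looks like with a SQLAlchemy-style pool (the URL and numbers are invented for illustration):

```python
from sqlalchemy import create_engine

# Hypothetical "before" configuration -- a pool far too small for real traffic.
engine = create_engine(
    "postgresql://app:secret@db.example.internal/appdb",  # placeholder URL
    pool_size=5,      # only five persistent connections: the five-seat bus
    max_overflow=0,   # no burst capacity beyond the pool
    pool_timeout=30,  # everyone else waits 30 s for a seat, then gets an error
)
```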

Resolution:
We bumped the maximum number of connections up to a far more reasonable level, so everyone could get a seat on the bus, and tightened the connection timeout settings so idle connections didn’t overstay their welcome. Problem solved: the database was back to running smoother than butter on a hot pancake.
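
A minimal sketch of the fix in the same hypothetical SQLAlchemy setup; the exact values below are illustrative, not our production numbers:

```python
from sqlalchemy import create_engine

# Hypothetical "after" configuration -- more seats and stricter housekeeping.
engine = create_engine(
    "postgresql://app:secret@db.example.internal/appdb",  # placeholder URL
    pool_size=50,       # enough seats on the bus for normal traffic
    max_overflow=20,    # temporary extra connections to absorb spikes
    pool_timeout=10,    # fail fast instead of queueing forever
    pool_recycle=1800,  # recycle connections after 30 minutes so idle ones don't linger
)
```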

Corrective and Preventative Measures

Improvements Needed:

  • Enhanced Monitoring: We’ll keep a closer eye on connection usage, so we don’t get caught off guard again (a rough sketch of such a check follows this list).

  • Configuration Reviews: No more “set it and forget it.” We’ll regularly review and tweak our settings to match our traffic.

  • Load Testing: We’ll simulate heavy traffic more often to catch potential issues before they catch us.
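
For the monitoring bullet, here is a rough sketch of the kind of check we have in mind, again assuming the hypothetical SQLAlchemy engine from the snippets above; `check_pool_usage`, `capacity`, and `warn_ratio` are illustrative names, not our actual tooling:

```python
import logging

log = logging.getLogger("pool-monitor")

def check_pool_usage(engine, capacity, warn_ratio=0.8):
    """Warn when checked-out connections creep toward the configured ceiling."""
    in_use = engine.pool.checkedout()  # connections currently handed out by the pool
    if in_use >= capacity * warn_ratio:
        log.warning(
            "Connection pool at %d/%d (%.0f%%) -- nearing capacity",
            in_use, capacity, 100 * in_use / capacity,
        )

# Example: run every minute from a scheduler, with capacity = pool_size + max_overflow.
# check_pool_usage(engine, capacity=70)
```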

Task List:

  1. Patch the Connection Pool:
    Apply the new settings across all environments to keep everything running smoothly.

  2. Add Connection Usage Monitoring:
    Set up alarms for when we’re nearing capacity, so we can act before it’s too late.

  3. Schedule Regular Load Tests:
    Quarterly stress tests to make sure our system can handle peak loads without breaking a sweat (a bare-bones example follows this list).

  4. Revamp Configuration Management:
    From now on, every change gets a double-check to avoid another crisis like this.
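
For the quarterly stress tests in task 3, even something this bare-bones could have surfaced the original bottleneck early; the staging URL, credentials, and request counts are invented for the example:

```python
import concurrent.futures

import requests

LOGIN_URL = "https://staging.example.com/api/login"  # hypothetical staging endpoint

def simulated_login(_):
    """Fire one login request and report its HTTP status code."""
    resp = requests.post(
        LOGIN_URL,
        json={"user": "loadtest", "password": "not-a-real-password"},
        timeout=10,
    )
    return resp.status_code

# 2,000 logins across 200 concurrent workers -- enough to swamp a five-seat pool.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as workers:
    codes = list(workers.map(simulated_login, range(2000)))

errors = sum(1 for code in codes if code >= 500)
print(f"{errors} server errors out of {len(codes)} simulated logins")
```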
