A Postmortem on Outdated Data and Slow Performance

Boniface Munga

Issue Summary:

  • Duration: The web application experienced intermittent performance issues for 24 hours, from May 9, 2024, 11:00 EAT to May 10, 2024, 11:00 EAT, with intermittent spikes throughout that window.

  • Impact: Users encountered slow loading times and occasional web app outages; for much of the incident the services were either unavailable or noticeably degraded. An estimated 24% of users were affected.

  • Root Cause: An aggressive caching strategy, combined with a data corruption issue, resulted in outdated and incorrect data being delivered to the users.

Timeline:

  • Throughout the day: Slow response times, with initial user reports of service unavailability and occasional web app crashes.

  • May 9, 2024 13:15 EAT: The engineers began an initial investigation, focusing on suspected network latency and server performance issues. Efforts were made to optimize server resources and troubleshoot network connectivity.

  • Misleading investigation: Extensive time was spent on server and network diagnostics that later proved unrelated to the root cause.

  • May 9, 2024 16:28 EAT: The issue persisted, and the team decided to examine the caching mechanisms.

  • May 9, 2024 17:00 EAT: With no significant findings, the incident was escalated to the DevOps team, which handles infrastructure including the caching systems.

  • May 10, 2024 9:10 EAT: Further investigation revealed inconsistencies in the cached data and a large amount of outdated information.

  • May 10, 2024 9:52 EAT: Root cause identified: a data corruption issue in the backend caused invalid data to be aggressively cached, so stale and inconsistent data was delivered to users.

  • May 10, 2024 11:00 EAT: The problem was resolved by implementing a tiered caching strategy and data validation checks.

Root Cause & Resolution:

The problem arose from a combination of issues: an overly aggressive caching strategy kept serving old data, while a data corruption problem on the backend kept producing invalid records that then ended up cached. The fix involved a tiered caching plan with shorter cache durations, and the backend corruption was addressed by repairing the affected database records and adding data integrity checks.
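As a rough illustration of that resolution, here is a minimal, self-contained sketch in Python of a tiered, validated cache. It assumes an in-memory store; the data types, TTL values, and required fields are illustrative assumptions, not the production configuration.

```python
import time

# Illustrative per-type TTLs in seconds; these values are assumptions,
# not the figures used in the actual fix.
CACHE_TTLS = {
    "pricing": 60,          # volatile data: expire quickly
    "user_profile": 300,    # semi-static data: medium TTL
    "static_assets": 3600,  # rarely changing data: long TTL
}

class TieredCache:
    """Minimal in-memory cache with per-data-type expiration and a
    validation hook that rejects corrupt records before they are cached."""

    def __init__(self, ttls=CACHE_TTLS, default_ttl=120):
        self._store = {}            # key -> (expires_at, value)
        self._ttls = ttls
        self._default_ttl = default_ttl

    @staticmethod
    def _is_valid(value):
        # Hypothetical sanity check: require the fields the app depends on.
        return isinstance(value, dict) and {"id", "updated_at"} <= value.keys()

    def set(self, data_type, key, value):
        if not self._is_valid(value):
            raise ValueError(f"refusing to cache invalid {data_type!r} record {key!r}")
        ttl = self._ttls.get(data_type, self._default_ttl)
        self._store[key] = (time.time() + ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:  # expired: drop it and force a fresh read
            del self._store[key]
            return None
        return value
```

For example, `cache.set("pricing", "price:42", {"id": 42, "updated_at": "2024-05-10T09:52:00Z", "amount": 1999})` would be accepted, while a payload missing `updated_at` would be rejected instead of being silently cached and served stale.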

Corrective & Preventative Measures:

  • Introduce data validation checks at the cache layer to identify and prevent invalid data from being cached.

  • Implement a tiered caching strategy with different cache expiration times depending on the type of data.

  • Schedule regular data integrity checks and database maintenance procedures.

  • Enhance monitoring of cache behaviour and performance metrics.
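To make the monitoring point slightly more concrete, the sketch below (hypothetical metric names, in-memory counters only) shows the kind of cache metrics that would surface a recurrence: hit rate, expirations, and writes rejected by validation. In a real deployment these counters would be exported to a monitoring stack rather than held in memory.

```python
from collections import Counter

class CacheMetrics:
    """Hypothetical counters for cache behaviour; illustrative only."""

    EVENTS = ("hit", "miss", "expired", "rejected_invalid")

    def __init__(self):
        self.counts = Counter()

    def record(self, event):
        if event not in self.EVENTS:
            raise ValueError(f"unknown cache event: {event!r}")
        self.counts[event] += 1

    def hit_rate(self):
        hits = self.counts["hit"]
        lookups = hits + self.counts["miss"] + self.counts["expired"]
        return hits / lookups if lookups else 0.0
```

A sustained drop in hit rate, or a spike in `rejected_invalid` writes, would have flagged the corrupt-data-in-cache pattern well before user reports did.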
