A Postmortem on Outdated Data and Slow Performance
Issue Summary:
Duration: The web application experienced intermittent performance issues for 24 hours, from May 9, 2024, 11:00 EAT to May 10, 2024, 11:00 EAT. Intermittent spikes were observed throughout this window.
Impact: Users encountered slow loading times and occasional web app outages; for much of the incident the services were either unavailable or noticeably degraded. An estimated 24% of users were affected.
Root Cause: An aggressive caching strategy, combined with a data corruption issue, resulted in outdated and incorrect data being delivered to the users.
Timeline:
May 9, 2024 11:00 EAT: Slow response times begin, with initial user reports indicating service unavailability and occasional web app crashes.
May 9, 2024 13:15 EAT: Engineers begin an initial investigation, working from assumptions about network latency and server performance issues.
Efforts are made to optimize server resources and troubleshoot network connectivity.
Misleading investigation: Extensive time is spent on the server and network, neither of which turns out to be the cause.
May 9, 2024 16:28 EAT: The issue persists, and the team decides to dig into the caching mechanisms.
May 9, 2024 17:00 EAT: With no significant findings, the incident is escalated to the DevOps team, which handles infrastructure, including the caching systems.
May 10, 2024 9:10 EAT: Further investigation reveals inconsistencies in the cached data and a large volume of outdated entries.
May 10, 2024 9:52 EAT: Root cause identified. A data corruption issue in the backend stack causes invalid data to be aggressively cached, resulting in inconsistencies being delivered to users.
May 10, 2024 11:00 EAT: The problem is resolved by implementing a tiered caching strategy and data validation checks.
Root Cause & Resolution:
The problem arose from a combination of issues: an overly aggressive caching strategy retained stale data, and a data corruption problem on the backend kept producing incorrect information that then ended up in the cache. The fix involved a more granular caching plan with shorter cache durations for frequently changing data. The backend corruption itself was resolved by repairing the database and verifying data integrity.
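As a concrete illustration of that fix, here is a minimal Python sketch of a tiered cache with shorter, per-tier expiration times and a validation hook that rejects corrupt data before it is cached. The tier names, TTL values, and the `TieredCache`/validator interfaces are hypothetical, not the application's actual code:

```python
import time

# Illustrative per-tier lifetimes (seconds); the tiers and values are
# assumptions for this sketch, not the production configuration.
TIER_TTLS = {
    "static": 3600,   # rarely-changing data can be cached longer
    "dynamic": 60,    # frequently-changing data expires quickly
}

class TieredCache:
    """In-memory cache with per-tier expiration and a validation hook."""

    def __init__(self, validator):
        self._store = {}  # key -> (value, expires_at)
        self._validator = validator

    def set(self, key, value, tier="dynamic"):
        # Data validation at the cache layer: refuse to cache invalid data.
        if not self._validator(value):
            raise ValueError(f"refusing to cache invalid data for {key!r}")
        expires_at = time.monotonic() + TIER_TTLS[tier]
        self._store[key] = (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict so stale data is never served
            return None
        return value

# Hypothetical usage: the validator encodes whatever "valid" means
# for the data being cached.
cache = TieredCache(validator=lambda v: isinstance(v, dict) and "id" in v)
cache.set("user:42", {"id": 42, "name": "Alice"}, tier="dynamic")
```

Validating at write time means corrupt backend output can never enter the cache in the first place, while short per-tier TTLs bound how long any stale entry can survive.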
Corrective & Preventative Measures:
Introduce data validation checks at the cache layer to identify and prevent invalid data from being cached.
Implement a tiered caching strategy with different cache expiration times depending on the type of data.
Schedule regular data integrity checks and database maintenance procedures (a sketch of such a scan follows this list).
Enhance monitoring of cache behaviour and performance metrics (also sketched after this list).
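For the data integrity checks, one possible shape of a scheduled scan; the `users` table, its columns, and the validation rules here are hypothetical, chosen only to show the pattern:

```python
import sqlite3

# Illustrative per-field rules; real rules would come from the schema.
VALIDATORS = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "created_at": lambda v: isinstance(v, str) and len(v) == 10,
}

def integrity_scan(conn: sqlite3.Connection) -> list:
    """Return ids of rows whose fields fail validation (repair candidates)."""
    conn.row_factory = sqlite3.Row
    bad_rows = []
    for row in conn.execute("SELECT id, email, created_at FROM users"):
        for field, is_valid in VALIDATORS.items():
            if not is_valid(row[field]):
                bad_rows.append(row["id"])
                break
    return bad_rows

# Hypothetical demo data: one valid row, one corrupt row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com', '2024-05-09')")
conn.execute("INSERT INTO users VALUES (2, 'not-an-email', '2024-05-09')")
print(integrity_scan(conn))  # [2]
```

Run on a schedule, a scan like this would have flagged the corrupt backend data before it was cached and served.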
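And to make the monitoring point concrete, a minimal sketch of cache metrics tracking; the event names and the `CacheMetrics` class are illustrative assumptions rather than any particular monitoring tool's API:

```python
from collections import Counter

class CacheMetrics:
    """Counts cache events so hit ratio and stale evictions can be watched."""

    def __init__(self):
        self._counts = Counter()

    def record(self, event: str) -> None:
        # Expected events (illustrative): "hit", "miss", "stale_eviction".
        self._counts[event] += 1

    def hit_ratio(self) -> float:
        lookups = self._counts["hit"] + self._counts["miss"]
        return self._counts["hit"] / lookups if lookups else 0.0

metrics = CacheMetrics()
metrics.record("miss")
metrics.record("hit")
# A sudden drop in hit ratio or a spike in stale evictions would have
# surfaced this incident's cache problems much earlier.
print(f"hit ratio: {metrics.hit_ratio():.0%}")  # hit ratio: 50%
```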