Problem Statement

While managing a critical production workload on AWS, I noticed that one of our primary EC2 instances (running a multi-tiered web application) was repeatedly hitting 100% CPU utilization during peak hours. This resulted in increased latency, failed health checks, and occasional downtime for end users. The issue was not security-related but was severely impacting application performance and availability.

Initial Investigation

Symptoms: CloudWatch alarms were triggered for high CPU utilization. Users reported slow response times, and the instance occasionally failed AWS status checks.
Baseline Check: I reviewed historical CloudWatch metrics and observed that CPU usage was previously stable, with occasional spikes, but now it was consistently maxing out during business hours.
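Instance Type: The instance was a t3.medium, a burstable type with 2 vCPUs and a baseline of 20% utilization per vCPU. It accrued enough CPU credits under normal conditions, but sustained usage above the baseline spends credits faster than they are earned.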
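To make the burstable-credit point concrete, here is a rough back-of-the-envelope sketch in Python. The t3.medium figures (2 vCPUs, 24 credits earned per hour, 576 maximum balance) are AWS's published numbers and are worth re-checking against current documentation. The function name and the assumption that the instance runs in standard mode are mine, not something stated in this journal.

```python
from typing import Optional

# Published T3 figures for t3.medium (verify against current AWS docs).
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24
MAX_CREDIT_BALANCE = 576


def hours_until_credits_exhausted(utilization_pct: int,
                                  balance: float = MAX_CREDIT_BALANCE) -> Optional[float]:
    """Hours a standard-mode t3.medium can sustain an average CPU percentage.

    One CPU credit equals one vCPU running at 100% for one minute.
    Returns None when credits are earned at least as fast as they are spent.
    """
    spent_per_hour = VCPUS * utilization_pct * 60 / 100
    net_drain = spent_per_hour - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return None  # at or below baseline: the balance never runs out
    return balance / net_drain


if __name__ == "__main__":
    for pct in (20, 60, 100):
        print(f"{pct}% average CPU ->", hours_until_credits_exhausted(pct))
```

Starting from a full balance, a pegged instance would exhaust its credits in roughly six hours, which is shorter than a typical business day. In unlimited mode (the default for T3) the instance is not throttled at that point, but the surplus credits are billed instead.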

Root Cause Analysis

Process Analysis: Using SSH, I accessed the instance and ran top and htop to identify the processes consuming the most CPU. A Java application (a Spring Boot service) accounted for over 80% of CPU usage during spikes.
Log Review: Application logs revealed that a new feature was recently rolled out, which triggered complex background calculations and increased database queries.
Network Activity: Network monitoring via iftop and netstat showed increased traffic, but not enough to explain the CPU spikes.
Database Impact: The application was making repetitive, inefficient queries to a MySQL database, causing both the app and database to strain under load.

Comparative Research

Community Insights: Other cloud engineers reported similar issues: CPU spikes caused by application logic changes, inefficient code, or unexpected traffic (e.g., web crawlers).
Solutions Tried by Others: Many recommended profiling application code, optimizing queries, scaling up instance sizes, or implementing auto-scaling groups.
Common Pitfalls: Engineers noted that simply rebooting or resizing instances might provide temporary relief but not address the root cause.

Innovative Solution Process

Application Profiling:
- I used Java Flight Recorder and VisualVM to profile the Spring Boot application and identified a specific background job that was running inefficient loops.
- The job was recalculating data that had not changed, resulting in redundant CPU cycles.
Query Optimization:
- Reviewed MySQL slow query logs and added appropriate indexes to the tables most frequently queried.
- Implemented query caching for repeated requests.
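The journal does not say where the caching lived, so the following is only a minimal sketch of one option: an application-level cache with a time-to-live, written in Python for brevity even though the service itself is Java. The class name `TTLCache`, the `get_or_load` method, and the injectable clock are all hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: skip the database entirely
        value = loader()   # miss or expired: run the real query once
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)
    # The lambda stands in for a real database call.
    rows = cache.get_or_load(("orders", 42), lambda: [{"id": 42}])
    print(rows)
```

A cache like this trades freshness for load: results can be up to one TTL out of date, so writes that must be visible immediately would need to evict the affected keys.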
Code Refactoring:
- Refactored the background job to only recalculate data when necessary, using a last-modified timestamp comparison.
- Introduced rate-limiting for the feature to prevent sudden spikes in CPU usage.
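The two refactoring steps above can be sketched roughly as follows. This is an illustration in Python rather than the actual Spring Boot code, and every name in it (`Record`, `RecalculationJob`, `TokenBucket`) is hypothetical. The token bucket in particular is an assumption: the journal only says rate-limiting was introduced, not which algorithm was used.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Record:
    record_id: str
    last_modified: float  # epoch seconds of the latest change to the source data


class RecalculationJob:
    """Recalculates a record only if it changed since the previous run."""

    def __init__(self, compute: Callable[[Record], None]):
        self._compute = compute
        self._last_processed: Dict[str, float] = {}

    def run(self, records: Iterable[Record]) -> int:
        recalculated = 0
        for rec in records:
            seen = self._last_processed.get(rec.record_id)
            if seen is not None and rec.last_modified <= seen:
                continue  # unchanged since the last run: skip the expensive work
            self._compute(rec)
            self._last_processed[rec.record_id] = rec.last_modified
            recalculated += 1
        return recalculated


class TokenBucket:
    """Allows `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._updated) * self._rate)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In this sketch the job keeps its bookkeeping in memory, so a restart would trigger one full recalculation; persisting the last-processed timestamps would avoid that.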
Infrastructure Adjustments:
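- Provisioned a compute-optimized instance (c5.large) for the application tier. It has the same 2 vCPUs as the t3.medium but delivers sustained performance rather than credit-based bursting, which suits the increased load during peak hours.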
- Implemented an Auto Scaling Group with CloudWatch alarms to scale out during high demand and scale in during off-peak hours.
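The exact scaling policy is not given here, so the following is a sketch of one common way to wire it up, under stated assumptions. It uses a target tracking policy, where Auto Scaling creates and manages the CloudWatch alarms itself, which may differ from the hand-built alarms described above. The group name `web-app-asg` and the 50% target are placeholders. The keyword arguments follow boto3's `put_scaling_policy` call for the `autoscaling` client.

```python
from typing import Any, Dict


def cpu_target_tracking_policy(asg_name: str,
                               target_cpu_percent: float) -> Dict[str, Any]:
    """Build keyword arguments for the Auto Scaling PutScalingPolicy call."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average CPU utilization across the instances in the group.
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu_percent,
        },
    }


def apply_policy(policy: Dict[str, Any]) -> None:
    """Send the policy to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("autoscaling").put_scaling_policy(**policy)


if __name__ == "__main__":
    # "web-app-asg" is a hypothetical group name; 50.0 is an assumed target.
    print(cpu_target_tracking_policy("web-app-asg", 50.0))
```

Keeping the policy as plain data separate from the API call makes it easy to review or unit test before anything touches the AWS account.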
Monitoring and Alerting:
- Enhanced monitoring by integrating Datadog for deeper application performance insights.
- Set up Slack alerts for abnormal CPU and query patterns.
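The alerts may well have been configured through Datadog's own Slack integration. Purely as an illustration of the underlying mechanism, here is a minimal sketch that posts a message to a Slack incoming webhook using only the Python standard library. The function names, the message wording, and the instance ID are assumptions, and the webhook URL would come from your own Slack workspace.

```python
import json
import urllib.request


def build_alert_payload(instance_id: str, cpu_percent: float,
                        threshold: float) -> dict:
    """Format a CPU alert as a Slack incoming-webhook message body."""
    return {
        "text": f":warning: {instance_id} CPU at {cpu_percent:.1f}% "
                f"(threshold {threshold:.0f}%)"
    }


def send_slack_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical instance ID; printing avoids needing a real webhook URL.
    print(build_alert_payload("i-0123456789abcdef0", 93.4, 80))
```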

Resolution and Impact

Performance Gains: After deploying the code changes and infrastructure updates, CPU utilization dropped to a healthy 40–60% during peak hours.
Reliability: The instance no longer failed health checks, and application uptime improved significantly.
Productivity: The team became more proactive in monitoring and profiling new features before deployment.
Positive Side Effects: The new monitoring setup helped identify and resolve other inefficiencies, further improving overall system performance.

Lessons Learned

Monitor Before and After: Always establish a performance baseline before rolling out new features.
Profile Early: Use profiling tools to catch inefficiencies before they impact production.
Collaborate: Engage with the broader cloud engineering community to learn from their experiences and solutions.
Automate Scaling: Implement auto-scaling and robust monitoring to handle unexpected load gracefully.

Final Thoughts

This experience reinforced the importance of thorough testing, continuous monitoring, and community collaboration. By combining code optimization with smart infrastructure choices, I was able to resolve a persistent performance issue and improve the reliability of our cloud environment.

Cloud Engineer’s Journal: Unexpected High CPU Utilization on Production EC2 Instance

Sameer i
