Atlan's Internship Challenge: How I Built a Cloud Cost Optimization Solution


When I saw the Platform Engineer Internship challenge by Atlan, I knew it was an opportunity to push myself. Among the given assignments, I chose the Cloud Cost Optimization problem — an area I had some confidence in but had never explored deeply in a structured, end-to-end way.
What followed was an intense but fulfilling few days where I designed, built, and iterated on a two-phase cost optimization strategy. It wasn’t just about reducing cloud bills — it was about building a scalable, safe, and observable system that could work in real production environments.
Where I Started: Understanding the Problem
Cloud spending can spiral quickly if left unchecked. I’ve seen it in side projects and heard about it from others running production workloads. So the idea of helping a company like Atlan manage costs intelligently felt meaningful.
My first goal was to design a solution that reduces cost without breaking things — especially critical infrastructure. That’s why I split my approach into two phases:
Start with manual approvals for cost actions — safe, controlled, and reviewable.
Once confident, switch to full automation — fast, efficient, and scalable.
Phase 1: Manual Approval – Playing It Safe
I started by wiring up cost detection and alerting. I used a combination of:
AWS Cost Anomaly Detection + EventBridge to flag unexpected spikes (an anomaly-polling sketch follows this list).
Kubecost + Prometheus Alertmanager to monitor Kubernetes-specific resource usage.
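To make the anomaly side concrete, here is a minimal sketch of pulling recent anomalies from the Cost Explorer API with boto3. It is a simplified stand-in for the EventBridge wiring I actually used, and the impact threshold is just a placeholder.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")  # Cost Explorer also exposes Cost Anomaly Detection data

def recent_anomalies(days: int = 7, min_impact: float = 50.0):
    """Return anomalies from the last `days` days whose total cost impact
    exceeds `min_impact` USD (placeholder threshold)."""
    end = date.today()
    start = end - timedelta(days=days)
    resp = ce.get_anomalies(
        DateInterval={"StartDate": start.isoformat(), "EndDate": end.isoformat()},
    )
    return [
        a for a in resp.get("Anomalies", [])
        if a["Impact"].get("TotalImpact", 0.0) >= min_impact
    ]

if __name__ == "__main__":
    for anomaly in recent_anomalies():
        print(anomaly["AnomalyId"], anomaly["Impact"].get("TotalImpact"))
```

In the real pipeline the anomaly event arrives via EventBridge rather than polling, but the data it carries is essentially what this query returns.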
Once alerts were in place, I needed a way to review and approve actions before they were executed. Here’s how I handled that:
Critical alerts (e.g., high-cost EC2 usage) went to Zenduty via Slack/email.
Non-critical alerts were routed through SendGrid emails with embedded approval buttons.
I used AWS API Gateway and Lambda to capture approvals and trigger the corresponding actions (a minimal handler sketch is just below).
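Here is roughly what the approval-capture Lambda behind API Gateway can look like. The query-string parameter names are illustrative; the real payload depends on how the approval links in the email are built.

```python
import json
import boto3

ec2 = boto3.client("ec2")

# Approval endpoint behind API Gateway. The parameter names ("action",
# "decision", "instance_id") are hypothetical examples, not a fixed contract.
def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    action = params.get("action")
    decision = params.get("decision")
    instance_id = params.get("instance_id")

    if decision != "approve":
        return {"statusCode": 200, "body": "Action rejected, nothing changed."}

    if action == "stop_instance" and instance_id:
        # Only act on an explicit approval for a known action type.
        ec2.stop_instances(InstanceIds=[instance_id])
        return {"statusCode": 200, "body": f"Stopping {instance_id}."}

    return {"statusCode": 400, "body": json.dumps({"error": "unknown action"})}
```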
Some of the actions I implemented (the idle-EC2 check is sketched after this list):
Stopping idle EC2 instances. (Based on a rough estimate, this alone could save ~$5,000/month in a moderately sized setup.)
Tuning Kubernetes settings (HPA, VPA, Karpenter) to prevent over-provisioning.
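For the idle-instance piece, a minimal sketch looks something like the following, using average CPU from CloudWatch as the idleness signal. The 5% threshold and 24-hour window are placeholders, not tuned values.

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def is_idle(instance_id: str, hours: int = 24, cpu_threshold: float = 5.0) -> bool:
    """Treat an instance as idle if its average CPU stayed below the
    threshold for the whole window (threshold and window are placeholders)."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return bool(points) and all(p["Average"] < cpu_threshold for p in points)

def stop_idle_instances():
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for r in reservations:
        for inst in r["Instances"]:
            if is_idle(inst["InstanceId"]):
                ec2.stop_instances(InstanceIds=[inst["InstanceId"]])
```

In Phase 1 this path only ran after an explicit approval came back through the pipeline above.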
This phase gave me confidence. It worked. It was safe. But I also realized its limitations — especially when cost spikes happened frequently and required fast action.
Phase 2: Going Fully Automated
Once the manual pipeline looked reliable, I moved to automation. My goal: real-time cost optimization, without human bottlenecks.
The key pieces of this phase were:
AWS Step Functions to orchestrate cost-saving actions (a state-machine sketch follows this list).
Prometheus to monitor optimization performance.
Grafana for visualizing trends.
Fluent Bit + VictoriaMetrics to filter and store relevant logs, keeping things lean.
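To make the orchestration concrete, here is a rough sketch of what the Step Functions state machine can look like, expressed in Amazon States Language and registered with boto3. The Lambda ARNs, role ARN, and service names in the choice rules are placeholders, not my exact workflow.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARNs: the real workflow wires in the actual remediation Lambdas.
RIGHTSIZE_EC2_ARN = "arn:aws:lambda:us-east-1:123456789012:function:rightsize-ec2"
TUNE_K8S_ARN = "arn:aws:lambda:us-east-1:123456789012:function:tune-k8s"

definition = {
    "Comment": "Route a detected cost anomaly to a remediation action",
    "StartAt": "ChooseAction",
    "States": {
        "ChooseAction": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.service", "StringEquals": "AmazonEC2", "Next": "RightsizeEC2"},
                {"Variable": "$.service", "StringEquals": "AmazonEKS", "Next": "TuneKubernetes"},
            ],
            "Default": "NothingToDo",
        },
        "RightsizeEC2": {"Type": "Task", "Resource": RIGHTSIZE_EC2_ARN, "End": True},
        "TuneKubernetes": {"Type": "Task", "Resource": TUNE_K8S_ARN, "End": True},
        "NothingToDo": {"Type": "Succeed"},
    },
}

sfn.create_state_machine(
    name="cost-anomaly-remediation",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/cost-remediation-sfn-role",  # placeholder
)
```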
Now, when a cost anomaly is detected:
A Step Functions workflow is triggered automatically.
Based on conditions, actions like EC2 rightsizing, RDS auto-scaling, or Kubernetes tuning are executed.
If something goes wrong (e.g., sudden spikes or unusual behavior), Prometheus alerts trigger a rollback through Argo CD (a rollback-hook sketch is just below).
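The rollback path can be as simple as a small webhook receiver that Alertmanager calls when an alert fires and that shells out to the Argo CD CLI. This is a hedged sketch rather than my exact setup; in particular, the `argocd_app` label is an assumption about how alerts are tagged.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal Alertmanager webhook receiver. When a firing alert carries an
# "argocd_app" label (label name is an assumption), roll that application
# back with the argocd CLI.
class RollbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alert in payload.get("alerts", []):
            if alert.get("status") != "firing":
                continue
            app = alert.get("labels", {}).get("argocd_app")
            if app:
                # Without a history ID, `argocd app rollback` targets the
                # previously deployed revision.
                subprocess.run(["argocd", "app", "rollback", app], check=False)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RollbackHandler).serve_forever()
```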
It felt incredible seeing this whole thing work in a loop — detect, act, verify — without manual intervention.
What Worked, What Didn't
What worked well:
Real-time detection: Reduced response time from hours or days to minutes.
Visibility: With Kubecost and AWS Cost Explorer, I could break costs down by pod, namespace, and AWS service.
Actionability: Tools like HPA, VPA, and Karpenter helped automate resource rightsizing.
What didn’t work at first:
My logging pipeline was noisy — I had to learn to tune Fluent Bit to avoid storing unnecessary logs.
In early tests, automation accidentally stopped useful workloads. I quickly added guardrails (e.g., environment labels, risk scoring) to keep it away from production-like services (sketched just below).
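Here is the shape of those guardrails in code. The tag names and scoring weights are illustrative rather than the exact scheme I used.

```python
# Guardrail applied before any automated stop/rightsize action.
# Tag names and weights below are examples, not a fixed policy.
PROTECTED_ENVS = {"prod", "production", "staging"}

def risk_score(tags: dict) -> int:
    score = 0
    if tags.get("env", "").lower() in PROTECTED_ENVS:
        score += 10
    if tags.get("critical", "").lower() == "true":
        score += 5
    if "owner" not in tags:
        score += 2  # unowned resources get a closer look, not an automatic action
    return score

def safe_to_automate(tags: dict, max_score: int = 3) -> bool:
    """Only let automation touch resources below the risk threshold."""
    return risk_score(tags) <= max_score

# Example: a labelled dev box passes, anything tagged prod does not.
assert safe_to_automate({"env": "dev", "owner": "madhur"})
assert not safe_to_automate({"env": "prod", "owner": "platform-team"})
```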
Challenges & Next Steps
One gap I noticed was the lack of multi-cloud support — the current setup is AWS-focused. If I had more time, I would:
Bring in OpenTelemetry for broader observability.
Integrate Cast.ai or similar tools for cross-cloud cost governance.
Also, while automation is great, I now realize the value of a hybrid approach:
Keep high-risk, production actions manual (with real-time alerting).
Let automation handle routine or low-risk optimizations.
Final Thoughts
Building this system was more than just a submission — it was a crash course in production-grade cost engineering. From designing the architecture to debugging Prometheus queries, I learned a lot.
Even though I didn’t make it past this round, I’m genuinely proud of the solution I built. It deepened my understanding of Kubernetes, observability, and cloud architecture — and I’d love to build on it further.
Thanks to Atlan for the opportunity. I hope to cross paths again — and until then, I’ll keep building.