Atlan's Internship Challenge: How I Built a Cloud Cost Optimization Solution


When I saw the Platform Engineer Internship challenge by Atlan, I knew it was an opportunity to push myself. Among the given assignments, I chose the Cloud Cost Optimization problem — an area I had some confidence in but had never explored deeply in a structured, end-to-end way.
What followed was an intense but fulfilling few days where I designed, built, and iterated on a two-phase cost optimization strategy. It wasn’t just about reducing cloud bills — it was about building a scalable, safe, and observable system that could work in real production environments.
Where I Started: Understanding the Problem
Cloud spending can spiral quickly if left unchecked. I’ve seen it in side projects and heard about it from others running production workloads. So the idea of helping a company like Atlan manage costs intelligently felt meaningful.
My first goal was to design a solution that reduces cost without breaking things — especially critical infrastructure. That’s why I split my approach into two phases:
Start with manual approvals for cost actions — safe, controlled, and reviewable.
Once confident, switch to full automation — fast, efficient, and scalable.
Phase 1: Manual Approval – Playing It Safe
I started by wiring up cost detection and alerting. I used a combination of:
AWS Cost Anomaly Detection + EventBridge to flag unexpected spikes (an anomaly-polling sketch follows this list).
Kubecost + Prometheus Alertmanager to monitor Kubernetes-specific resource usage.
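To make the anomaly side concrete, here is a minimal sketch of pulling recent anomalies from the Cost Explorer API with boto3. It is a simplified stand-in for the EventBridge wiring I actually used, and the impact threshold is just a placeholder.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")  # Cost Explorer also exposes Cost Anomaly Detection data

def recent_anomalies(days: int = 7, min_impact: float = 50.0):
    """Return anomalies from the last `days` days whose total cost impact
    exceeds `min_impact` USD (placeholder threshold)."""
    end = date.today()
    start = end - timedelta(days=days)
    resp = ce.get_anomalies(
        DateInterval={"StartDate": start.isoformat(), "EndDate": end.isoformat()},
    )
    return [
        a for a in resp.get("Anomalies", [])
        if a["Impact"].get("TotalImpact", 0.0) >= min_impact
    ]

if __name__ == "__main__":
    for anomaly in recent_anomalies():
        print(anomaly["AnomalyId"], anomaly["Impact"].get("TotalImpact"))
```

In the real pipeline the anomaly event arrives via EventBridge rather than polling, but the data it carries is essentially what this query returns.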
Once alerts were in place, I needed a way to review and approve actions before they were executed. Here’s how I handled that:
Critical alerts (e.g., high-cost EC2 usage) went to Zenduty via Slack/email.
Non-critical alerts were routed through SendGrid emails with embedded approval buttons.
I used AWS API Gateway and Lambda to capture approvals and trigger the corresponding actions (a minimal handler sketch is just below).
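Here is roughly what the approval-capture Lambda behind API Gateway can look like. The query-string parameter names are illustrative; the real payload depends on how the approval links in the email are built.

```python
import json
import boto3

ec2 = boto3.client("ec2")

# Approval endpoint behind API Gateway. The parameter names ("action",
# "decision", "instance_id") are hypothetical examples, not a fixed contract.
def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    action = params.get("action")
    decision = params.get("decision")
    instance_id = params.get("instance_id")

    if decision != "approve":
        return {"statusCode": 200, "body": "Action rejected, nothing changed."}

    if action == "stop_instance" and instance_id:
        # Only act on an explicit approval for a known action type.
        ec2.stop_instances(InstanceIds=[instance_id])
        return {"statusCode": 200, "body": f"Stopping {instance_id}."}

    return {"statusCode": 400, "body": json.dumps({"error": "unknown action"})}
```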
Some of the actions I implemented (the idle-EC2 check is sketched after this list):
Stopping idle EC2 instances. (Based on a rough estimate, this alone could save ~$5,000/month in a moderately sized setup.)
Tuning Kubernetes settings (HPA, VPA, Karpenter) to prevent over-provisioning.
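For the idle-instance piece, a minimal sketch looks something like the following, using average CPU from CloudWatch as the idleness signal. The 5% threshold and 24-hour window are placeholders, not tuned values.

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def is_idle(instance_id: str, hours: int = 24, cpu_threshold: float = 5.0) -> bool:
    """Treat an instance as idle if its average CPU stayed below the
    threshold for the whole window (threshold and window are placeholders)."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return bool(points) and all(p["Average"] < cpu_threshold for p in points)

def stop_idle_instances():
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for r in reservations:
        for inst in r["Instances"]:
            if is_idle(inst["InstanceId"]):
                ec2.stop_instances(InstanceIds=[inst["InstanceId"]])
```

In Phase 1 this path only ran after an explicit approval came back through the pipeline above.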
This phase gave me confidence. It worked. It was safe. But I also realized its limitations — especially when cost spikes happened frequently and required fast action.
Phase 2: Going Fully Automated
Once the manual pipeline looked reliable, I moved to automation. My goal: real-time cost optimization, without human bottlenecks.
The key pieces of this phase were:
AWS Step Functions to orchestrate cost-saving actions (a state-machine sketch follows this list).
Prometheus to monitor optimization performance.
Grafana for visualizing trends.
Fluent Bit + VictoriaMetrics to filter and store relevant logs, keeping things lean.
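To make the orchestration concrete, here is a rough sketch of what the Step Functions state machine can look like, expressed in Amazon States Language and registered with boto3. The Lambda ARNs, role ARN, and service names in the choice rules are placeholders, not my exact workflow.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARNs: the real workflow wires in the actual remediation Lambdas.
RIGHTSIZE_EC2_ARN = "arn:aws:lambda:us-east-1:123456789012:function:rightsize-ec2"
TUNE_K8S_ARN = "arn:aws:lambda:us-east-1:123456789012:function:tune-k8s"

definition = {
    "Comment": "Route a detected cost anomaly to a remediation action",
    "StartAt": "ChooseAction",
    "States": {
        "ChooseAction": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.service", "StringEquals": "AmazonEC2", "Next": "RightsizeEC2"},
                {"Variable": "$.service", "StringEquals": "AmazonEKS", "Next": "TuneKubernetes"},
            ],
            "Default": "NothingToDo",
        },
        "RightsizeEC2": {"Type": "Task", "Resource": RIGHTSIZE_EC2_ARN, "End": True},
        "TuneKubernetes": {"Type": "Task", "Resource": TUNE_K8S_ARN, "End": True},
        "NothingToDo": {"Type": "Succeed"},
    },
}

sfn.create_state_machine(
    name="cost-anomaly-remediation",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/cost-remediation-sfn-role",  # placeholder
)
```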
Now, when a cost anomaly is detected:
A Step Functions workflow is triggered automatically.
Based on conditions, actions like EC2 rightsizing, RDS auto-scaling, or Kubernetes tuning are executed.
If something goes wrong (e.g., sudden spikes or unusual behavior), Prometheus alerts trigger a rollback through Argo CD (a rollback-hook sketch is just below).
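The rollback path can be as simple as a small webhook receiver that Alertmanager calls when an alert fires and that shells out to the Argo CD CLI. This is a hedged sketch rather than my exact setup; in particular, the `argocd_app` label is an assumption about how alerts are tagged.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal Alertmanager webhook receiver. When a firing alert carries an
# "argocd_app" label (label name is an assumption), roll that application
# back with the argocd CLI.
class RollbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alert in payload.get("alerts", []):
            if alert.get("status") != "firing":
                continue
            app = alert.get("labels", {}).get("argocd_app")
            if app:
                # Without a history ID, `argocd app rollback` targets the
                # previously deployed revision.
                subprocess.run(["argocd", "app", "rollback", app], check=False)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RollbackHandler).serve_forever()
```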
It felt incredible seeing this whole thing work in a loop — detect, act, verify — without manual intervention.
What Worked, What Didn't
What worked well:
Real-time detection: Reduced response time from hours or days to minutes.
Visibility: With Kubecost and AWS Cost Explorer, I could break costs down by pod, namespace, and AWS service.
Actionability: Tools like HPA, VPA, and Karpenter helped automate resource rightsizing.
What didn’t work at first:
My logging pipeline was noisy — I had to learn to tune Fluent Bit to avoid storing unnecessary logs.
In early tests, automation accidentally stopped useful workloads. I quickly added guardrails (e.g., environment labels, risk scoring) to keep it away from production-like services (sketched just below).
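Here is the shape of those guardrails in code. The tag names and scoring weights are illustrative rather than the exact scheme I used.

```python
# Guardrail applied before any automated stop/rightsize action.
# Tag names and weights below are examples, not a fixed policy.
PROTECTED_ENVS = {"prod", "production", "staging"}

def risk_score(tags: dict) -> int:
    score = 0
    if tags.get("env", "").lower() in PROTECTED_ENVS:
        score += 10
    if tags.get("critical", "").lower() == "true":
        score += 5
    if "owner" not in tags:
        score += 2  # unowned resources get a closer look, not an automatic action
    return score

def safe_to_automate(tags: dict, max_score: int = 3) -> bool:
    """Only let automation touch resources below the risk threshold."""
    return risk_score(tags) <= max_score

# Example: a labelled dev box passes, anything tagged prod does not.
assert safe_to_automate({"env": "dev", "owner": "madhur"})
assert not safe_to_automate({"env": "prod", "owner": "platform-team"})
```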
Challenges & Next Steps
One gap I noticed was the lack of multi-cloud support — the current setup is AWS-focused. If I had more time, I would:
Bring in OpenTelemetry for broader observability.
Integrate Cast.ai or similar tools for cross-cloud cost governance.
Also, while automation is great, I now realize the value of a hybrid approach:
Keep high-risk, production actions manual (with real-time alerting).
Let automation handle routine or low-risk optimizations.
Final Thoughts
Building this system was more than just a submission — it was a crash course in production-grade cost engineering. From designing the architecture to debugging Prometheus queries, I learned a lot.
Even though I didn’t make it past this round, I’m genuinely proud of the solution I built. It deepened my understanding of Kubernetes, observability, and cloud architecture — and I’d love to build on it further.
Thanks to Atlan for the opportunity. I hope to cross paths again — and until then, I’ll keep building.