How to Manage and Repair a Cloud Infrastructure You Inherited

Ismail Kovvuru
12 min read

There’s a moment every DevOps engineer dreads: logging into an unfamiliar cloud console only to find dozens of running instances, no tags, no clear owners, and no documentation — just the vague statement, “You manage it now.”

If you stay in this field long enough, you’ll face this. Teams change. Infrastructure outlives its creators. Ownership gets reassigned without context. I’ve inherited enough neglected cloud setups to know the chaos by heart — and how to keep it from turning into a production outage.

Here’s exactly what I did the last time I got handed an undocumented cloud mess — and how you can survive the same without losing sleep, data, or your sanity.

Why Following These Steps Matters

Inherited cloud infrastructure is one of the most common and riskiest situations a DevOps engineer will face. When teams grow, people leave, or companies pivot, critical systems often outlive the people who built them. What’s left behind is usually a patchwork of undocumented resources, outdated pipelines, fragile state files, and hidden costs that can break production overnight.

If you don’t have a methodical way to assess, fix, and secure what you inherit, you risk surprises that cost money, cause outages, or create security gaps no one sees until it’s too late.

Each of the steps below exists for a simple reason: to reduce risk, avoid unplanned downtime, and make sure your infrastructure is stable, secure, and understandable for you and for whoever inherits it next.

Treat these not as one-time tasks, but as repeatable habits that turn chaos into maintainable systems — even when you didn’t build them in the first place.

1. First Things First: Breathe, Then Assess

When you inherit mystery infrastructure, the biggest threat isn’t what’s running — it’s what you don’t know is running.

What I did:

  • Opened the AWS console. Checked Cost Explorer to see where spend was spiking.

  • Pulled a full inventory with AWS Resource Explorer. Sorted by region, resource type, last modified.

  • Of course, no tags — so I exported all instance metadata to a spreadsheet for manual triage.

  • Verified whether there were AWS Config rules in place (there weren’t) — useful for drift tracking later.

Never touch before you map. If you have nothing, start a resource map in a simple sheet or diagram tool. Use AWS Config or Cloud Asset Inventory (GCP) for a baseline snapshot you can compare later.
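As a rough sketch of that first pass (the regions, dates, and file names here are placeholders), the AWS CLI can pull the same inventory and cost data without clicking through the console:

```shell
# Dump basic EC2 metadata per region for triage in a spreadsheet.
# Region list and output file names are placeholders: adapt to your account.
for region in us-east-1 eu-west-1; do
  aws ec2 describe-instances --region "$region" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,State.Name,LaunchTime,Tags]' \
    --output json > "inventory-$region.json"
done

# See where spend is going: last month's cost, grouped by service.
aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-01-31 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

The JSON files give you a baseline snapshot you can diff against later, even before AWS Config is in place.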

2. Infra-as-Code: The Double-Edged Sword

This project had a number of .tf files scattered across a single repo. The README? Just one line: “WIP”.
When I ran terraform plan, it wanted to destroy half the infra.

What I did:

  • Double-checked the backend — the state file was remote, but last modified weeks ago.

  • Ran terraform refresh to sync the local state with real cloud resources — crucial for drift.

  • Compared terraform state list vs. terraform show to see which resources existed but weren’t tracked.

  • For unmanaged resources, I used terraform import to bring them under control safely.

  • Scoped changes with terraform plan -target to limit blast radius.

Terraform can only manage what it knows about. Always confirm your remote state backend is working and up to date. For old or half-baked setups, refresh, import, and plan surgically before you touch apply.
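The sequence above looks roughly like this as commands (the resource address and bucket name are made-up examples; note that on Terraform 0.15.4+ the standalone refresh command is deprecated in favor of `-refresh-only`):

```shell
# 1. Sync state with reality. Newer Terraform prefers -refresh-only
#    over the legacy `terraform refresh`.
terraform apply -refresh-only

# 2. Compare what the state tracks against what actually exists.
terraform state list
terraform show

# 3. Bring a manually created resource under management.
#    (Address and bucket name below are illustrative placeholders.)
terraform import aws_s3_bucket.legacy_logs legacy-logs-bucket

# 4. Plan surgically: one resource at a time to limit blast radius.
terraform plan -target=aws_s3_bucket.legacy_logs
```

Only after the targeted plans stop showing surprises is it safe to run a full plan and apply.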

3. CI/CD Breakdowns

Once the IaC was stable enough, I triggered the CI/CD pipeline to test a basic redeploy.
It failed immediately — exit code 137. Classic sign of out-of-memory or forced kill.
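For context, 137 is 128 + 9: the process was killed with SIGKILL, which is exactly what the kernel's OOM killer sends. You can reproduce the exit code locally:

```shell
# Exit code 137 = 128 + 9 (SIGKILL), the signal the OOM killer uses.
sleep 30 &            # stand-in for a long-running build job
pid=$!
kill -9 "$pid"        # simulate the kernel killing it
wait "$pid"           # wait reports the job's exit status
echo "exit code: $?"  # 137
```

Seeing 137 in CI logs therefore means "something killed this process", and memory pressure on the runner is the usual suspect.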

What I did:

  • Checked pipeline logs. Found a stale job eating up the runner’s memory.

  • SSH’d into the persistent runner, cleared zombie containers, restarted the agent.

  • Switched the pipeline config to use ephemeral runners for future builds — autoscaling to match job demands.

  • Added clear memory/resource limits for build jobs to prevent the same choke again.


Stuck runners sink deploys. If possible, use autoscaling, short-lived runners. Persistent runners are fine for some setups — but only if they’re monitored and cleaned up automatically.
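If the runner happens to be Docker-based (an assumption; adapt to your executor), the cleanup step can be as simple as:

```shell
# Remove exited ("zombie") containers left behind by dead jobs.
docker ps -aq --filter status=exited | xargs -r docker rm

# Reclaim dangling images, networks, and build cache.
docker system prune -f

# Restart the agent. The service name is a placeholder for your
# runner's actual unit (gitlab-runner, actions-runner, etc.).
sudo systemctl restart gitlab-runner
```

Running something like this on a schedule buys time until you can migrate to ephemeral runners.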

4. The Kubernetes & Helm Jungle

The pipeline went green — but the deploy got stuck during rollout. kubectl rollout status showed no progress, no events.
The Helm values file? 900 lines, no comments, no structure.

What I did:

  • Ran helm diff upgrade to check what the deploy thought it was doing.

  • Validated that container images were actually published and not stale.

  • Cross-checked resource requests — a bad CPU limit can starve pods during rollout.

  • Triggered kubectl rollout undo for a fast rollback to the last working ReplicaSet.

  • Cleaned up the values file: modularized it, added comments, version-controlled it in an artifact registry for safe rollback.

Long, messy values files are a silent threat. Keep them clear and modular. Always test rollout status live, and keep a fallback image or chart version pinned in a registry.
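As a hedged sketch of that loop (the release name, chart path, and image reference are placeholders, and `helm diff` requires the helm-diff plugin to be installed):

```shell
# Preview exactly what the upgrade would change before touching the cluster.
helm diff upgrade my-app ./chart -f values.yaml

# Confirm the image tag actually exists in the registry before blaming the rollout.
docker manifest inspect registry.example.com/my-app:v1.2.3

# Watch the rollout live, with a timeout so a stall is obvious.
kubectl rollout status deployment/my-app --timeout=120s

# Fast rollback to the previous ReplicaSet if it hangs.
kubectl rollout undo deployment/my-app
```

The diff-first habit matters most: with a 900-line values file, it is the only way to know what a deploy actually intends to do.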

5. When Slack Goes Silent

While troubleshooting, I pinged the team twice — nobody answered. It was off-hours for some, out-of-scope for others.

What I did:

  • Kept a live log of every step in a dedicated incident channel.

  • Wrote a quick incident doc with: what broke, what commands I ran, where the rollback is.

  • Escalated to my manager to confirm scope: “I’m fixing X. If Y breaks, we have rollback.”

  • Scheduled a proper post-incident review once things were stable.

Silence is expensive. Always over-communicate in public channels. Spin up a temporary War Room channel for anything production-impacting — it keeps stakeholders aligned and leaves a searchable history for next time.

6. The Aftermath: Stabilize and Future-Proof

After 48 hours of patchwork fixes, infra was finally stable — for now. But stability is fragile if you don’t fix the root causes.

What I did:

  • Enforced tagging at the org level using AWS Config rules and Service Control Policies (SCPs).

  • Split the Terraform repo into proper modules with separate remote states.

  • Automated nightly drift detection and terraform plan checks in CI.

  • Required helm diff and kubectl dry-run as pipeline gates.

  • Documented a clear handover plan — no more “You manage it now” surprises.

Inherited chaos is a chance to raise the bar. Small hygiene improvements — tags, state management, pipeline gates — pay off every time someone new joins the team.

Key Takeaways

1. Map before you touch — snapshot, tag, diagram.
2. Treat IaC like code — refresh, import, plan small, apply smaller.
3. Don’t let CI runners rot — use ephemeral agents when possible.
4. Rollbacks must be fast — keep known-good charts and images versioned.
5. Communicate in daylight — incident channels, live logs, async docs.
6. Always leave infra better than you found it.


Additional Essentials for Inheriting and Stabilizing Cloud Infra

1. Take Backups & Snapshots

Why:
Before you touch anything risky — whether it’s Terraform, a database, or an instance — always snapshot what you might break.

Example:

  • Back up your remote state file to an S3 versioned bucket.

  • Create an EBS snapshot for critical volumes:

      aws ec2 create-snapshot --volume-id vol-0abc123
    
  • For AMIs:

      aws ec2 create-image --instance-id i-0abc123 --name "PreChangeBackup"
    

If your plan unexpectedly destroys resources, you can restore.

2. Keep Secrets Out of Configs

Why:
Long values.yaml or .tfvars files often hide plain-text passwords or API keys — an easy leak risk.

Quick Example:

  • Use Kubernetes Sealed Secrets, or AWS Secrets Manager:

      aws secretsmanager create-secret --name MyDBSecret --secret-string '{"username":"admin","password":"securepw"}'
    
  • Reference the secret in your deployment config:

      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: my-db-secret
              key: password
    

Keep secrets encrypted, managed, and out of version control.

3. Add Monitoring & Alerts Early

Why:
Fixes don’t last if you can’t see when things break again.

Example:

  • Enable basic AWS CloudWatch alarms for cost spikes, EC2 usage, or CPU limits.

  • Add drift detection to your CI:

      terraform plan -detailed-exitcode
    

    An exit code of 2 signals drift (0 means no changes; 1 means the plan itself failed).

  • Use Slack or email to notify on pipeline failures or stuck rollouts.

Observability stops you from repeating surprise incidents.
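The `-detailed-exitcode` flag actually distinguishes three outcomes: 0 means no changes, 2 means pending changes (drift), and 1 means the plan itself failed. A thin wrapper (`check_drift` is a hypothetical helper, not a real tool) can turn that into a CI signal:

```shell
# Interpret `terraform plan -detailed-exitcode`:
#   0 = state matches reality, 2 = drift detected, anything else = error.
# `check_drift` is a hypothetical wrapper; pass it the plan command to run.
check_drift() {
  "$@" > /dev/null 2>&1
  case $? in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;       # hook your Slack/email alert here
    *) echo "plan failed" >&2; return 1 ;;
  esac
}

# In a nightly CI job you would call:
#   check_drift terraform plan -detailed-exitcode
```

Treating exit code 2 differently from 1 matters: drift should page someone, while a broken plan should fail the pipeline outright.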

4. Tag Ownership Clearly

Why:
Chaos repeats when nobody knows who owns a resource.

Example:
Use tags like:

Owner: devops-team
Contact: devops@example.com
Environment: production

In AWS, enforce via Config Rules:

“Every EC2 instance must have Owner and Environment tags.”

When things break, you know who’s on-call.
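A minimal sketch of both halves (the instance ID, tag values, and rule name are placeholders):

```shell
# Tag a resource with owner and environment metadata.
aws ec2 create-tags \
  --resources i-0abc123 \
  --tags Key=Owner,Value=devops-team Key=Environment,Value=production

# Enforce it with the managed AWS Config rule REQUIRED_TAGS
# (rule name and tag keys below are illustrative).
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "require-owner-tag",
  "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
  "InputParameters": "{\"tag1Key\":\"Owner\",\"tag2Key\":\"Environment\"}"
}'
```

Once the rule is live, untagged resources show up as noncompliant in AWS Config instead of as mysteries in next month's bill.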

5. Automate Guardrails with Policies

Why:
Good habits slip over time — automation keeps them enforced.

Example:

  • Use AWS Control Tower to enforce SCPs like:

    “Block resources from launching in unapproved regions.”

  • Use OPA or Sentinel policies for Terraform:

      package terraform.guardrails

      # Deny any resource whose Environment tag is not "production".
      # (The input shape here is illustrative: it depends on how the
      # Terraform plan JSON is fed to OPA.)
      deny[msg] {
        input.resource.tags["Environment"] != "production"
        msg := "Non-production resource detected."
      }
    

Let your tools block bad infra before it hits production.

Conclusion

In the real world, you rarely get to choose the cloud infrastructure you inherit — but you always control how you stabilize, document, and improve it. By auditing what exists, backing up critical state, enforcing clear ownership, and setting automated guardrails, you turn chaos into order.

The infrastructure you clean up today will inevitably become someone else’s responsibility tomorrow. Leave clear runbooks, clear tags, and clear pipelines so the next engineer has a fighting chance to keep your systems resilient and your business running smoothly.

Cleaning up inherited infrastructure isn’t just about quick fixes — it’s about applying proven practices, tools, and frameworks that keep your cloud stable long after you’re gone. If you want to deepen your understanding and turn these lessons into habits, these trusted resources are worth exploring:

1. AWS Well-Architected Framework
A set of principles and checklists for designing, operating, and continuously improving secure, high-performing, resilient, and efficient cloud workloads. Use it to evaluate your current state and prioritize improvements.

2. Terraform Official Documentation
The definitive guide for everything Terraform: how to write modules, manage state safely, run drift detection, and enforce guardrails with policy as code. Perfect for turning fragile scripts into repeatable infrastructure.

3. Helm Documentation
The go-to reference for deploying and managing Kubernetes applications with Helm charts. Learn best practices for templating, values files, and version control to avoid messy rollouts.

4. Policy-as-Code Tools:
Consider tools like Open Policy Agent (OPA) or Sentinel (HashiCorp) to automate your infrastructure governance. They help enforce tagging, resource constraints, and secure configurations automatically.

5. Secrets Management Best Practices:
Explore services like AWS Secrets Manager or HashiCorp Vault to securely store and rotate credentials, API keys, and certificates — and keep secrets out of your codebase for good.

Key Terms & Definitions

| Term | Definition | What It’s Useful For |
| --- | --- | --- |
| AWS Cost Explorer | AWS tool to visualize and analyze cloud spend across services, accounts, and tags. | Identifying where money is spent and spotting unused or forgotten resources. |
| AWS Config | AWS service that records configuration changes and resource relationships over time. | Tracking infra drift, enforcing compliance rules, auditing changes. |
| Service Control Policies (SCPs) | AWS Organizations feature to set permission guardrails across accounts. | Enforcing rules like mandatory tagging or region restrictions at org level. |
| Resource Explorer | AWS console feature for searching and locating cloud resources across regions and services. | Getting an inventory of all running infra when documentation is missing. |
| Tags | Metadata key-value pairs attached to cloud resources (e.g., Environment: Production). | Organizing, managing costs, ownership tracking, automation. |
| Infrastructure as Code (IaC) | Practice of managing cloud infrastructure through code and version control (e.g., Terraform). | Repeatable, automated infra provisioning with clear history and rollback. |
| Terraform | Popular open-source tool for building, changing, and versioning infrastructure safely and efficiently. | Defining cloud resources in code, managing state, and applying changes consistently. |
| Terraform State File | File that stores the current known state of infrastructure resources managed by Terraform. | Keeping Terraform aware of what it controls to plan safe updates and prevent accidental destruction. |
| terraform plan | Terraform command that shows what changes will be made before they’re applied. | Safe preview of what will change — to catch mistakes before you run apply. |
| terraform refresh | Command to update Terraform’s state file with the real-world cloud state. | Syncing local state with actual cloud resources to avoid drift and surprises. |
| terraform import | Command to add existing cloud resources into Terraform’s state so they can be managed as code. | Bringing manually created resources under code control safely. |
| Ephemeral Runner | CI/CD agent that is created for a single job and destroyed afterwards. | Preventing stale state, zombie processes, and resource leaks in build pipelines. |
| Persistent Runner | Long-lived CI/CD agent that handles multiple jobs over time. | Useful for stable workloads, but needs careful monitoring to avoid stale state and failures. |
| Exit Code 137 | Unix/Linux exit status (128 + 9, SIGKILL) indicating a process was killed, often due to out-of-memory or forced termination. | Debugging why pipelines or jobs fail unexpectedly. |
| CI/CD Pipeline | Continuous Integration / Continuous Deployment pipeline automating build, test, and deploy tasks. | Ensures consistent, repeatable software delivery with automated checks and deploys. |
| Helm | Package manager for Kubernetes that helps deploy and manage applications using charts. | Simplifying complex Kubernetes deployments with templates and versioning. |
| Helm Chart | A Helm package containing Kubernetes resource templates and default configuration values. | Standardizing app deployments across environments. |
| Helm Values File | YAML file providing configuration values for a Helm chart’s templates. | Customizing how charts deploy resources for specific environments. |
| helm diff | Helm plugin that shows differences between a deployed release and an upgraded chart. | Validating what will change before applying updates to Kubernetes. |
| kubectl rollout status | Command to check the status of a Kubernetes deployment rollout. | Monitoring whether a deploy is progressing or stuck. |
| kubectl rollout undo | Command to revert a Kubernetes deployment to a previous stable version. | Rolling back quickly when a deploy breaks production. |
| Runbook | A documented step-by-step guide for operational tasks or incident response. | Keeping teams aligned during troubleshooting and handovers. |
| War Room / Incident Channel | Dedicated chat or video room for real-time collaboration during major incidents. | Centralizing communication, updates, logs, and decisions when fixing production issues. |
| Drift Detection | Process of finding differences between your infra-as-code definitions and actual running cloud resources. | Preventing surprises when running plan or apply, maintaining clean, predictable infra. |

Read these Articles too

  1. 10 Proven kubectl Commands: The Ultimate 2025 AWS Kubernetes Guide

  2. One Container per Pod: Kubernetes Done Right

  3. Why Kubernetes Cluster Autoscaler Fails? Fixes, Logs & YAML Inside

  4. Ansible Inventory Guide 2025

  5. DevOps without Observability

  6. How to debug crashing kubernetes pods

For more topics visit Medium, Dev.to, Red Signals, and Dubniumlabs.


Written by

Ismail Kovvuru

DevOps Engineer automating cloud infrastructure using AWS, Terraform, Docker & CI/CD. I share tutorials, real-world DevOps workflows & automation strategies that help teams ship faster and more reliably.