When the Internet Dies During Terraform Apply: A Recovery Story

Enoch

Picture this: You're deploying a production-ready EKS cluster for a high-security organization using Terraform. Everything is going smoothly. Your VPC is being created, subnets are spinning up, and then... your internet connection drops.

What happens next is a perfect storm of Terraform state management issues that every infrastructure engineer will face at some point.

The Perfect Storm

When network connectivity fails during a terraform apply, several things go wrong simultaneously:

  1. State Lock Limbo: Terraform can't release the DynamoDB state lock, leaving it orphaned

  2. State Upload Failure: Your S3 backend becomes unreachable, so state changes can't be persisted

  3. Resource Creation Interruption: AWS resources may be partially created but not tracked properly

  4. Local State Backup: Terraform saves an errored.tfstate file as a safety net
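Before touching anything else, it can help to confirm that the safety-net file exists and to keep a copy of it. Here is a minimal sketch, assuming errored.tfstate sits in the working directory. The helper name backup_errored_state and the .bak suffix are hypothetical, not something Terraform provides:

```shell
# Hypothetical helper: copy errored.tfstate to a timestamped backup.
# Prints the backup path on success; returns 1 if the file is missing or empty.
backup_errored_state() {
  local dir="${1:-.}"
  local src="$dir/errored.tfstate"
  [ -s "$src" ] || return 1
  local dest="$src.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$src" "$dest" && echo "$dest"
}
```

Calling backup_errored_state with no argument checks the current directory. A non-zero return means there is no local state to recover from.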

Here's what the error looked like:

Error: Failed to save state

│ 
│ Error saving state: failed to upload state: operation error S3: PutObject,
│ https response error StatusCode: 0, RequestID: , HostID: , request send
│ failed, Put
│ "https://prod-infra-tf-state-2025.s3.us-west-2.amazonaws.com/main-infra/terraform.tfstate?x-id=PutObject":
│ dial tcp: lookup
│ prod-infra-tf-state-2025.s3.us-west-2.amazonaws.com: no such host
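If you capture Terraform's output to a log, a quick check can tell a connectivity failure like this one apart from, say, a permissions error. This is only a sketch: is_network_failure is a hypothetical helper, and the patterns are the strings from the error above rather than a complete list of network errors:

```shell
# Hypothetical helper: read Terraform output on stdin and succeed if it
# contains one of the network-failure strings seen in the error above.
is_network_failure() {
  grep -Eq 'no such host|dial tcp|request send failed'
}

# Usage (apply.log is an assumed file name):
#   is_network_failure < apply.log && echo "connectivity problem, follow the recovery steps"
```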

The Recovery Process

The beauty of Terraform's design shines in moments like these. Here's the systematic recovery:

Step 1: Force Unlock the State

terraform force-unlock xxx8b0e7e1-4f7c-4cxx-9c2a-xxx71a6a9b1c

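The lock ID appears in the "Error acquiring the state lock" message that Terraform prints the next time you run a command against the locked state. Force-unlocking releases the orphaned lock from DynamoDB. Only do this once you are sure no other apply is still running against the same state.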
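If the lock error was saved to a file, the ID can be extracted instead of copied by hand. Here is a sketch, assuming the message contains a "Lock Info:" block with an "ID:" line, which is how the lock error is typically laid out. extract_lock_id and lock-error.log are hypothetical names:

```shell
# Hypothetical helper: read a lock error on stdin and print the lock ID.
# Leading spaces or box-drawing characters before "ID:" are skipped.
extract_lock_id() {
  sed -n 's/^[^A-Za-z0-9]*ID:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1
}

# Usage:
#   terraform force-unlock "$(extract_lock_id < lock-error.log)"
```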

Step 2: Recover the Local State

terraform state push errored.tfstate

This pushes the locally saved state to your S3 backend, so every resource Terraform had recorded before the connection dropped is tracked again.
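terraform state push refuses to overwrite remote state that has a different lineage or a higher serial than the file being pushed. You can check both by hand first. The sketch below assumes the state files are pretty-printed the way Terraform normally writes them (one top-level key per line, two-space indent). state_field, safe_to_push and remote.tfstate are hypothetical names:

```shell
# Print a top-level field ("serial" or "lineage") from a pretty-printed state
# file. Assumes Terraform's usual layout: one key per line, two-space indent.
state_field() {
  local key="$1" file="$2"
  sed -n "s/^  \"$key\": *\"\{0,1\}\([^\",]*\)\"\{0,1\},\{0,1\}\$/\1/p" "$file" | head -n 1
}

# Succeed only if the lineages match and the local serial is not older
# than the remote one.
safe_to_push() {
  local local_state="$1" remote_state="$2"
  local l_lineage r_lineage l_serial r_serial
  l_lineage=$(state_field lineage "$local_state")
  r_lineage=$(state_field lineage "$remote_state")
  l_serial=$(state_field serial "$local_state")
  r_serial=$(state_field serial "$remote_state")

  if [ -z "$l_serial" ] || [ -z "$r_serial" ]; then
    echo "could not read serial from one of the files" >&2
    return 1
  fi
  if [ "$l_lineage" != "$r_lineage" ]; then
    echo "lineage differs: $l_lineage vs $r_lineage" >&2
    return 1
  fi
  if [ "$l_serial" -lt "$r_serial" ]; then
    echo "remote serial $r_serial is newer than local $l_serial" >&2
    return 1
  fi
}

# Usage:
#   terraform state pull > remote.tfstate
#   safe_to_push errored.tfstate remote.tfstate && terraform state push errored.tfstate
```

If the check fails, stop and work out why the two files diverged before pushing anything.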

Step 3: Assess and Continue

terraform plan
terraform apply

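The plan compares the recovered state against what actually exists in AWS and shows what is still missing. Review it, then apply to complete the deployment. If a resource was created in AWS but never made it into state, the apply may fail with an "already exists" error, and that resource needs to be imported or cleaned up first.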
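For a scripted recovery, terraform plan accepts a -detailed-exitcode flag: exit code 0 means no changes, 1 means the plan failed, and 2 means changes are pending. Here is a small sketch of how that could be interpreted. describe_plan_exit and recovery.tfplan are hypothetical names:

```shell
# Hypothetical helper: translate the exit code of
# `terraform plan -detailed-exitcode` into a short label.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;       # state already matches the configuration
    2) echo "changes-pending" ;;  # something still needs to be created or changed
    *) echo "error" ;;            # exit code 1 (or anything unexpected)
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -out=recovery.tfplan
#   describe_plan_exit "$?"
```

A result of changes-pending after an interrupted apply is expected: it is the remainder of the deployment.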

Lessons Learned

  1. Terraform's Resilience: The tool gracefully handles network failures by saving state locally

  2. State Lock Transparency: Error messages provide all the information needed for recovery

  3. Remote State Benefits: Using S3 + DynamoDB backends provides robust state management even during failures

  4. Infrastructure as Code Reliability: Well-designed IaC can recover from unexpected interruptions

Prevention Tips

  • Use stable internet connections for critical deployments

  • Consider running Terraform from cloud instances with reliable connectivity

  • Monitor deployment progress and be prepared for recovery procedures

  • Always use remote state backends for production workloads
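One cheap habit along these lines is a pre-flight check that the state backend is reachable before starting a long apply. It cannot prevent a drop halfway through, but it catches a connection that is already flaky. This is a sketch only: backend_resolves is a hypothetical helper, and the default resolver, getent hosts, is an assumption about the machine it runs on:

```shell
# Hypothetical pre-flight check: succeed only if the backend hostname resolves.
# Usage: backend_resolves HOST [RESOLVER_COMMAND...]
# The resolver defaults to `getent hosts`; pass another command
# (for example `nslookup`) on systems where getent is not available.
backend_resolves() {
  [ "$#" -ge 1 ] || return 2
  local host="$1"
  shift
  if [ "$#" -eq 0 ]; then
    set -- getent hosts
  fi
  "$@" "$host" >/dev/null 2>&1
}

# Usage:
#   backend_resolves prod-infra-tf-state-2025.s3.us-west-2.amazonaws.com && terraform apply
```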

Conclusion

Network failures during infrastructure deployments are inevitable, but Terraform's state management system makes recovery straightforward. What could have been a disaster—losing track of partially created AWS resources—becomes a minor inconvenience with the right recovery steps.

The production EKS cluster deployment continued successfully after this hiccup, proving that robust tooling and systematic recovery processes are essential for production infrastructure management.


This incident occurred during the deployment of a production EKS cluster with VPC, managed node groups, and security configurations using Terraform modules.



Written by

Enoch

I have a passion for automating and optimizing cloud infrastructure. I have experience working with various cloud platforms, including AWS, Azure, and Google Cloud. My goal is to help companies achieve scalable, reliable, and secure cloud environments that drive business success.