When the Internet Dies During Terraform Apply: A Recovery Story


Picture this: You're deploying a production-ready EKS cluster for a high-security organization using Terraform. Everything is going smoothly. Your VPC is being created, subnets are spinning up, and then... your internet connection drops.
What happens next is a chain of Terraform state management problems that most infrastructure engineers will run into at some point.
The Perfect Storm
When network connectivity fails during a terraform apply, several things go wrong simultaneously:
State Lock Limbo: Terraform can't release the DynamoDB state lock, leaving it orphaned
State Upload Failure: Your S3 backend becomes unreachable, so state changes can't be persisted
Resource Creation Interruption: AWS resources may be partially created but not tracked properly
Local State Backup: Terraform saves an errored.tfstate file as a safety net
Here's what the error looked like:
Error: Failed to save state
│
│ Error saving state: failed to upload state: operation error S3: PutObject,
│ https response error StatusCode: 0, RequestID: , HostID: , request send
│ failed, Put
│ "https://prod-infra-tf-state-2025.s3.us-west-2.amazonaws.com/main-infra/terraform.tfstate?x-id=PutObject":
│ dial tcp: lookup
│ prod-infra-tf-state-2025.s3.us-west-2.amazonaws.com: no such host
The Recovery Process
The beauty of Terraform's design shines in moments like these. Here's the systematic recovery:
Step 1: Force Unlock the State
terraform force-unlock xxx8b0e7e1-4f7c-4cxx-9c2a-xxx71a6a9b1c
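The lock ID is conveniently displayed in Terraform's state lock error message. This releases the orphaned lock from DynamoDB. Make sure nobody else is running Terraform against the same state first, because force-unlock removes the lock regardless of who holds it.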
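Before forcing the unlock, it can help to look at the lock item itself. The following is a minimal sketch, assuming the AWS CLI is installed and that the S3 backend stores its lock in DynamoDB under a LockID of the form bucket/key with the lock details in an Info attribute. The function name show_lock_info and the table name used in the usage note are hypothetical; the bucket and key come from the error message above.

```shell
# Print the lock details (who, when, which operation) stored for a state file.
# Assumption: the lock table uses "LockID" as its key and "Info" for details.
show_lock_info() {
  # $1 = DynamoDB lock table, $2 = S3 bucket, $3 = state object key
  aws dynamodb get-item \
    --table-name "$1" \
    --key "{\"LockID\": {\"S\": \"$2/$3\"}}" \
    --query 'Item.Info.S' \
    --output text
}
```

For example, show_lock_info terraform-locks prod-infra-tf-state-2025 main-infra/terraform.tfstate (terraform-locks is a made-up table name). If the output shows your own user and the interrupted apply, the lock is most likely the orphaned one and force-unlock is reasonable.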
Step 2: Recover the Local State
terraform state push errored.tfstate
This pushes the locally saved state back to your S3 backend, so every resource Terraform had recorded before the connection dropped is tracked in the remote state again.
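Terraform runs its own safety checks on state push (it compares lineage and serial), but it can be reassuring to see the numbers yourself first. Here is a rough sketch, assuming the state files are in Terraform's usual pretty-printed JSON with a top-level "serial" field. The helper names state_serial and push_if_newer are hypothetical.

```shell
# Extract the serial number from a Terraform state file.
state_serial() {
  grep -m1 '^[[:space:]]*"serial":' "$1" | tr -dc '0-9'
}

# Push the local file only if its serial is higher than the remote one.
push_if_newer() {
  local local_file remote_file local_serial remote_serial
  local_file="${1:-errored.tfstate}"
  remote_file="$(mktemp)"
  # An empty remote state produces no output, which is treated as serial 0.
  terraform state pull > "$remote_file" || return 1
  local_serial="$(state_serial "$local_file")"
  remote_serial="$(state_serial "$remote_file")"
  rm -f "$remote_file"
  if [ "${local_serial:-0}" -gt "${remote_serial:-0}" ]; then
    terraform state push "$local_file"
  else
    echo "remote serial ${remote_serial:-0} >= local ${local_serial:-0}; not pushing" >&2
    return 2
  fi
}
```

Running push_if_newer errored.tfstate from the working directory does the comparison and the push in one go. If it refuses, compare the two files by hand before deciding which one to trust.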
Step 3: Assess and Continue
terraform plan
terraform apply
Check what actually got created and complete the deployment.
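One way to make the "assess" part explicit is the -detailed-exitcode flag on terraform plan, which exits with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. The sketch below wraps that; the function name check_drift and the plan file name recovery.tfplan are hypothetical.

```shell
# Run a plan and report whether the recovered state matches the configuration.
check_drift() {
  local rc=0
  terraform plan -detailed-exitcode -out=recovery.tfplan || rc=$?
  case "$rc" in
    0) echo "no changes: state matches configuration" ;;
    2) echo "changes pending: review, then run 'terraform apply recovery.tfplan'" ;;
    *) echo "plan failed: fix errors before applying" >&2; return 1 ;;
  esac
}
```

Keep in mind that a resource AWS created but Terraform never recorded will show up in the plan as something to add, and the apply may then fail with an "already exists" error. In that case terraform import is the usual way to bring it under management.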
Lessons Learned
Terraform's Resilience: The tool gracefully handles network failures by saving state locally
State Lock Transparency: Error messages provide all the information needed for recovery
Remote State Benefits: Using S3 + DynamoDB backends provides robust state management even during failures
Infrastructure as Code Reliability: Well-designed IaC can recover from unexpected interruptions
Prevention Tips
Use stable internet connections for critical deployments
Consider running Terraform from cloud instances with reliable connectivity
Monitor deployment progress and be prepared for recovery procedures
Always use remote state backends for production workloads
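A small preflight check before a critical apply is cheap insurance. The sketch below, assuming curl is available, only confirms that the state bucket's endpoint resolves and answers right now; it cannot protect against a connection that drops halfway through. The function name preflight_backend is hypothetical, and the host format follows the URL in the error message above.

```shell
# Check that the S3 state bucket endpoint is reachable before starting an apply.
preflight_backend() {
  # $1 = S3 bucket, $2 = AWS region
  local host
  host="$1.s3.$2.amazonaws.com"
  # Any HTTP response (even 403) means DNS and the network path are working.
  if curl --silent --head --max-time 5 --output /dev/null "https://$host"; then
    echo "backend reachable: $host"
  else
    echo "cannot reach $host; not starting apply" >&2
    return 1
  fi
}
```

It could be chained in front of the real command, for example preflight_backend prod-infra-tf-state-2025 us-west-2 && terraform apply.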
Conclusion
Network failures during infrastructure deployments are inevitable, but Terraform's state management system makes recovery straightforward. What could have been a disaster—losing track of partially created AWS resources—becomes a minor inconvenience with the right recovery steps.
The production EKS cluster deployment continued successfully after this hiccup, proving that robust tooling and systematic recovery processes are essential for production infrastructure management.
This incident occurred during the deployment of a production EKS cluster with VPC, managed node groups, and security configurations using Terraform modules.
Written by

Enoch
I have a passion for automating and optimizing cloud infrastructure. I have experience working with various cloud platforms, including AWS, Azure, and Google Cloud. My goal is to help companies achieve scalable, reliable, and secure cloud environments that drive business success.