Common ECS Container & Image Mistakes That Cause Downtime

This is Part 2 of 5 in my series on keeping ECS deployments rock-solid — covering best practices, hidden pitfalls, and the sneaky issues that cause downtime.

One of the most frustrating ECS issues I’ve dealt with is the dreaded crash loop.
You deploy your service, the task starts… then stops… then starts again… over and over.

No traffic gets through, your ALB target group has no healthy targets, and ECS is burning through restarts like there’s no tomorrow.

Here are 3 common reasons I’ve seen ECS tasks get stuck in a crash loop — and how to fix them.

1️⃣ Bad or Missing Environment Variables

The Problem:
Containers often depend on environment variables for database connections, API keys, or service URLs.
If one is missing or incorrect, your app might fail immediately at startup.

Fix:

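Double-check the environment and secrets sections of each container definition in your ECS task definition.
Pull sensitive values from AWS Secrets Manager or SSM Parameter Store through the secrets section rather than hardcoding them, and confirm the task execution role is allowed to read them.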
Test locally with the same environment variables to confirm the app boots successfully.
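
A quick way to catch a missing variable before deploying is to compare the task definition against the list of variables your app needs. The sketch below is only an illustration: the function name, the sample container definition, and the variable names are all hypothetical. It assumes the container definition is a dict shaped like one entry of containerDefinitions, with the usual environment and secrets lists.

```python
def missing_env_vars(container_def, required):
    """Return required variable names the container definition does not set.

    Checks both the plain `environment` list and the `secrets` list.
    Empty plain values are treated as missing.
    """
    provided = set()
    for entry in container_def.get("environment", []):
        if entry.get("value"):  # "" counts as not set
            provided.add(entry["name"])
    for entry in container_def.get("secrets", []):
        if entry.get("valueFrom"):
            provided.add(entry["name"])
    return sorted(set(required) - provided)


# Hypothetical container definition, for illustration only.
container = {
    "name": "api",
    "environment": [{"name": "API_URL", "value": "https://example.internal"}],
    "secrets": [
        {
            "name": "DB_PASSWORD",
            "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/app/db_password",
        }
    ],
}

print(missing_env_vars(container, ["API_URL", "DB_PASSWORD", "DB_HOST"]))
# prints ['DB_HOST']
```

This only checks that a name is present. It cannot tell you whether the value is correct, so the local boot test is still worth doing.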

2️⃣ Application Port Mismatch

The Problem:
Your ECS task definition might expose port 8080, but your container is actually listening on 3000 (or vice versa).
The ALB can’t connect, the health check fails, and ECS stops the task and launches a replacement that fails the same way.

Fix:

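Make sure the containerPort in the ECS task definition matches the port your application actually listens on. EXPOSE in the Dockerfile is only documentation, so keep it in sync but don’t rely on it.
Make sure the ALB target group and its health check point at that same port.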
If you’re running multiple containers in a task, confirm they’re not fighting over the same port.
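
The port checks above can be sketched as a small script run against the task definition JSON. This is a rough illustration, not an official tool: find_port_problems and the sample data are hypothetical, and the listen_ports mapping is something you supply yourself, because ECS has no way of knowing which port your process binds.

```python
from collections import Counter


def find_port_problems(task_def, listen_ports):
    """Compare declared containerPort values with the ports each app binds.

    `listen_ports` maps container name -> port the process really listens on
    (an assumption supplied by the caller).
    """
    problems = []
    declared = []
    for container in task_def.get("containerDefinitions", []):
        name = container["name"]
        ports = [m["containerPort"] for m in container.get("portMappings", [])]
        declared.extend(ports)
        expected = listen_ports.get(name)
        if expected is not None and expected not in ports:
            problems.append(
                f"{name}: app listens on {expected}, task definition maps {ports}"
            )
    # Containers that share a network namespace (awsvpc or host mode)
    # cannot bind the same port twice.
    for port, count in Counter(declared).items():
        if count > 1:
            problems.append(f"port {port} is declared by {count} containers")
    return problems


# Hypothetical task definition, for illustration only.
task_def = {
    "containerDefinitions": [
        {"name": "web", "portMappings": [{"containerPort": 8080}]},
        {"name": "sidecar", "portMappings": [{"containerPort": 8080}]},
    ]
}

for problem in find_port_problems(task_def, {"web": 3000}):
    print(problem)
```

With the sample data it reports both the 8080 versus 3000 mismatch on the web container and the duplicate use of port 8080.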

3️⃣ Crash on Startup Due to Code Errors

The Problem:
If your application throws an unhandled error at boot (e.g., database unreachable, missing config file), ECS will mark the task as stopped almost instantly.
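
When a task stops this quickly, the stoppedReason and the container exit codes that ECS records are usually the fastest clue. Here is a minimal sketch that prints them, assuming a response dict shaped like what the DescribeTasks API returns (for example from boto3’s describe_tasks). The function name and the sample response are hypothetical.

```python
def summarize_stopped_tasks(response):
    """Return one line per task and per container explaining why it stopped."""
    lines = []
    for task in response.get("tasks", []):
        task_id = task.get("taskArn", "unknown").split("/")[-1]
        lines.append(f"{task_id}: {task.get('stoppedReason', 'no stoppedReason')}")
        for container in task.get("containers", []):
            lines.append(
                f"  {container.get('name')}: "
                f"exitCode={container.get('exitCode')} "
                f"reason={container.get('reason', '')}"
            )
    return lines


# Hypothetical response, trimmed to the fields used above.
sample = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc123",
            "stoppedReason": "Essential container in task exited",
            "containers": [{"name": "api", "exitCode": 1}],
        }
    ]
}

print("\n".join(summarize_stopped_tasks(sample)))
```

A non-zero exitCode points at the application itself, while a missing exitCode with a reason set usually means the container never started.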

Fix:

Use docker run locally with the exact same image you push to ECS.
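Check the container logs in CloudWatch Logs for error messages. With the awslogs driver, the log group is whatever awslogs-group is set to in the task definition (for example, /ecs/service-name).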
Add proper retry logic in your app for external dependencies so it can survive transient failures.
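
The retry advice can be sketched as a small helper with exponential backoff. This is one possible shape, not a prescribed implementation: the retry function, its defaults, and the exception types it retries on are assumptions you should adapt to your own client library.

```python
import time


def retry(operation, attempts=5, base_delay=0.5, max_delay=8.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Waits base_delay, then 2x, 4x, ... (capped at max_delay) between tries.
    Only exceptions listed in `retry_on` are retried; anything else
    propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the task fail visibly
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Usage with a stand-in for a real connection call (hypothetical).
calls = {"n": 0}

def connect_to_database():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database not ready yet")
    return "connected"

print(retry(connect_to_database, base_delay=0.01))
# prints connected
```

Keeping the attempt count bounded matters: if the dependency is really down, the task should still exit with a clear error rather than hang forever.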

💡 Bonus Tip:
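If you’re stuck, temporarily set the ECS service’s minimum healthy percent to 0.
This lets ECS stop every running task before starting replacements, which helps break out of a crash loop when all running tasks are already broken. Restore the original value afterwards, since a value of 0 allows a deployment to take the whole service down.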
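
For reference, here is roughly what that change looks like through the ECS UpdateService API. The sketch assumes you pass in a boto3 ECS client (boto3.client("ecs")). The function name, cluster name, and service name are hypothetical.

```python
def relax_min_healthy(ecs_client, cluster, service, maximum_percent=200):
    """Set minimumHealthyPercent to 0 on a service via UpdateService.

    `ecs_client` is expected to be a boto3 ECS client. Note the previous
    values before calling this so you can restore them afterwards.
    """
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration={
            "minimumHealthyPercent": 0,
            "maximumPercent": maximum_percent,
        },
    )


# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# relax_min_healthy(boto3.client("ecs"), "my-cluster", "my-service")
```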

Final Thought

Crash loops are rarely random.
They usually come down to bad configs, mismatched ports, or missing dependencies.
Get into the habit of checking environment variables, ports, and logs first — it’ll save you hours of guesswork.

Part 2: ECS Tasks Stuck in a Crash Loop? Here’s What to Check

Alamin Islam