Kubernetes Troubleshooting #3: Kubernetes CronJob That Floods Your Cluster


🚨 Fixing a Kubernetes CronJob That Floods Your Cluster: Lessons from a Real Incident
"A single CronJob brought our microservices to a halt. Here's what happened, how we found the issue, and how we fixed it."
🧩 The Problem
We recently encountered a critical issue in our EKS cluster managed via ArgoCD. One of our Kubernetes CronJobs started spawning multiple pods continuously. Before we noticed, all nodes were filled with CronJob pods, leaving no space for our regular microservices.
Microservices started failing to schedule, and users began reporting degraded performance. The root cause? A runaway CronJob.
⚠️ Symptoms We Observed
CronJob was creating pods every minute or even faster.
Nodes were filled to capacity — CPU/memory exhausted.
Microservices (Deployments/StatefulSets) were stuck in Pending state.
Cluster was slowing down; the node autoscaler couldn’t keep up.
ArgoCD showed consistent green (it wasn’t aware of the runtime chaos).
🧠 Common Causes of CronJob Pod Flooding
If you’ve seen this happen, you’re not alone. These are common scenarios that can cause this issue:
Job Duration > Schedule Interval:
- If a CronJob runs every minute but each run takes 5 minutes to complete, pods stack up faster than they finish (see the sketch after this list).
Missed Schedules and Backlog:
- Cluster downtime or CronJob controller issues can result in missed schedules. On recovery, Kubernetes tries to make up for missed schedules if startingDeadlineSeconds isn't set.
No concurrencyPolicy:
- The default policy (Allow) lets multiple jobs run concurrently. If previous pods don’t finish, new ones still start.
No Resource Limits:
- Pods consume unbounded resources, choking out other workloads.
Job Failures with Retry Backoff:
- Failed jobs retrying with exponential backoff may flood the system.
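To make these pitfalls concrete, here is a minimal sketch of a CronJob spec that combines several of them: a per-minute schedule, the default Allow concurrency policy, no startingDeadlineSeconds, and no resource limits. This is not our actual manifest; the name and image are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-generator          # hypothetical name
spec:
  schedule: "* * * * *"           # fires every minute
  # concurrencyPolicy defaults to Allow, so overlapping runs pile up
  # startingDeadlineSeconds is unset, so missed schedules can be made up after downtime
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: my-image     # hypothetical image
              # no resources.requests/limits, so pods can consume entire nodes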
🔍 How We Identified the Root Cause
We used the following tools and steps:
1. Checked the CronJob Behavior
kubectl get cronjob <job-name> -n <namespace> -o yaml
We noticed:
schedule: "* * * * *"
No concurrencyPolicy defined
No startingDeadlineSeconds set
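A quick way to confirm these fields without scanning the full YAML is jsonpath (a sketch; empty output means the field is unset):
kubectl get cronjob <job-name> -n <namespace> \
  -o jsonpath='{.spec.schedule}{"\n"}{.spec.concurrencyPolicy}{"\n"}{.spec.startingDeadlineSeconds}{"\n"}'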
2. Observed Pods
kubectl get pods -n <namespace> --selector=job-name=<job-name>
Hundreds of pods were stuck in Running or Pending state.
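To gauge the scale of the flood, a rough count of the pods (and of how many are stuck Pending) helps. A sketch using the same selector:
kubectl get pods -n <namespace> --selector=job-name=<job-name> --no-headers | wc -l
kubectl get pods -n <namespace> --selector=job-name=<job-name> \
  --field-selector=status.phase=Pending --no-headers | wc -l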
3. Analyzed Logs
kubectl logs <cronjob-pod> -n <namespace>
Found the jobs were taking longer than the schedule interval to complete, or failing silently.
4. Checked Node Resources
kubectl describe node <node-name>
Revealed that all CPU/memory was consumed by these pods.
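If metrics-server is installed, kubectl top gives a faster cluster-wide view than describing nodes one by one (a sketch):
kubectl top nodes
kubectl describe node <node-name> | grep -A 7 "Allocated resources"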
✅ How We Fixed It
🔒 Step 1: Pause the CronJob
kubectl patch cronjob <job-name> -n <namespace> -p '{"spec": {"suspend": true}}'
This immediately stopped the creation of new pods.
🛠 Step 2: Clean Up Old Pods
kubectl delete jobs --selector=job-name=<job-name> -n <namespace>
kubectl delete pods --selector=job-name=<job-name> -n <namespace>
Freed up space on nodes.
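If the label selector doesn't match everything, Jobs created by a CronJob are named <job-name>-<id>, so they can also be cleaned up by name; deleting a Job cascades to its pods. A sketch (assumes GNU xargs):
kubectl get jobs -n <namespace> -o name | grep "<job-name>-" | xargs -r kubectl delete -n <namespace>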
⚙️ Step 3: Update CronJob Configuration
spec:
  schedule: "*/5 * * * *"          # Less frequent
  concurrencyPolicy: Forbid        # Prevents overlapping runs
  startingDeadlineSeconds: 60      # Prevents old-job flooding after downtime
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 300   # Ensure a job doesn’t run forever
      template:
        spec:
          restartPolicy: Never     # Required for Job pods (Never or OnFailure)
          containers:
            - name: job
              image: my-image
              resources:
                limits:
                  cpu: "500m"
                  memory: "512Mi"
                requests:
                  cpu: "100m"
                  memory: "128Mi"
🚦 Step 4: Re-enable the CronJob After Fix
kubectl patch cronjob <job-name> -n <namespace> -p '{"spec": {"suspend": false}}'
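After unsuspending, it's worth confirming that only one Job appears per schedule interval, for example:
kubectl get cronjob <job-name> -n <namespace>
kubectl get jobs -n <namespace> --watch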
🔒 Proactive Measures to Avoid This in the Future
✅ Always set concurrencyPolicy: Forbid or Replace
✅ Use startingDeadlineSeconds to limit backlogged job execution
✅ Set resource limits and requests
✅ Set activeDeadlineSeconds to prevent runaway jobs
✅ Enable alerts in Prometheus/Grafana (see the example rule below) for:
Pod count spikes
Node resource saturation
Job duration anomalies
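As an example of the pod-count alert, here is a sketch of a Prometheus alerting rule, assuming kube-state-metrics is being scraped; the namespace, threshold, and rule names are placeholders:
groups:
  - name: cronjob-flood
    rules:
      - alert: CronJobPodFlood
        expr: count(kube_pod_info{namespace="<namespace>", created_by_kind="Job"}) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusually high number of Job-owned pods in <namespace> (possible CronJob flood)"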
📘 Final Thoughts
ArgoCD may show “green” for a CronJob because it doesn't monitor runtime behavior. Always complement GitOps with runtime observability tools like Prometheus, Loki, and Grafana. This issue taught us the importance of:
Setting sane defaults for jobs
Monitoring job behavior post-deployment
Preparing for edge cases like backlog and concurrency
💬 Have you faced a similar issue? How did you fix it? Let’s connect in the comments!
🛡️ Stay vigilant, and may your CronJobs be lean and reliable!