Troubleshooting n8n on Kubernetes: Problems and Solutions Guide


This is Part 7 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Troubleshooting and Problem Resolution
Welcome to Part 7 of our n8n on AKS series! In Part 6, we implemented monitoring and optimization strategies. Even with the best preparation, issues can still arise, so today we'll walk through comprehensive troubleshooting techniques.
This section is a practical guide to diagnosing and resolving the problems you are most likely to encounter with an n8n deployment on AKS.
Common Issues and Resolutions
Database Connection Issues
Symptoms:
- n8n pods showing errors like `Error: connect ETIMEDOUT` or `Error: connect ECONNREFUSED`
- Database-related error messages in n8n logs
- n8n UI showing database connection errors
Diagnostic Approach:
Check PostgreSQL pod status:
kubectl get pods -n n8n -l app=postgres
Verify PostgreSQL service:
kubectl get svc postgres-service -n n8n
Check database logs:
kubectl logs $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n
Test database connection from n8n pod:
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  node -e "const { Pool } = require('pg'); const pool = new Pool({host: 'postgres-service', user: process.env.DB_POSTGRESDB_USER, password: process.env.DB_POSTGRESDB_PASSWORD, database: 'n8n'}); pool.query('SELECT NOW()', (err, res) => { console.log(err || res.rows[0]); pool.end(); })"
Common Resolutions:
Authentication Issues:
- Verify the PostgreSQL credentials in the Kubernetes secrets
- Ensure the n8n database user exists and has proper permissions (both checks are sketched below)
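A quick way to confirm what n8n is actually using is to decode the secret and compare it with the roles PostgreSQL knows about. The secret and key names below (`postgres-secret`, `POSTGRES_USER`, `POSTGRES_PASSWORD`) are assumptions; substitute whatever names your manifests use:
# Decode the stored credentials (secret/key names are assumptions; adjust to your manifests)
kubectl get secret postgres-secret -n n8n -o jsonpath='{.data.POSTGRES_USER}' | base64 -d; echo
kubectl get secret postgres-secret -n n8n -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d; echo
# Confirm the n8n database user exists and see its privileges
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -c "\du"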
Network Issues:
- Check if pods are in the same namespace
- Verify that the service name resolution works
- Ensure no network policies are blocking the connection
Database Health Issues:
- Check for PostgreSQL resource constraints
- Verify the database isn't in recovery mode
- Check for disk space issues (quick checks for both follow below)
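Two quick, low-risk checks for the last two items. The data directory path assumes the default PostgreSQL image layout:
# Is the database in recovery mode? (returns t or f)
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
# How much space is left on the data volume? (mount path assumes the default image layout)
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  df -h /var/lib/postgresql/data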
Queue/Redis Connection Issues
Symptoms:
- Workflows are triggered but stay in "waiting" status
- Error messages like `Error connecting to Redis` in n8n logs
- Workers not processing queued workflows
Diagnostic Approach:
Check Redis pod status:
kubectl get pods -n n8n -l app=redis
Verify Redis service:
kubectl get svc redis-service -n n8n
Check Redis logs:
kubectl logs $(kubectl get pod -l app=redis -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n
Test Redis connection from n8n pod:
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  node -e "const Redis = require('ioredis'); const redis = new Redis('redis-service'); redis.ping().then(res => { console.log(res); redis.disconnect(); })"
Common Resolutions:
Connection Configuration:
- Verify Redis host and port settings in n8n environment variables
- Check if the Redis service name is correctly specified
Queue Stuck Issues:
- Clear stuck queues with Redis CLI commands (see the inspection sketch below)
- Restart the Redis pod if necessary
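Before clearing anything, inspect the queue depth. n8n's queue mode uses Bull, which typically stores jobs under keys prefixed with `bull:jobs:`; that prefix is an assumption that can vary by version, so list the keys first:
# List queue keys (prefix is an assumption; KEYS is acceptable on a small dedicated instance)
kubectl exec -it $(kubectl get pod -l app=redis -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  redis-cli KEYS 'bull:*'
# Check how many jobs are waiting
kubectl exec -it $(kubectl get pod -l app=redis -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  redis-cli LLEN bull:jobs:wait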
Worker Configuration:
- Ensure workers are configured for queue mode
- Verify workers have the same encryption key as the main n8n instance (a comparison command follows)
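You can compare the relevant variables between the main instance and a worker. The `app=n8n-worker` label is an assumption based on a typical split deployment; adjust it to your manifests:
# Compare queue settings and encryption key between main and worker pods
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  env | grep -E 'EXECUTIONS_MODE|N8N_ENCRYPTION_KEY|QUEUE'
kubectl exec -it $(kubectl get pod -l app=n8n-worker -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  env | grep -E 'EXECUTIONS_MODE|N8N_ENCRYPTION_KEY|QUEUE'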
SSL/TLS Certificate Issues
Symptoms:
- Browser shows "Your connection is not private" warning
- Certificate errors in browser console
- Ingress controller logs showing certificate issues
Diagnostic Approach:
Check certificate status:
kubectl get certificate -n n8n
Examine certificate details:
kubectl describe certificate n8n-tls-secret -n n8n
Check cert-manager logs:
kubectl logs -n cert-manager -l app=cert-manager
Verify the ClusterIssuer status:
kubectl describe clusterissuer letsencrypt-prod
Common Resolutions:
Domain Validation Issues:
- Ensure DNS is correctly configured to point to your ingress controller IP
- Verify that the HTTP-01 challenge can reach your ingress controller
- Check if Let's Encrypt rate limits have been hit (for example, 5 duplicate certificates for the same exact set of domains per week)
Configuration Issues:
- Verify email address in ClusterIssuer is valid
- Ensure ingress class is correctly specified
- Check that the TLS section in the Ingress resource matches your domain (see the command below)
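To eyeball the TLS section quickly (the ingress name `n8n-ingress` is an assumption; use your resource's name):
# Show the TLS hosts and secret the Ingress references
kubectl get ingress n8n-ingress -n n8n -o jsonpath='{.spec.tls}'; echo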
Certificate Renewal Issues:
- Manually trigger certificate renewal if needed (see the sketch below)
- Check if cert-manager CRDs are up to date
- Verify cert-manager has necessary permissions
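One common way to force a re-issue is to delete the TLS secret so cert-manager recreates it; alternatively, the cmctl CLI can renew in place if you have it installed. Both commands assume the resource names used earlier in this series:
# Option 1: delete the secret; cert-manager will request a fresh certificate
kubectl delete secret n8n-tls-secret -n n8n
# Option 2: use the cert-manager CLI, if installed (takes the Certificate name)
cmctl renew n8n-tls-secret -n n8n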
Ingress and External Access Issues
Symptoms:
- Unable to access n8n UI from the internet
- 404, 502, or other HTTP errors when accessing your domain
- Timeouts when attempting to connect
Diagnostic Approach:
Check Ingress resource status:
kubectl get ingress -n n8n
Verify Ingress controller pods:
kubectl get pods -n default -l app.kubernetes.io/name=ingress-nginx
Check Ingress controller logs:
kubectl logs -n default -l app.kubernetes.io/name=ingress-nginx
Test connectivity to Ingress IP:
curl -v http://<ingress-ip>
Common Resolutions:
DNS Issues:
- Verify DNS A record points to the correct Ingress Controller IP
- Check if DNS propagation is complete (may take up to 48 hours)
- Use `nslookup` or `dig` to verify DNS resolution (see the example below)
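For example, with a placeholder domain (substitute your own):
# Both should return your ingress controller's external IP (domain is a placeholder)
dig +short n8n.example.com
nslookup n8n.example.com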
Ingress Configuration:
- Ensure the Ingress resource specifies the correct service and port
- Verify host rules match your domain exactly
- Check path settings and ensure they match n8n requirements
Network Issues:
- Verify Azure Network Security Groups allow traffic on ports 80 and 443
- Check if any firewalls are blocking access to your AKS cluster
- Ensure the Ingress Controller service is of type LoadBalancer with an external IP (verified below)
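For the last point, confirm an external IP has actually been assigned (namespace and labels match the earlier commands in this section):
# EXTERNAL-IP should show a public address, not <pending>
kubectl get svc -n default -l app.kubernetes.io/name=ingress-nginx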
Troubleshooting Decision Tree
The following diagram presents a structured approach to troubleshooting n8n deployment issues:
flowchart TD
start[Issue Detected] --> issue{What type of issue?}
issue -->|UI/Access| ui[UI or Access Issue]
issue -->|Workflow Execution| exec[Workflow Execution Issue]
issue -->|Infrastructure| infra[Infrastructure Issue]
ui --> uiDiag{UI Diagnostic}
uiDiag -->|Cannot reach site| dns[Check DNS & Ingress]
uiDiag -->|SSL Error| cert[Check Certificate]
uiDiag -->|Login Issue| auth[Check Authentication]
uiDiag -->|UI Loads but Errors| n8nui[Check n8n Logs]
dns --> dnsFix[DNS & Ingress Fixes]
cert --> certFix[Certificate Fixes]
auth --> authFix[Auth Fixes]
n8nui --> uiFix[UI Issue Fixes]
exec --> execDiag{Execution Diagnostic}
execDiag -->|Workflow Stuck in Queue| queue[Check Redis & Workers]
execDiag -->|Database Errors| db[Check PostgreSQL]
execDiag -->|Execution Failures| node[Check Node Errors]
queue --> queueFix[Queue & Worker Fixes]
db --> dbFix[Database Fixes]
node --> nodeFix[Node-specific Fixes]
infra --> infraDiag{Infrastructure Diagnostic}
infraDiag -->|Pod Issues| pod[Check Pod Status]
infraDiag -->|Resource Issues| res[Check Resource Usage]
infraDiag -->|Network Issues| net[Check Network Policy]
pod --> podFix[Pod Fixes]
res --> resFix[Resource Fixes]
net --> netFix[Network Fixes]
style start fill:#f96,stroke:#333
classDef fix fill:#9f6,stroke:#333
class dnsFix,certFix,authFix,uiFix,queueFix,dbFix,nodeFix,podFix,resFix,netFix fix
Advanced Diagnostic Workflows
Database Performance Issues
If workflow execution is slow or database operations are taking too long:
Check database load:
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
Identify long-running queries:
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
Check table sizes and index usage:
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -d n8n -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;"
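In most n8n deployments the execution history table dominates the database size. If that is the case for you, a cautious pruning sketch follows; the `execution_entity` table and `startedAt` column match current n8n schemas, but verify them against your version and take a backup first:
# Delete execution records older than 30 days (verify table/column names for your n8n version; back up first)
kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -d n8n -c "DELETE FROM execution_entity WHERE \"startedAt\" < now() - interval '30 days';"
n8n can also prune this data automatically via the `EXECUTIONS_DATA_PRUNE` and `EXECUTIONS_DATA_MAX_AGE` environment variables, which is the safer long-term option.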
Memory Leak Investigation
If n8n pods are steadily increasing in memory usage:
Get memory usage metrics:
kubectl top pods -n n8n
Check container memory limit and usage:
kubectl describe pod $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n
Generate a heap dump (for advanced debugging):
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  node --expose-gc -e "const fs=require('fs'); setTimeout(() => { global.gc(); const heapSnapshot = require('v8').getHeapSnapshot(); const file = fs.createWriteStream('/tmp/heap.json'); heapSnapshot.pipe(file); }, 1000);"
kubectl cp n8n/$(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}'):/tmp/heap.json ./heap.json
Analyze the heap dump with Chrome DevTools or a memory analyzer.
Network Connectivity Issues
If services can't communicate with each other:
Test network connectivity between pods:
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  nc -zv postgres-service 5432
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  nc -zv redis-service 6379
Check if network policies are restricting traffic:
kubectl get networkpolicies -n n8n
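If a policy looks suspect, one way to test is to temporarily apply a permissive policy and see whether connectivity returns. This is a diagnostic sketch only; remove it immediately after the test:
# Temporarily allow all ingress traffic within the namespace for testing
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: n8n
spec:
  podSelector: {}
  ingress:
  - {}
EOF
# ...rerun the nc tests above, then clean up:
kubectl delete networkpolicy debug-allow-all -n n8n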
Verify DNS resolution:
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  nslookup postgres-service
kubectl exec -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  nslookup redis-service
Creating a Diagnostic Information Bundle
For complex issues, it's often helpful to collect comprehensive diagnostic information:
#!/bin/bash
# collect-diagnostics.sh - Collect diagnostic information for n8n deployment
# Create output directory
mkdir -p n8n-diagnostics
cd n8n-diagnostics
# Collect pod information
kubectl get pods -n n8n -o yaml > pods.yaml
kubectl describe pods -n n8n > pods-describe.txt
# Collect logs
for pod in $(kubectl get pods -n n8n -o jsonpath='{.items[*].metadata.name}'); do
kubectl logs $pod -n n8n > $pod-logs.txt
done
# Collect service and endpoint information
kubectl get svc,endpoints -n n8n -o yaml > services.yaml
# Collect ingress and certificate information
kubectl get ingress,certificate -n n8n -o yaml > ingress-cert.yaml
kubectl describe ingress,certificate -n n8n > ingress-cert-describe.txt
# Collect events
kubectl get events -n n8n > events.txt
# Collect resource usage
kubectl top pods -n n8n > pod-resources.txt
kubectl top nodes > node-resources.txt
# Create a tar archive
tar -czf n8n-diagnostics.tar.gz *
echo "Diagnostic information collected in n8n-diagnostics.tar.gz"
Troubleshooting Cheatsheet
| Issue | Check Command | Resolution Strategy |
| --- | --- | --- |
| Pod won't start | `kubectl describe pod <pod-name> -n n8n` | Check events section for errors |
| Pod crashing | `kubectl logs <pod-name> -n n8n` | Look for error messages near the end |
| Service unavailable | `kubectl get endpoints <service-name> -n n8n` | Verify endpoints exist |
| Certificate issues | `kubectl describe certificate <cert-name> -n n8n` | Check events and conditions |
| Database connection | `kubectl exec -it <pod-name> -n n8n -- env \| grep DB_` | Verify environment variables |
| Redis connection | `kubectl exec -it <pod-name> -n n8n -- env \| grep REDIS` | Verify environment variables |
| Ingress not working | `kubectl get ingress <ingress-name> -n n8n` | Check ADDRESS field has an IP |
| Resource constraints | `kubectl top pods -n n8n` | Check for pods near resource limits |
| Webhook not triggering | `kubectl logs <pod-name> -n n8n \| grep webhook` | Verify webhook URL and connectivity |
Summary
Troubleshooting a production n8n deployment on AKS requires a systematic approach. By understanding:
- Common failure modes and their symptoms
- Diagnostic approaches for each component
- Resolution strategies for different issues
You can quickly identify and resolve problems, minimizing downtime and ensuring a reliable workflow automation platform.
Remember that many issues can be prevented through proper monitoring and proactive maintenance, as discussed in the previous section. When problems do occur, having these troubleshooting procedures documented will significantly reduce the mean time to resolution.
Conclusion
Armed with these troubleshooting strategies and diagnostic workflows, you can quickly identify and resolve issues in your n8n deployment. Remember that systematic investigation and a good understanding of the architecture are key to efficient problem resolution.
In our final article, we'll summarize what we've accomplished, review the benefits of our architecture, and explore advanced enhancements for the future. [Continue to Part 8: Conclusion and Next Steps]
What troubleshooting techniques have you found most effective for Kubernetes applications? Have you encountered any particularly challenging issues with workflow automation systems? Share your stories in the comments!