Azure AKS Troubleshooting Hands-On - Pod Failing Due to Insufficient Resources
📝Introduction
In this hands-on lab, we will troubleshoot a real scenario in Azure Kubernetes Service (AKS) involving a common issue: a Pod failing to start due to insufficient resources.
Learning objectives:
In this module, you'll learn how to:
Identify the issue
Resolve the issue
📝Log in to the Azure Management Console
Log in with your credentials and make sure you're using the right region. In my case, I am using the region uksouth in my Cloud Playground Sandbox.
📌Note: You can also connect with the Azure CLI from VS Code or from your local terminal.
More information on how to set it up is at the link.
📝Prerequisites:
Update to PowerShell 5.1, if needed.
Install .NET Framework 4.7.2 or later.
Visual Studio Code
Web Browser (Chrome, Edge)
Azure CLI installed
Azure subscription
Docker installed
📝Setting an Azure Storage Account to Load Bash or PowerShell
- Click the Cloud Shell icon (>_) at the top of the page.
- Click PowerShell.
- Click Show Advanced Settings. Use the combo box under Cloud Shell region to select the Region. Under Resource Group and Storage account (the storage account name must be globally unique), enter a name for each. In the box under File Share, enter a name. Click Create storage (if you don't have one yet).
📝Create an AKS Cluster
Before creating the cluster, store the resource group name and the cluster name in variables named RG and CLUSTERNAME. Then create the AKS cluster using the az aks create command (the variable syntax below is Bash):
RG=<ResourceGroupName>
CLUSTERNAME=<AKSClusterName>
az aks create -n $CLUSTERNAME -g $RG --node-vm-size Standard_D2s_v3 --node-count 2 --generate-ssh-keys
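The create command expects the resource group to exist already. As a minimal sketch, assuming you want to create the group in the same step, the two calls can be wrapped in one function. The function name `create_lab_cluster` and the default location `uksouth` (the region used in this lab) are illustrative choices, not part of the original steps.

```shell
# Hypothetical helper: create the resource group, then the AKS cluster.
# Arguments: <resource-group> <cluster-name> [location]
create_lab_cluster() {
  rg="$1"
  cluster="$2"
  location="${3:-uksouth}"   # assumption: default to the lab's region

  # The resource group must exist before a cluster can be created in it.
  az group create --name "$rg" --location "$location" || return 1

  # Same options as the command used in this lab.
  az aks create -n "$cluster" -g "$rg" \
    --node-vm-size Standard_D2s_v3 --node-count 2 --generate-ssh-keys
}
```

After setting the variables, it could be called as `create_lab_cluster "$RG" "$CLUSTERNAME"`.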
📝 Connect to AKS Cluster
Use the Azure Cloud Shell to check your AKS Cluster resources, by following the steps below:
Go to Azure Dashboard, and click on the Resource Group created for this Lab, looking for your AKS Cluster resource.
On the Overview tab, click on Connect to your AKS Cluster.
A new window opens with the connection instructions. Open the Azure CLI and run the following commands:
az login
az account set --subscription <your-subscription-id>
az aks get-credentials -g <nameResourceGroup> -n <nameAKSCluster> --overwrite-existing
After that, you can run some kubectl commands to check the default AKS Cluster resources.
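As a starting point for those checks, the sketch below lists the nodes and the system pods that run in a new cluster. The function name `show_cluster_defaults` is a hypothetical helper introduced only for this example.

```shell
# Hypothetical helper: quick look at what a fresh cluster contains.
show_cluster_defaults() {
  # Nodes with status, Kubernetes version, and internal IP
  kubectl get nodes -o wide
  # System pods that run in the kube-system namespace
  kubectl get pods -n kube-system
}
```

Run `show_cluster_defaults` once `az aks get-credentials` has written the kubeconfig entry.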
📝Deploy the Application to AKS
Simulate the Issue:
Deploy a Sample Application: Create a deployment YAML file (nginx-deployment.yaml) with resource requests that exceed the available resources on the node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "2"
          limits:
            memory: "2Gi"
            cpu: "2"
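A Standard_D2s_v3 node has 2 vCPUs, and part of that capacity is reserved for system components, so a request of 2 full CPUs cannot fit on it. The sketch below shows the comparison the scheduler effectively makes. The helper names are hypothetical, the CPU helper handles only whole cores and millicore values, and the `1900m` figure in the usage note is an assumed allocatable value; read the real one from `kubectl describe node <node-name>`.

```shell
# Convert a CPU quantity such as "2" or "500m" to millicores.
# Assumption: whole cores or millicores only (no decimals like "0.5").
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity with a Ki, Mi, or Gi suffix to MiB.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Succeed when the requested CPU is no larger than the allocatable CPU.
# Arguments: <requested> <allocatable>
cpu_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}
```

With an assumed allocatable value of `1900m`, `cpu_fits 2 1900m` fails while `cpu_fits 1 1900m` succeeds, which matches the behaviour this lab demonstrates.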
Apply the Deployment:
kubectl apply -f nginx-deployment.yaml
Identify the Issue:
- Check Pod Status:
kubectl get pods
- Describe the Pod:
kubectl describe pod <pod-name>
The pod will typically stay in the Pending state. Look for events indicating why the pod is not starting: a FailedScheduling event with a message like “Insufficient cpu” or “Insufficient memory”.
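To find affected pods without describing them one by one, field selectors can narrow the output. This is a minimal sketch: the function names `show_unschedulable` and `only_insufficient` are hypothetical helpers, and the default namespace is an assumption based on the deployment in this lab.

```shell
# Hypothetical helper: list Pending pods and the scheduler events for them.
# Argument: [namespace] (defaults to "default")
show_unschedulable() {
  ns="${1:-default}"
  # Pods the scheduler has not been able to place yet
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  # Events emitted by the scheduler when placement fails
  kubectl get events -n "$ns" --field-selector=reason=FailedScheduling
}

# Hypothetical filter: keep only lines that mention insufficient resources.
# Reads from standard input.
only_insufficient() {
  grep -i "insufficient"
}
```

For a single pod, `kubectl describe pod <pod-name> | only_insufficient` reduces the output to the relevant event lines.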
Troubleshoot the Issue:
- Check Node Resources:
kubectl top nodes
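This shows the current CPU and memory usage on the nodes. The scheduler, however, places pods based on resource requests rather than live usage, so also compare the Allocatable and Allocated resources sections of kubectl describe node <node-name>.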
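Because scheduling decisions depend on allocatable capacity and existing requests, it can help to print those values directly. A minimal sketch follows: the function names are hypothetical, and the number of lines kept after the "Allocated resources" heading is an assumption that may need adjusting for your kubectl version.

```shell
# Hypothetical helper: allocatable CPU and memory for every node.
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Hypothetical helper: requests and limits already placed on one node.
# Argument: <node-name>
node_requests() {
  # Assumption: 8 lines after the heading cover the whole section.
  kubectl describe node "$1" | grep -A 8 "Allocated resources"
}
```

A pod can be scheduled on a node only if its requests fit into the allocatable amount minus what is already requested there.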
- Check Resource Quotas (if any):
kubectl get resourcequotas
- Check Cluster Autoscaler: Check whether the cluster autoscaler is enabled on the node pools:
az aks show -g <nameResourceGroup> -n <nameAKSCluster> --query "agentPoolProfiles[].enableAutoScaling"
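If the query returns false or null for the node pool, the autoscaler is not enabled. As a minimal sketch, assuming a cluster with a single node pool, it could be turned on as follows. The function name `enable_autoscaler` is hypothetical, and the minimum and maximum node counts are values you choose.

```shell
# Hypothetical helper: enable the cluster autoscaler on a single-pool cluster.
# Arguments: <resource-group> <cluster-name> <min-nodes> <max-nodes>
enable_autoscaler() {
  az aks update -g "$1" -n "$2" \
    --enable-cluster-autoscaler --min-count "$3" --max-count "$4"
}
```

For example, `enable_autoscaler <nameResourceGroup> <nameAKSCluster> 2 4` would let the pool grow from 2 to 4 nodes when pods cannot be scheduled.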
Resolve the Issue:
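Scale the Cluster: If the cluster autoscaler is not enabled or cannot add enough capacity, manually scaling out the cluster may be the solution:
az aks scale -g <nameResourceGroup> -n <nameAKSCluster> --node-count <new-node-count>
Keep in mind that adding nodes only helps when the existing nodes are full. If a single pod requests more than one node can allocate, as with the 2-CPU request on a 2-vCPU Standard_D2s_v3 node in this lab, more nodes of the same size will not make it schedulable.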
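When one pod needs more than a single node of the current size can offer, an alternative is a node pool with a larger VM size. This is a sketch under assumptions: the function name `add_bigger_pool`, the pool name, and the VM size in the usage note are illustrative and not part of the original lab.

```shell
# Hypothetical helper: add a node pool with a larger VM size.
# Arguments: <resource-group> <cluster-name> <pool-name> <vm-size>
add_bigger_pool() {
  az aks nodepool add \
    --resource-group "$1" --cluster-name "$2" \
    --name "$3" --node-vm-size "$4" --node-count 1
}
```

For example, `add_bigger_pool <nameResourceGroup> <nameAKSCluster> bigpool Standard_D4s_v3` adds one node with 4 vCPUs. Larger VM sizes cost more, so this trades budget for capacity.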
Adjust Resource Requests: Modify the deployment YAML file to request fewer resources:
resources:
  requests:
    memory: "1Gi"
    cpu: "1"
  limits:
    memory: "1Gi"
    cpu: "1"
Reapply the Deployment:
kubectl apply -f nginx-deployment.yaml
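As an alternative to editing the file, the same change can be applied to the live deployment with `kubectl set resources`. This is a minimal sketch: the function name `lower_requests` is hypothetical, and the values mirror the reduced requests used in this lab. Note that the YAML file then no longer matches the cluster until you update it as well.

```shell
# Hypothetical helper: lower requests and limits on a running deployment.
# Argument: <deployment-name>
lower_requests() {
  kubectl set resources deployment "$1" \
    --requests=cpu=1,memory=1Gi --limits=cpu=1,memory=1Gi
}
```

In this lab the call would be `lower_requests nginx-deployment`.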
Verify the Resolution:
Check Pod Status Again:
kubectl get pods
Describe the Pod:
kubectl describe pod <pod-name>
Ensure there are no error messages and the pod is running.
Check Node Resources:
kubectl top nodes
Verify that the nodes have sufficient resources and the pod is running smoothly.
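The verification can also be scripted so that it waits for the rollout instead of polling by hand. A minimal sketch, assuming the deployment name and the `app=nginx` label from this lab; the function name `verify_rollout` and the 120-second timeout are illustrative choices.

```shell
# Hypothetical helper: wait for the rollout, then show where the pods run.
# Argument: <deployment-name>
verify_rollout() {
  # Returns non-zero if the rollout does not finish within the timeout.
  kubectl rollout status "deployment/$1" --timeout=120s || return 1
  # Pods of the lab deployment with the node each one landed on
  kubectl get pods -l app=nginx -o wide
}
```

`verify_rollout nginx-deployment` returns a non-zero exit code if the pod is still not schedulable after the timeout.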
📌Note - At the end of each hands-on Lab, always clean up all resources previously created to avoid being charged.
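A minimal cleanup sketch, assuming everything for this lab lives in one resource group that you are free to delete. The function name `cleanup_lab` is hypothetical. Deleting the resource group removes the AKS cluster and every other resource inside it, so double-check the name first.

```shell
# Hypothetical helper: remove the lab deployment and the resource group.
# Argument: <resource-group>
cleanup_lab() {
  # Remove the deployment created from the lab's YAML file.
  kubectl delete -f nginx-deployment.yaml
  # Delete the resource group without prompting or waiting for completion.
  az group delete --name "$1" --yes --no-wait
}
```

In this lab the call would be `cleanup_lab "$RG"`.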
Congratulations! You have completed this hands-on lab covering the basics of troubleshooting an AKS Pod failing to start due to insufficient resources.
Thank you for reading. I hope you understood and learned something helpful from my blog.
Please follow me on Cloud&DevOpsLearn and LinkedIn, franciscojblsouza
Written by
Francisco Souza