Azure AKS Troubleshooting Hands-On - Pod Failing Due to Insufficient Resources

Francisco Souza

📝Introduction

In this hands-on lab, we will walk through troubleshooting a real scenario in Azure Kubernetes Service (AKS) involving a common issue: a Pod failing to start due to insufficient resources.

Learning objectives:

In this module, you'll learn how to:

  • Identify the issue

  • Resolve the issue

📝Log in to the Azure Management Console

Log in with your credentials and make sure you're using the right Region. In my case, I am using the region uksouth in my Cloud Playground Sandbox.

📌Note: You can also run the Azure CLI from VS Code or from your local terminal.

More information on how to set it up is at the link.

📝Prerequisites:

  • Update to PowerShell 5.1, if needed.

  • Install .NET Framework 4.7.2 or later.

  • Visual Studio Code

  • Web Browser (Chrome, Edge)

  • Azure CLI installed

  • Azure subscription

  • Docker installed

📝Setting Up an Azure Storage Account to Load Bash or PowerShell

  • Click the Cloud Shell icon (>_) at the top of the page.

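  • Click Bash or PowerShell. Note that the variable assignments used later in this lab (for example, CLUSTERNAME=<AKSClusterName>) follow Bash syntax.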

  • Click Show Advanced Settings. Use the combo box under Cloud Shell region to select the Region. Under Resource Group and Storage account, enter a name for each (the storage account name must be globally unique). In the box under File Share, enter a name. Click Create storage (if you don't have one yet).

📝Create an AKS Cluster

  1. Create an AKS cluster using the az aks create command. Before running it, store the name of the cluster in a variable named CLUSTERNAME and the name of the resource group in a variable named RG, which the command also references.

      CLUSTERNAME=<AKSClusterName>
      RG=<ResourceGroupName>
      az aks create -n "$CLUSTERNAME" -g "$RG" --node-vm-size Standard_D2s_v3 --node-count 2 --generate-ssh-keys
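
The az aks create command above expects the resource group in $RG to exist already. If it does not, a minimal sketch of creating it could look like the following. The group name rg-aks-lab and the helper name ensure_resource_group are illustrative assumptions, and uksouth is simply the region used in this lab. In a sandbox with a pre-provisioned resource group, setting RG to that group's name is enough.

```shell
# Assumed values: rg-aks-lab is a placeholder name, uksouth is the lab region.
# An RG value that is already set in the shell is kept as-is.
RG="${RG:-rg-aks-lab}"
LOCATION="${LOCATION:-uksouth}"

# Create the resource group only if it does not exist yet.
ensure_resource_group() {
  if [ "$(az group exists --name "$RG")" = "true" ]; then
    echo "Resource group $RG already exists"
  else
    az group create --name "$RG" --location "$LOCATION"
  fi
}
```

Calling ensure_resource_group before az aks create avoids a "resource group not found" failure on a fresh subscription.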
    

📝 Connect to AKS Cluster

Use the Azure Cloud Shell to check your AKS Cluster resources, by following the steps below:

  1. Go to the Azure Dashboard and click on the Resource Group created for this Lab, then look for your AKS Cluster resource.

  2. On the Overview tab, click on Connect to your AKS Cluster.

  3. A new window will open. From there, open the Azure CLI and run the following commands:

az login
az account set --subscription <your-subscription-id>
az aks get-credentials -g <nameResourceGroup> -n <nameAKSCluster> --overwrite-existing

After that, you can run some kubectl commands to check the default AKS Cluster resources.
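
As one possible set of such checks, the sketch below prints the active kubectl context, the nodes, and the system Pods that AKS runs by default. The function name check_cluster_access is only an illustration.

```shell
# Confirm kubectl points at the lab cluster and that the nodes are reachable.
check_cluster_access() {
  echo "Current context: $(kubectl config current-context)"
  # Nodes should report STATUS Ready before any workload is deployed.
  kubectl get nodes -o wide
  # System Pods give a quick view of what runs on the cluster by default.
  kubectl get pods -n kube-system
}
```

Run check_cluster_access right after az aks get-credentials. If the context name does not match your cluster, the credentials step likely targeted a different cluster or subscription.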

📝Deploy the Application to AKS

  1. Simulate the Issue:

    Deploy a Sample Application: Create a deployment YAML file (nginx-deployment.yaml) with resource requests that exceed the available resources on the node:

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: nginx-deployment
     spec:
       replicas: 1
       selector:
         matchLabels:
           app: nginx
       template:
         metadata:
           labels:
             app: nginx
         spec:
           containers:
           - name: nginx
             image: nginx:latest
             resources:
               requests:
                 memory: "2Gi"
                 cpu: "2"
               limits:
                 memory: "2Gi"
                 cpu: "2"
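
To see why this manifest cannot be scheduled, it helps to compare CPU quantities in millicores. The sketch below assumes a node allocatable CPU of 1900m, a value often seen on 2-vCPU AKS nodes because part of the CPU is reserved for system components. Treat that figure as an assumption and confirm it on your own nodes. The helper names to_millicores and cpu_fits are illustrative.

```shell
# Convert a Kubernetes CPU quantity ("2", "0.5", "1900m") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                        # already millicores
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;  # whole or fractional cores
  esac
}

# Succeed when the request (first argument) fits within the allocatable amount.
cpu_fits() {
  [ "$(to_millicores "$1")" -le "$(to_millicores "$2")" ]
}

REQUEST="2"          # cpu request from nginx-deployment.yaml
ALLOCATABLE="1900m"  # assumed node allocatable CPU; check your own nodes

if cpu_fits "$REQUEST" "$ALLOCATABLE"; then
  echo "request fits on a node"
else
  echo "request exceeds node allocatable CPU"
fi
```

Under that assumption the request of 2 CPUs (2000m) is larger than what any single node can offer, so the scheduler has nowhere to place the Pod.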
    
  2. Apply the Deployment:

     kubectl apply -f nginx-deployment.yaml
    

  3. Identify the Issue:

    • Check Pod Status:

        kubectl get pods

    • Describe the Pod:

        kubectl describe pod <pod-name>

Look for events indicating why the Pod is not starting. The Pod typically stays in the Pending state, and the Events section shows a FailedScheduling warning with a message such as “Insufficient cpu” or “Insufficient memory”.
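
When several Pods are involved, the same information can be gathered in one pass. This is a minimal sketch using kubectl field selectors, and the function name show_unschedulable is an illustration.

```shell
# List Pods stuck in Pending, then the scheduler events that explain why.
show_unschedulable() {
  kubectl get pods --field-selector=status.phase=Pending
  # FailedScheduling events carry messages such as "Insufficient cpu".
  kubectl get events --field-selector=reason=FailedScheduling --sort-by=.lastTimestamp
}
```

Both commands act on the current namespace. Adding -n <namespace> or --all-namespaces widens the search.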

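  4. Troubleshoot the Issue:

    • Check Node Resources:

        kubectl top nodes

This shows the current CPU and memory usage on each node. Keep in mind that the scheduler places Pods by comparing their resource requests with each node's allocatable capacity, not with live usage, so a lightly loaded node can still reject a Pod with a large request.

    • Check Resource Quotas (if any):

        kubectl get resourcequotas

    • Check Cluster Autoscaler: Ensure the cluster autoscaler is enabled and configured correctly:

        az aks show -g <nameResourceGroup> -n <nameAKSCluster> --query "agentPoolProfiles[].enableAutoScaling"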
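
Because scheduling depends on allocatable capacity and on what existing Pods have already requested, it can be useful to look at those figures directly. The sketch below is one way to do that, and the function names are illustrative.

```shell
# Print each node's allocatable CPU and memory, the amounts the scheduler
# compares against Pod requests.
show_node_allocatable() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Show how much of each node is already claimed by existing Pod requests.
show_node_allocations() {
  kubectl describe nodes | grep -A 8 'Allocated resources'
}
```

If the allocatable CPU minus the already allocated requests is smaller than the Pod's request on every node, the Pod stays Pending no matter how idle the nodes look in kubectl top nodes.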
    

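  5. Resolve the Issue:

    • Scale Up the Cluster: If the cluster autoscaler is not enabled, or cannot add enough capacity, manually scaling the cluster may resolve the issue. Note that adding nodes only helps when a single node can fit the Pod: a request of 2 full CPUs is more than a 2-vCPU node such as Standard_D2s_v3 can allocate, because part of each node's CPU is reserved for system components.

        az aks scale -g <nameResourceGroup> -n <nameAKSCluster> --node-count <new-node-count>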
      
    • Adjust Resource Requests: Modify the deployment YAML file to request fewer resources:

        resources:
          requests:
            memory: "1Gi"
            cpu: "1"
          limits:
            memory: "1Gi"
            cpu: "1"
      
    • Reapply the Deployment:

        kubectl apply -f nginx-deployment.yaml
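
After reapplying, it can be handy to wait for the rollout instead of polling by hand. This is a minimal sketch assuming the Deployment name nginx-deployment from the manifest above. The 120-second timeout and the function name wait_for_rollout are arbitrary choices.

```shell
# Wait for the nginx Deployment to finish rolling out after the re-apply.
wait_for_rollout() {
  if kubectl rollout status deployment/nginx-deployment --timeout=120s; then
    echo "rollout complete"
  else
    echo "rollout did not finish; inspect the Pod events again" >&2
    return 1
  fi
}
```

A non-zero return here usually means the new Pod is still Pending, in which case the describe output is the next place to look.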
      
  6. Verify the Resolution:

  • Check Pod Status Again:

      kubectl get pods

  • Describe the Pod:

      kubectl describe pod <pod-name>
    

Ensure there are no error messages and the pod is running.

  • Check Node Resources:

      kubectl top nodes
    

Verify that the nodes have sufficient resources and the pod is running smoothly.

📌Note - At the end of each hands-on Lab, always clean up all resources previously created to avoid being charged.
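
One way to do that cleanup from the CLI is to delete the lab resource group, which also removes everything inside it, including the AKS cluster. The sketch below assumes RG still holds the resource group name used when creating the cluster and that nothing else you need lives in that group. The function name cleanup_lab is illustrative. In a sandbox where you cannot delete the group, deleting only the cluster with az aks delete is an alternative.

```shell
# Delete the lab resource group and everything inside it.
# RG is assumed to hold the resource group name used for the cluster.
cleanup_lab() {
  if [ -z "${RG:-}" ]; then
    echo "RG is not set; refusing to delete anything" >&2
    return 1
  fi
  # --yes skips the confirmation prompt; --no-wait returns immediately.
  az group delete --name "$RG" --yes --no-wait
}
```

The guard on an empty RG is there so that a fresh shell session does not issue a delete with a blank name.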

Congratulations! You have completed this hands-on lab on troubleshooting an AKS Pod that fails to start due to insufficient resources.

Thank you for reading. I hope you understood and learned something helpful from my blog.

Please follow me on Cloud&DevOpsLearn and LinkedIn, franciscojblsouza

Written by

Francisco Souza

I have over 20 years of experience in IT Infrastructure and currently work at Microsoft as an Azure Kubernetes Support Engineer, where I support and manage the AKS, ACI, ACR, and ARO tools. Previously, I worked as a Fault Management Cloud Engineer at Nokia for 2.9 years, with expertise in OpenStack, Linux, Zabbix, Commvault, and other tools. In this role, I resolved critical technical incidents, ensured consistent uptime, and safeguarded against revenue loss from customers. Additionally, I briefly served as a Technical Team Lead for 3 months, where I distributed tasks, mentored a new team member, and managed technical requests and activities raised by our customers. Previously, I worked as an IT System Administrator at BN Paribas Cardif Portugal and other significant companies in Brazil, including an affiliate of Rede Globo Television (Rede Bahia) and Petrobras SA. In these roles, I developed a robust skill set, acquired the ability to adapt to new processes, demonstrated excellent problem-solving and analytical skills, and managed ticket systems to enhance the customer service experience. My ability to thrive in high-pressure environments and meet tight deadlines is a testament to my organizational and proactive approach. By collaborating with colleagues and other teams, I ensure robust support and incident management, contributing to the consistent satisfaction of my customers and the reliability of the entire IT Infrastructure.