Azure AKS Troubleshooting Hands-On - Connectivity Issues Between Pods and Services Within the Same Cluster

Francisco Souza
8 min read

📝Introduction

In this hands-on lab, we will walk through troubleshooting a real-world scenario in Azure Kubernetes Service (AKS) for a common issue: connectivity problems between Pods and Services within the same cluster.

Learning objectives:

In this module, you'll learn how to:

  • Identify the issue

  • Resolve the issue

📝Log in to the Azure Management Console

Log in with your credentials and make sure you're working in the right region. In my case, I am using the uksouth region in my Cloud Playground Sandbox.

📌Note: You can also connect to the Azure CLI from VS Code or from your local terminal.

More information on how to set it up is available in the Azure CLI documentation.
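
If you go the local-terminal route, here is a minimal sketch of signing in and confirming the subscription you'll work in (the subscription ID is a placeholder):

az login
az account set --subscription <subscriptionID>
az account show --query "{name:name, id:id}" --output table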

📝Prerequisites:

  • Update to PowerShell 5.1, if needed.

  • Install .NET Framework 4.7.2 or later.

  • Visual Studio Code

  • Web Browser (Chrome, Edge)

  • Azure CLI installed

  • Azure subscription

  • Docker installed

📝Setting Up an Azure Storage Account to Launch Bash or PowerShell in Cloud Shell

  • Click the Cloud Shell icon (>_) at the top of the page.

  • Click PowerShell.

  • Click Show Advanced Settings. Use the combo box under Cloud Shell region to select the region. Under Resource Group and Storage account (the storage account name must be globally unique), enter a name for each. In the box under File Share, enter a name. Click Create storage (if you don't have storage configured yet).
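
If you prefer to script this step instead of clicking through the portal, a rough CLI equivalent is sketched below; the resource group, storage account, and file share names are placeholders, and the storage account name must be globally unique:

az group create --name <cloudshell-rg> --location uksouth
az storage account create --name <uniquestorageacct> --resource-group <cloudshell-rg> --location uksouth --sku Standard_LRS
az storage share-rm create --storage-account <uniquestorageacct> --resource-group <cloudshell-rg> --name <fileshare-name>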

📝Default settings for this Lab

  1. First, register the following resource providers:
az login 
az account set --subscription <subscriptionID> 
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Compute 
az provider register --namespace Microsoft.Network 
az provider register --namespace Microsoft.Monitor 
az provider register --namespace Microsoft.ManagedIdentity 
az provider register --namespace Microsoft.OperationalInsights 
az provider register --namespace Microsoft.OperationsManagement 
az provider register --namespace Microsoft.KeyVault 
az provider register --namespace Microsoft.ContainerService 
az provider register --namespace Microsoft.Kubernetes
  2. Let's check that all the providers are registered:

     az provider list --query "[?registrationState=='Registered'].namespace"
    
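You can also watch an individual provider while it registers, for example:

     az provider show --namespace Microsoft.ContainerService --query registrationState --output tsv
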

📝Create log analytics workspace

Save its resource ID to use in az aks create, but first define some variables.

$resource_group="akslabs"         
$aks_name="akslabs"            
$location="uksouth"                
$aks_vnet="$aks_name-vnet"        
$aks_subnet="$aks_name-subnet"    
$log_analytics_workspace_name="$aks_name-log"
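
This lab assumes the akslabs resource group already exists. If your sandbox doesn't provide one, here is a quick sketch to create it before continuing:

az group create --name $resource_group --location $location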

az monitor log-analytics workspace create `
    --resource-group $resource_group `
    --workspace-name $log_analytics_workspace_name `
    --location $location --no-wait

$log_analytics_workspace_resource_id=az monitor log-analytics workspace show `
    --resource-group $resource_group `
    --workspace-name $log_analytics_workspace_name `
    --query id --output tsv

echo $log_analytics_workspace_resource_id

📝Create custom AKS VNet and Subnet

Save its resource ID to use in az aks create.

az network vnet create `
  --resource-group $resource_group `
  --name $aks_vnet `
  --address-prefixes 10.224.0.0/12 `
  --subnet-name $aks_subnet `
  --subnet-prefixes 10.224.0.0/16

$vnet_subnet_id=az network vnet subnet show `
    --resource-group $resource_group `
    --vnet-name $aks_vnet  `
    --name $aks_subnet `
    --query id --output tsv

echo $vnet_subnet_id

📝 Create an AKS Cluster Using the Custom VNet Subnet

  1. Create an AKS cluster using the az aks create command:

     az aks create --resource-group $resource_group --name $aks_name --node-count 1 --node-vm-size standard_ds2_v2 --network-plugin kubenet --vnet-subnet-id $vnet_subnet_id --enable-managed-identity --network-policy calico --enable-addons monitoring --generate-ssh-keys --workspace-resource-id $log_analytics_workspace_resource_id
    
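Cluster creation takes several minutes. One simple way to confirm it finished is shown below (a verification sketch, not an original lab step):

     az aks show --resource-group $resource_group --name $aks_name --query provisioningState --output tsv
     # Expect: Succeeded
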

📝Connect to AKS Cluster

Use the Azure Portal to check your AKS Cluster resources, by following the steps below:

  1. Go to the Azure Dashboard, click on the Resource Group created for this lab, and locate your AKS cluster resource.

  2. On the Overview tab, click Connect to connect to your AKS cluster.

  3. A new pane will open; open the Azure CLI (Cloud Shell) and run the following commands:

az login
az account set --subscription <your-subscription-id>
az aks get-credentials -g $resource_group -n $aks_name --overwrite-existing

After that, you can run some Kubectl commands to check the default AKS Cluster resources.
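
For example, a few quick checks against the new cluster might look like this:

kubectl get nodes -o wide
kubectl get pods -A
kubectl get svc -A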

📝Switch to a new namespace

Create a new namespace and switch your context to it.
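
Throughout the rest of this lab, k is used as a shorthand for kubectl. If your shell doesn't define it yet, here is a minimal setup (PowerShell shown; in Bash use alias k=kubectl):

Set-Alias -Name k -Value kubectl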

k create ns <name>
k config set-context --current --namespace=<name>
# Verify current namespace
k config view --minify --output 'jsonpath={..namespace}' 
# Confirm ability to view 
k get pods -A

📝Enable AKS Diagnostics logging using CLI

$aks_resource_id=az aks show `
    --resource-group $resource_group `
    --name $aks_name `
    --query id `
    --output tsv

echo $aks_resource_id

az monitor diagnostic-settings create `
    --name aks_diagnostics `
    --resource $aks_resource_id `
    --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-controller-manager","enabled":true},{"category":"kube-scheduler","enabled":true},{"category":"kube-audit","enabled":true},{"category":"cloud-controller-manager","enabled":true},{"category":"cluster-autoscaler","enabled":true},{"category":"kube-audit-admin","enabled":true}]' `
    --metrics '[{"category":"AllMetrics","enabled":true}]' `
    --workspace $log_analytics_workspace_resource_id
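
Once logs start flowing (this can take several minutes), you can sanity-check them with a Log Analytics query. Here is a sketch, assuming the default AzureDiagnostics destination table is used for these resource logs:

$workspace_guid=az monitor log-analytics workspace show `
    --resource-group $resource_group `
    --workspace-name $log_analytics_workspace_name `
    --query customerId --output tsv

az monitor log-analytics query `
    --workspace $workspace_guid `
    --analytics-query "AzureDiagnostics | where Category == 'kube-audit-admin' | take 5"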

📝Create Deployments and Services:

  1. Create two deployments and respective services:

     k create deployment <app-name-1> --image=nginx
     k create deployment <app-name-2> --image=nginx
    
     k expose deployment <app-name-1> --name <app-name-1>-svc --port=80 --target-port=80 --type=ClusterIP
     k expose deployment <app-name-2> --name <app-name-2>-svc --port=80 --target-port=80 --type=ClusterIP
    

Confirm that the deployments and services are functional: Pods should be running and services listening on port 80.

  2. Verify that we can access both services (ClusterIP) from within the cluster:

     # Services returned
     k get svc
    
     # Get the value of running Pods
     k get pods
    
     # We should see the NGINX HTML page from each Pod
     k exec -it <app-name-1-pod> -- curl <app-name-1>-svc:80
     k exec -it <app-name-2-pod> -- curl <app-name-2>-svc:80
    
     # Check endpoints for the services
     k get ep
    

  3. Back up the existing deployment and service:

    Let's back up the deployment and service associated with the second app (nginx-2 in my case).

     k get deployment.apps/<app-name-2> -o yaml > <app-name-2>-dep.yaml
     k get service/<app-name-2-svc> -o yaml > <app-name-2-svc>.yaml
    

📝Simulate the Issue:

Let's delete the previously created resources and recreate them using the following YAML file to simulate a connectivity issue between Pods and Services within the same cluster. I named this file downservice.yaml.

k delete -f <app-name-2>-dep.yaml

# downservice.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app: <app-name-2>
  name: <app-name-2>
  namespace: <name>
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: <app-name-02>
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: <app-name-02>
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Apply the downservice.yaml deployment file and confirm all Pods are running using the commands:

k apply -f downservice.yaml
k get all

📝Troubleshoot the issue:

  1. Check the health of the nodes in the cluster to see if there is a node issue:

     k get nodes
    
  2. Verify that we can no longer access nginx-2-svc from within the cluster:

     k exec -it <app-name-1-pod> -- curl <app-name-2>-svc:80 
     # Failed to connect to nginx-2-svc port 80 after 1 ms: Couldn't connect to server
     # command terminated with exit code 7
    

  3. Verify that you can access <app-name-1>-svc from within the cluster:

     k exec -it <app-name-1-pod> -- curl <app-name-1>-svc:80
     # displays HTML page
    

  4. Verify that the Endpoints line up with their Services. There should be at least one Pod endpoint associated with <app-name-1>-svc, but none with <app-name-2>-svc.

     k get ep
    

  5. Check the label selector used by the Service experiencing the issue, and ensure that it matches the label selector used by its corresponding Deployment.

     k describe service <service-name>
     k describe deployment <deployment_name>
     k get svc
     k get deployment
    

  6. Using the Service label selector from Step 5, check that the Pods selected by the Service match the Pods created by the Deployment.

    If no results are returned, there is a label selector mismatch. In our case, the selector used by the Deployment returns Pods, but the selector used by the Service does not (see the selector comparison sketch after this list).

      kubectl get pods --selector=<selector_used_by_service>
    

  7. Check the Service and Pod logs and confirm whether HTTP traffic is seen. Compare the nginx-1 Pod and Service logs with nginx-2; the nginx-2 logs should show no incoming traffic.

     k logs deployment/<app-name-2>
     k logs deployment/<app-name-1>

     k logs svc/<app-name-2>-svc
     k logs svc/<app-name-1>-svc
    
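A quick way to compare the two selectors side by side is sketched below (using jsonpath; substitute your own names):

k get deployment <app-name-2> -o jsonpath='{.spec.selector.matchLabels}{"\n"}'
k get service <app-name-2>-svc -o jsonpath='{.spec.selector}{"\n"}'

# The deployment's selector (app=<app-name-02>) returns the running Pod,
# while the service's selector (app=<app-name-2>) returns nothing
k get pods -l app=<app-name-02>
k get pods -l app=<app-name-2>
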

📝Solving the Issue:

  1. Check the label selector from the Service associated with the issue and get associated pods:

     k describe service <app-name-2>-svc
     k describe pods -l app=<app-name-2>
    
  2. Update the deployment and apply the changes (a patch-based alternative is sketched after step 3):

     k delete -f <app-name-2>-dep.yaml
     # In downservice.yaml, update spec.selector.matchLabels and spec.template.metadata.labels from app: <app-name-02> to app: <app-name-2>
     k apply -f downservice.yaml 
     k describe pods -l app=<app-name-2>
     k get ep
    

  3. Verify that we can now access the newly created service from within the cluster:

     # Should return HTML page from nginx-2-svc
     kubectl exec -it <app-name-1-pod> -- curl <app-name-2>-svc:80
    
     # Confirm from logs
     k logs deployment/<app-name-2>
    
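As an alternative to deleting and reapplying the deployment, you could instead patch the Service selector to match the Pods' actual label. This is only a sketch of the idea; the lab's intended fix is correcting the deployment labels as above:

kubectl patch service <app-name-2>-svc -p '{"spec":{"selector":{"app":"<app-name-02>"}}}'
kubectl get ep <app-name-2>-svc
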

📝Extra Troubleshooting Using Custom Domain Names

Currently, Services in our namespace resolve using <service-name>.<namespace>.svc.cluster.local.

k exec -it <app-name-1-pod> -- curl <app-name-2>-svc.test.svc.cluster.local
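
You can also confirm in-cluster name resolution with a throwaway DNS test pod (a hypothetical helper, not one of the lab resources):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup <app-name-2>-svc.test.svc.cluster.local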

  1. Let's create a custom YAML file called downservice2.yaml to break DNS resolution:

     apiVersion: v1
     kind: ConfigMap
     metadata:
       name: coredns-custom
       namespace: kube-system
     data:
       internal-custom.override: | # any key name with the .override extension
         rewrite stop {
           name regex (.*)\.svc\.cluster\.local {1}.bad.cluster.local.
           answer name (.*)\.bad\.cluster\.local {1}.svc.cluster.local.
         }
    
  2. Apply the downservice2.yaml file and restart CoreDNS:

     k apply -f downservice2.yaml
     k delete pods -l=k8s-app=kube-dns -n kube-system
     # Monitor to ensure pods are running
     k get pods -l=k8s-app=kube-dns -n kube-system
    

  3. Validate DNS resolution; it should now fail:

     k exec -it <app-name-1-pod> -- curl <app-name-2>-svc.test.svc.cluster.local
    

  4. Check the DNS-related ConfigMaps in the kube-system namespace, as shown below.

     k get cm -n kube-system | grep dns
    

  5. Describe each of the ones found above and look for inconsistencies:

     k describe cm coredns -n kube-system 
     k describe cm coredns-autoscaler -n kube-system 
     k describe cm coredns-custom -n kube-system
    

  6. To solve the issue, since the custom DNS ConfigMap holds the bad rewrite rules, either edit coredns-custom (removing the data section) or delete the ConfigMap, then restart CoreDNS.

     k delete cm coredns-custom -n kube-system
     kubectl delete pods -l=k8s-app=kube-dns -n kube-system
     # Monitor to ensure pods are running
     kubectl get pods -l=k8s-app=kube-dns -n kube-system
    

  7. Confirm DNS resolution now works as before.

     k exec -it <app-name-1-pod> -- curl <app-name-2>-svc.test.svc.cluster.local
    

📌Note - At the end of each hands-on Lab, always clean up all resources previously created to avoid being charged.
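
For example, deleting the lab resource group removes everything created here; double-check you are targeting the right group before running it:

az group delete --name $resource_group --yes --no-wait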

Congratulations, you have completed this hands-on lab covering the basics of troubleshooting AKS connectivity issues between Pods and Services within the same cluster.

Thank you for reading. I hope you understood and learned something helpful from my blog.

Please follow me on Cloud&DevOpsLearn and LinkedIn, franciscojblsouza

