Pods scaling issue - Resolved


I came across a strange issue where none of the pods in a particular namespace were running, while pods in other namespaces had no issues. Explicitly scaling the deployments was not successful either - no pods came up.
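For reference, an explicit scale attempt looks something like this (the deployment name and replica count here are placeholders, not the actual values from the incident):
kubectl scale deployment <deployment-name> -n <namespace> --replicas=3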
First, I wanted to find out what was happening with the scaling.
kubectl get hpa -n <namespace>
Once I had the list of HPAs, I checked the events for a single HPA:
kubectl describe hpa <hpa-name> -n <namespace>
This returned some helpful messages:
Warning FailedComputeMetricsReplicas 25m (x2129 over 11d) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Warning FailedGetResourceMetric 39s (x2228 over 11d) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
This shows that metrics collection is the issue: the HPA is unable to get CPU metrics from the resource metrics API and therefore cannot determine whether the pods should be scaled. The next step is to check the logs of the metrics server to find out what's happening.
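If you don't already know the metrics-server pod name, you can look it up first (this assumes the standard k8s-app=metrics-server label used by the upstream manifests):
kubectl get pods -n kube-system -l k8s-app=metrics-server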
kubectl logs -f metrics-server-5b4fc487-zt65w -n kube-system
E0318 07:53:06.478141 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
E0318 07:53:21.477955 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
E0318 07:53:36.477636 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
This suggests that data from the nodes might not be getting collected. The next thing to do is to check whether node metrics are available in the metrics server:
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
I could see data, but it could be cached and partial. So, let's verify whether any nodes are "Not Ready":
kubectl get nodes -o wide
All the nodes were in the Ready state. Next, verify that the kubelet API is reachable with the steps below.
i. Get into a pod
kubectl run -it --rm --image=busybox debug --restart=Never -- sh
ii. From within the pod, execute the command below. Note: the API endpoint was taken from the metrics-server logs above.
wget -qO- --timeout=5 https://10.0.213.232:10250/metrics/resource --no-check-certificate
This returned the message below:
/ # wget -qO- --timeout=5 https://10.0.213.232:10250/metrics/resource --no-check-certificate
wget: server returned error: HTTP/1.1 401 Unauthorized
This provided the clue needed: the kubelet endpoint is reachable, but the request is being rejected as unauthorized. Now I needed to find out what's happening in the metrics server configuration:
kubectl get deployment metrics-server -n kube-system -o yaml | grep args -A10
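Alternatively, to print just the args array (this assumes metrics-server is the first container in the pod spec):
kubectl get deployment metrics-server -n kube-system -o jsonpath='{.spec.template.spec.containers[0].args}'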
The Unauthorized error turned out to be caused by a recent LB change - some of the external LBs had been changed to ingress LBs. To overcome this, the --kubelet-insecure-tls flag needs to be added so that certificate checking is skipped.
Now, for the resolution, edit the metrics-server deployment and add the --kubelet-insecure-tls flag under the args section.
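One way to apply this non-interactively is with kubectl patch - a sketch, assuming metrics-server is the first container in the pod template and already has an args list (kubectl edit works just as well):
kubectl patch deployment metrics-server -n kube-system --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl rollout status deployment/metrics-server -n kube-system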
Once the new metrics-server pod is up, checking the events in the namespace should give a better idea:
kubectl get events -n <namespace> --sort-by=.lastTimestamp
This showed that the metrics-server issue had been resolved - no more errors.
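As an additional check (kubectl top is backed by the same metrics API), node metrics and the HPA targets should now be populated instead of showing unknown values:
kubectl top nodes
kubectl get hpa -n <namespace>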
When I checked the pods again, they had started scaling.