Pods scaling issue - Resolved


I came across a strange issue where none of the pods in a particular namespace were running, while pods in other namespaces had no issues. Explicitly scaling the deployments was not successful either - no pods came up.
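For reference, an explicit scale attempt looks something like this (the deployment name and replica count here are placeholders, not the actual values from the incident):
kubectl scale deployment <deployment-name> -n <namespace> --replicas=3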
First, I wanted to find out what was happening with the scaling.
kubectl get hpa -n <namespace>
Once I had the list of HPAs, I checked the events for a single HPA:
kubectl describe hpa <hpa-name> -n <namespace>
This returned some helpful messages:
Warning FailedComputeMetricsReplicas 25m (x2129 over 11d) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Warning FailedGetResourceMetric 39s (x2228 over 11d) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
This shows that metrics collection is the issue: the HPA is unable to get CPU metrics from the resource metrics API and therefore cannot determine whether the pods should be scaled. The next step is to check the logs of the metrics server to find out what's happening.
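If you don't already know the metrics-server pod name, you can look it up first (this assumes the standard k8s-app=metrics-server label used by the upstream manifests):
kubectl get pods -n kube-system -l k8s-app=metrics-server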
kubectl logs -f metrics-server-5b4fc487-zt65w -n kube-system
E0318 07:53:06.478141 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
E0318 07:53:21.477955 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
E0318 07:53:36.477636 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.213.232:10250/metrics/resource\": context deadline exceeded" node="ip-10-0-213-232.ec2.internal"
This suggests that data from the nodes might not be getting collected. The next thing to do is to check whether node metrics are available in the metrics server:
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
I could see data, but it could be cached and partial. So, let's verify whether any nodes are "Not Ready":
kubectl get nodes -o wide
All the nodes were in the Ready state. Next, verify that the kubelet API is reachable with the steps below.
i. Get into a pod
kubectl run -it --rm --image=busybox debug --restart=Never -- sh
ii. From within the pod, execute the command below. Note: the API endpoint was taken from the metrics-server logs above.
wget -qO- --timeout=5 https://10.0.213.232:10250/metrics/resource --no-check-certificate
This returned the message below:
/ # wget -qO- --timeout=5 https://10.0.213.232:10250/metrics/resource --no-check-certificate
wget: server returned error: HTTP/1.1 401 Unauthorized
This provided the clue needed: the kubelet endpoint is reachable, but the request is being rejected as unauthorized. Now I needed to find out what's happening in the metrics server configuration:
kubectl get deployment metrics-server -n kube-system -o yaml | grep args -A10
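Alternatively, to print just the args array (this assumes metrics-server is the first container in the pod spec):
kubectl get deployment metrics-server -n kube-system -o jsonpath='{.spec.template.spec.containers[0].args}'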
The Unauthorized error turned out to be caused by a recent LB change - some of the external LBs had been changed to ingress LBs. To overcome this, the --kubelet-insecure-tls flag needs to be added so that certificate checking is skipped.
Now, for the resolution, edit the metrics-server deployment and add the --kubelet-insecure-tls flag under the args section.
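One way to apply this non-interactively is with kubectl patch - a sketch, assuming metrics-server is the first container in the pod template and already has an args list (kubectl edit works just as well):
kubectl patch deployment metrics-server -n kube-system --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl rollout status deployment/metrics-server -n kube-system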
Once the new metrics-server pod is up, checking the events in the namespace should give a better idea:
kubectl get events -n <namespace> --sort-by=.lastTimestamp
This showed that the metrics-server issue had been resolved - no more errors.
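As an additional check (kubectl top is backed by the same metrics API), node metrics and the HPA targets should now be populated instead of showing unknown values:
kubectl top nodes
kubectl get hpa -n <namespace>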
When I checked the pods again, they had started scaling.