Ain't No Way That's Taint: Resolving PodAffinity Issues in Kubernetes
While working with taints and tolerations in Kubernetes, I ran into an interesting scenario. I had a cluster with four control-plane nodes, all but one of which ran an etcd pod. My goal was to deploy a log-collection service on each of those control-plane nodes to extract logs from the etcd pods across the cluster.
Additionally, I wanted to ensure that each log-collector pod was scheduled on a distinct control-plane node. In other words, I didn't want all the log-collectors to get scheduled on the same control-plane node.
Initial Setup
What we Had
- 4 control-plane nodes, with 3 of them running etcd.
- 3 etcd pods, each running on a distinct control-plane node.
What we Wanted
- Deploy a log-collector pod on each control-plane node that runs etcd, to collect logs from the etcd pods.
The basic idea was to use nodeSelector, podAffinity, podAntiAffinity, and taints and tolerations in conjunction. Let's break each of them down.
nodeSelector
nodeSelector is used to keep the log-collector off the worker nodes and restrict it to control-plane nodes. All control-plane nodes were labeled with type=master:
kubectl label node <control-plane-node-name> type=master
You can always verify the node labels with the following command:
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
Now, leveraging this new label we can target only the control-plane nodes to schedule our log-collector pod on.
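As a minimal sketch, the corresponding stanza in the pod spec looks like this (using the label applied above):

```yaml
nodeSelector:
  type: master   # only nodes carrying this label are considered
```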
podAffinity
To ensure that the pod is only scheduled on a control-plane node where etcd is running, we apply a podAffinity rule. First, label all etcd pods, for example with app=etcd:
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - etcd
    topologyKey: "kubernetes.io/hostname"
podAntiAffinity
Using podAntiAffinity, we can ensure that log-collector pods are not scheduled on a node that already has a log-collector pod running. This guarantees at most one log-collector per eligible control-plane node.
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - myapp
    topologyKey: "kubernetes.io/hostname"
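To see how the three rules combine, here is a hypothetical Python sketch (not the real kube-scheduler, just the filtering logic) that picks the nodes where a new log-collector replica may land. The node and pod names are illustrative:

```python
def eligible_nodes(nodes, pods):
    """Return node names where a new log-collector pod may be placed.

    nodes: {node_name: labels_dict}
    pods:  list of {"node": node_name, "labels": labels_dict}
    """
    result = []
    for name, labels in nodes.items():
        # nodeSelector: only control-plane nodes labeled type=master
        if labels.get("type") != "master":
            continue
        pods_here = [p for p in pods if p["node"] == name]
        # podAffinity: an etcd pod must already be running on this node
        if not any(p["labels"].get("app") == "etcd" for p in pods_here):
            continue
        # podAntiAffinity: no log-collector (app=myapp) may already run here
        if any(p["labels"].get("app") == "myapp" for p in pods_here):
            continue
        result.append(name)
    return result

nodes = {
    "cp1": {"type": "master"},
    "cp2": {"type": "master"},
    "cp3": {"type": "master"},
    "cp4": {"type": "master"},  # control-plane node without etcd
    "worker1": {},
}
pods = [
    {"node": "cp1", "labels": {"app": "etcd"}},
    {"node": "cp2", "labels": {"app": "etcd"}},
    {"node": "cp3", "labels": {"app": "etcd"}},
    {"node": "cp1", "labels": {"app": "myapp"}},  # first replica already placed
]
print(eligible_nodes(nodes, pods))  # prints ['cp2', 'cp3']
```

With one replica already on cp1, only cp2 and cp3 remain eligible: cp4 fails the affinity rule (no etcd) and worker1 fails the nodeSelector.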
Taints and tolerations
The control-plane nodes are tainted with NoSchedule by default during cluster setup:
"taints": [
{
"effect": "NoSchedule",
"key": "node-role.kubernetes.io/control-plane"
}
]
So, we need to tolerate this taint by adding a toleration to the pod specification:
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  effect: "NoSchedule"
  operator: "Exists"
Combined Deployment Configuration
Combining all these elements, we arrive at a single deployment YAML configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-collector-deployment
spec:
  selector:
    matchLabels:
      app: myapp
  replicas: 3
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: log-collector-container
        image: nginx:latest
      nodeSelector:
        type: master # filter out worker nodes
      affinity:
        podAntiAffinity: # no multiple log-collectors on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: "kubernetes.io/hostname"
        podAffinity: # only schedule where an etcd pod is running
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            topologyKey: kubernetes.io/hostname
      tolerations: # tolerate the NoSchedule taint
      - key: "node-role.kubernetes.io/control-plane"
        effect: "NoSchedule"
        operator: "Exists"
Troubleshooting
Despite the configuration, it initially did not work. The scheduling error indicated that the node(s) didn't match the pod affinity rule. After further investigation, I realized the issue was the namespace: the etcd pods run in the kube-system namespace, while our deployment was in the default namespace. By default, a podAffinity labelSelector only matches pods in the deployment's own namespace, so the etcd pods were invisible to the rule unless the namespaces field is set explicitly.
With this added knowledge, I modified the YAML as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-collector-deployment
spec:
  selector:
    matchLabels:
      app: myapp
  replicas: 3
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: log-collector-container
        image: nginx:latest
      nodeSelector:
        type: master
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            namespaces: # the fix
            - kube-system
            topologyKey: kubernetes.io/hostname
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        effect: "NoSchedule"
        operator: "Exists"
After deploying this configuration, the log-collector pods were scheduled as desired.
Hope this helps someone and saves them some time.
Written by
Aman Gaur
I am a DevOps Engineer from India, currently residing in Halifax, Nova Scotia. I have been working in web development for 9 years professionally. I love debugging container issues and cloud implementation issues in AWS, and I love playing with Kubernetes. In my free time, I enjoy analyzing F1 data.