Ain't No Way That's Taint: Resolving PodAffinity Issues in Kubernetes
While working with taints and tolerations in Kubernetes, I ran into an interesting scenario. I had a cluster with four control-plane nodes, all but one of which ran an etcd pod. My goal was to deploy a log-collection service on each of those control-plane nodes to extract logs from the etcd pods across the cluster.
Additionally, I wanted to ensure that each log-collector pod was scheduled on a distinct control-plane node. In other words, I didn't want all the log-collectors to get scheduled on the same control-plane node.
Initial Setup
What we Had
- 4 control-plane nodes, with 3 of them running etcd.
- 3 etcd pods, each running on a distinct control-plane node.
What we Wanted
- Deploy a log-collector pod on each control-plane node that runs etcd, to collect logs from the etcd pods.
The basic idea was to use nodeSelector, podAffinity, podAntiAffinity, and taints and tolerations in conjunction. Let's break each of them down.
nodeSelector
nodeSelector is used to keep the log-collector off the worker nodes and restrict it to control-plane nodes. All control-plane nodes were labeled with type=master:
kubectl label node <control-plane-node-name> type=master
You can always verify the node labels with the following command:
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
Now, leveraging this new label we can target only the control-plane nodes to schedule our log-collector pod on.
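As a minimal sketch, the corresponding stanza in the pod spec looks like this (using the label applied above):

```yaml
nodeSelector:
  type: master   # only nodes carrying this label are considered
```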
podAffinity
To ensure that the pod is only scheduled on a control-plane node where etcd is running, we apply a podAffinity rule. First, label all etcd pods, for example with app=etcd:
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - etcd
    topologyKey: "kubernetes.io/hostname"
podAntiAffinity
Using podAntiAffinity, we can ensure that log-collector pods are not scheduled on a node that already has a log-collector pod running. This guarantees at most one log-collector per eligible control-plane node.
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - myapp
    topologyKey: "kubernetes.io/hostname"
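To see how the three rules combine, here is a hypothetical Python sketch (not the real kube-scheduler, just the filtering logic) that picks the nodes where a new log-collector replica may land. The node and pod names are illustrative:

```python
def eligible_nodes(nodes, pods):
    """Return node names where a new log-collector pod may be placed.

    nodes: {node_name: labels_dict}
    pods:  list of {"node": node_name, "labels": labels_dict}
    """
    result = []
    for name, labels in nodes.items():
        # nodeSelector: only control-plane nodes labeled type=master
        if labels.get("type") != "master":
            continue
        pods_here = [p for p in pods if p["node"] == name]
        # podAffinity: an etcd pod must already be running on this node
        if not any(p["labels"].get("app") == "etcd" for p in pods_here):
            continue
        # podAntiAffinity: no log-collector (app=myapp) may already run here
        if any(p["labels"].get("app") == "myapp" for p in pods_here):
            continue
        result.append(name)
    return result

nodes = {
    "cp1": {"type": "master"},
    "cp2": {"type": "master"},
    "cp3": {"type": "master"},
    "cp4": {"type": "master"},  # control-plane node without etcd
    "worker1": {},
}
pods = [
    {"node": "cp1", "labels": {"app": "etcd"}},
    {"node": "cp2", "labels": {"app": "etcd"}},
    {"node": "cp3", "labels": {"app": "etcd"}},
    {"node": "cp1", "labels": {"app": "myapp"}},  # first replica already placed
]
print(eligible_nodes(nodes, pods))  # prints ['cp2', 'cp3']
```

With one replica already on cp1, only cp2 and cp3 remain eligible: cp4 fails the affinity rule (no etcd) and worker1 fails the nodeSelector.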
Taints and tolerations
The control-plane nodes are tainted with NoSchedule by default during cluster setup:
"taints": [
{
"effect": "NoSchedule",
"key": "node-role.kubernetes.io/control-plane"
}
]
So, we need to tolerate this taint by adding a toleration to the pod specification:
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  effect: "NoSchedule"
  operator: "Exists"
Combined Deployment Configuration
Combining all these elements, we arrive at a single deployment YAML configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-collector-deployment
spec:
  selector:
    matchLabels:
      app: myapp
  replicas: 3
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: log-collector-container
        image: nginx:latest
      nodeSelector:
        type: master # filter out worker nodes
      affinity:
        podAntiAffinity: # no multiple log-collectors on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: "kubernetes.io/hostname"
        podAffinity: # only schedule where an etcd pod is running
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            topologyKey: kubernetes.io/hostname
      tolerations: # tolerate the NoSchedule taint
      - key: "node-role.kubernetes.io/control-plane"
        effect: "NoSchedule"
        operator: "Exists"
Troubleshooting
Despite the configuration, it initially did not work. The scheduling error indicated that the node(s) didn't match the pod affinity rule. After further investigation, I realized the issue was the namespace: the etcd pods run in the kube-system namespace, while our deployment was in the default namespace. By default, a podAffinity labelSelector only matches pods in the deployment's own namespace, so the etcd pods were invisible to the rule unless the namespaces field is set explicitly.
With this added knowledge, I modified the YAML as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-collector-deployment
spec:
  selector:
    matchLabels:
      app: myapp
  replicas: 3
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: log-collector-container
        image: nginx:latest
      nodeSelector:
        type: master
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            namespaces: # the fix
            - kube-system
            topologyKey: kubernetes.io/hostname
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        effect: "NoSchedule"
        operator: "Exists"
After deploying this configuration, the log-collector pods were scheduled as desired.
Hope this helps someone and saves them some time.
Written by
Aman Gaur
I am a DevOps Engineer from India, currently residing in Halifax, Nova Scotia. I have been working in web development for 9 years professionally. I love debugging container issues and cloud implementation issues in AWS, and I love playing with Kubernetes. In my free time, I enjoy analyzing F1 data.