Master Kubernetes Pod Scheduling: Taints, Tolerations, NodeSelector & Affinity Explained

As Kubernetes users, we all know that the kube-scheduler is the control-plane component responsible for scheduling pods/workloads onto nodes in a Kubernetes cluster. By default, the kube-scheduler checks resource availability on the worker nodes and schedules a pod on any qualified node.
But sometimes we need to schedule pods on specific nodes based on our requirements, like running ML workloads on GPU-based nodes, database pods on disks that support high IOPS, and so on. So there might be situations where we need to:
i. Restrict pods from being scheduled on certain nodes.
ii. Force pods to run on specific nodes.
iii. Effectively place pods across different nodes to ensure high availability.
No worries. K8s provides the features to achieve these requirements: Taints and Tolerations, NodeSelector, and Affinity.
1. Taints & Tolerations: Prevent Pods from Running on Certain Nodes
Taints are useful when we want to prevent pods from being scheduled on certain nodes unless they carry a matching toleration. You might have observed that pods never get scheduled on the control-plane nodes; that’s because control-plane nodes are tainted by default so that regular workloads don’t land on them.
First, we need to taint the node:
kubectl taint node <node-name> <key>=<value>:<taint-effect>
The available taint-effects are:
i. NoSchedule: Pods without a matching toleration will not be scheduled on the tainted node. So a pod will remain in the Pending state if no other qualifying node is found.
ii. PreferNoSchedule: The scheduler tries not to place the pod on the tainted node, but if no other suitable node is available, the pod still gets scheduled on it. So the pod won’t be stuck in the Pending state.
iii. NoExecute: Pods already running on the tainted node are evicted immediately if they don’t have a matching toleration, unlike the other two effects, which don’t check taints for already-running pods.
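For instance (using a placeholder node name and key-value pair rather than the demo cluster), applying each effect looks like this, and kubectl describe node shows the taints a node currently has:
kubectl taint node worker-1 dedicated=db:NoSchedule        # block new pods without a matching toleration
kubectl taint node worker-1 dedicated=db:PreferNoSchedule  # soft version: avoid the node if possible
kubectl taint node worker-1 dedicated=db:NoExecute         # also evict running pods without a toleration
kubectl describe node worker-1 | grep Taints               # inspect the taints on a node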
Here’s a practical example of using taints and tolerations. I have created a deployment that creates 3 replicas of busybox pods.
deployment.yaml file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-scheduler
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: busybox
          image: busybox
          resources:
            limits:
              cpu: "100m"
              memory: "64Mi"
          command: ["sleep", "3600"]
When I apply this deployment without any scheduling constraints, the kube-scheduler distributes the pods across all the nodes.
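To check which node each pod landed on, you can run:
kubectl get pods -o wide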
Now, I will taint the minikube-m03 node and see what happens.
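For example, a NoExecute taint that matches the toleration shown later in this post would be:
kubectl taint node minikube-m03 gpu=true:NoExecute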
The NoExecute taint effect on the node minikube-m03 evicted the pod running there because that pod didn’t have a matching toleration for the taint, and a new pod was created on another node because the deployment controller ensures that the desired state always matches the actual state of the cluster.
Now, if I add the following toleration to the spec.template.spec section of the above deployment, pods with this toleration will be allowed to schedule on the tainted node (minikube-m03), but it’s not guaranteed that they will actually land on that node.
tolerations:
  - key: gpu
    value: "true"
    effect: NoExecute
    operator: Equal
In order to remove the taint on the node, the command will be:
kubectl taint node <node-name> <key>=<value>:<taint-effect>-
The only change is the hyphen (-) added at the end. Keep in mind that the NoExecute taint effect is the only one that also checks already-running pods.
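In this example, removing the taint applied above would look like:
kubectl taint node minikube-m03 gpu=true:NoExecute-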
Now, what if we want pods to be scheduled on a specific node or a group of nodes? Here come nodeSelector and Node Affinity into action.
2. NodeSelector & Node Affinity: Which One to Use?
Node Selectors
Node selectors and node affinity are ways to force a pod to be scheduled only on particular matching nodes, based on node labels. nodeSelector is the simplest way to do this. First, we need to label the node with a key-value pair:
kubectl label node <node-name> <key>=<value>
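For example, to label the node used in this demo with the same key-value pair the nodeSelector below expects:
kubectl label node minikube-m02 performance=low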
I have also updated the spec.template.spec section of the deployment with:
nodeSelector:
  performance: low
After making the changes and applying the deployment config file, here’s what happens.
All the pods are scheduled on that labelled node, i.e. minikube-m02. But if the nodeSelector label doesn’t match the labels of any node, the pod will remain in the Pending state. Let’s see the Pending state as well by removing the label from the node (minikube-m02) by adding a hyphen (-) at the end:
kubectl label node minikube-m02 performance-
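Keep in mind that removing a label doesn’t evict pods that are already running (nodeSelector is only enforced at scheduling time), so to reproduce the Pending state you can recreate the pods, for example by deleting and re-applying the deployment:
kubectl delete -f deployment.yaml
kubectl apply -f deployment.yaml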
We can see that the pods are in the Pending state. So nodeSelector is strict in nature. But sometimes we want more flexibility, similar to the PreferNoSchedule taint effect. Yup, we have Node Affinity to provide that flexibility. Let’s have a quick look at it.
Node Affinity
There are two types of node affinity, and they behave much like the NoSchedule and PreferNoSchedule taint effects:
i. requiredDuringSchedulingIgnoredDuringExecution: It’s a strict type, meaning there must be a node with matching labels for the pod to be scheduled; otherwise, the pod remains in the Pending state.
ii. preferredDuringSchedulingIgnoredDuringExecution: It tries to schedule the pod on a node that has matching labels, but if no node with those labels exists, the pod is scheduled on any node chosen by the kube-scheduler’s default algorithm.
Are you also wondering about the suffix of these types, IgnoredDuringExecution? It means these rules apply only at scheduling time. The phrase says it all: node affinity has no control over already-running pods, and any changes to node labels will not affect pods that are already scheduled. They will keep running as they were earlier. Let’s have a practical demonstration.
First, let’s put the previous label back on minikube-m02:
kubectl label node minikube-m02 performance=low
Let’s add the following affinity to the spec.template.spec section of the deployment.yaml file, and remove the previous nodeSelector field from that section.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: performance
              operator: In
              values:
                - low
We can see that we have more flexibility while matching labels: multiple key-value pairs can be used, and multiple values for a single key can also be used. Additionally, different operators like In, Exists, NotIn, and DoesNotExist can be used (see the sketch after the link below). Check out more about operators here:
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
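As a rough illustration (the disktype and zone keys here are hypothetical labels, not ones from the demo cluster), a nodeSelectorTerms entry combining several operators could look like this:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: performance          # must be one of the listed values
              operator: In
              values:
                - low
                - medium
            - key: disktype             # hypothetical label; the node just needs to have it
              operator: Exists
            - key: zone                 # hypothetical label; must not be any of the listed values
              operator: NotIn
              values:
                - zone-c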
See, all the pods are scheduled on minikube-m02. It works the same as nodeSelector but with more flexibility.
Now, let’s have a look at the preferred (PreferNoSchedule-like) affinity.
Add this section to the spec.template.spec section of the deployment.yaml file and remove the node label from minikube-m02.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 5
        preference:
          matchExpressions:
            - key: performance
              operator: In
              values:
                - low
Importance of the weight field
The weight field acts as a priority value assigned to a node when it matches the specified preference. Each preference term can have a different weight. If multiple nodes satisfy the preferences, the kube-scheduler sums the weights of all matching terms for each node and favours the node with the highest total weight when scheduling the pod. For example, consider two matching expressions:
Expression_1 with a weight of 10
Expression_2 with a weight of 20
Now, assume:
Node_1 satisfies both Expression_1 and Expression_2
Node_2 satisfies only Expression_2
The total weights would be:
Node_1: 10 (Expression_1) + 20 (Expression_2) = 30
Node_2: 20 (Expression_2) = 20
So, the scheduler prefers Node_1 and schedules the pod there.
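A sketch of what those two preference terms could look like in a deployment (tier and disktype are hypothetical labels standing in for Expression_1 and Expression_2):
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10                      # Expression_1
        preference:
          matchExpressions:
            - key: tier                 # hypothetical label
              operator: In
              values:
                - backend
      - weight: 20                      # Expression_2
        preference:
          matchExpressions:
            - key: disktype             # hypothetical label
              operator: In
              values:
                - ssd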
Let’s see the cluster status after deploying the updated deployment.yaml file. The pods’ affinity didn’t match the labels of any node, so they were scheduled on the nodes by the kube-scheduler’s default algorithm. If we had used requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution, all the pods would have gone into the Pending state.
To wrap things up, I would say nodeAffinity provides more flexible control over pod scheduling but is more complex to use. For strict matching of labels, it’s simpler to use nodeSelector, or even the nodeName field, which places the pod directly on the single node with the matching name (a minimal sketch follows below).
Check out more on nodeAffinity here:
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
Check out more on nodeName here:
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
Key Points to Remember:
i. Taints are used to prevent pods from being scheduled on nodes, and tolerations let a pod overcome a matching taint and get scheduled on that node.
ii. Even if a pod has a matching toleration, it’s not guaranteed that the pod will be scheduled on the tainted node.
iii. nodeSelector, nodeAffinity, and nodeName are the ways to force a pod to be scheduled on a particular node.
If you want to go deeper into pod scheduling in Kubernetes, check out the link below:
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
Connect with me on:
🔗 LinkedIn: linkedin.com/in/anjal-poudel-8053a62b8