Day 26. Kubernetes Scheduling
What is Scheduling in Kubernetes
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described below.
kube-scheduler
kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane. kube-scheduler is designed so that, if you want and need to, you can write your own scheduling component and use that instead.
Kube-scheduler selects an optimal node to run newly created or not yet scheduled (unscheduled) pods. Since containers in pods - and pods themselves - can have different requirements, the scheduler filters out any nodes that don't meet a Pod's specific scheduling needs.
In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Node selection in kube-scheduler
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod.
For example, the PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
There are two supported ways to configure the filtering and scoring behavior of the scheduler:
Scheduling Policies allow you to configure Predicates for filtering and Priorities for scoring.
Scheduling Profiles allow you to configure Plugins that implement different scheduling stages, including: QueueSort, Filter, Score, Bind, Reserve, Permit, and others. You can also configure the kube-scheduler to run different profiles.
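As a rough sketch (assuming a cluster whose kube-scheduler supports the kubescheduler.config.k8s.io/v1 configuration API), a scheduling profile that disables one score plugin and raises the weight of another could look like this; the plugin choices here are only illustrative:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        # Illustrative only: remove topology spreading from scoring
        disabled:
          - name: PodTopologySpread
        # and weight resource fit more heavily
        enabled:
          - name: NodeResourcesFit
            weight: 2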
Assigning Pods to Nodes
You can constrain a Pod so that it is restricted to run on particular node(s), or to prefer to run on particular nodes. There are several ways to do this and the recommended approaches all use label selectors to facilitate the selection.
You can use any of the following methods to choose where Kubernetes schedules specific Pods:
nodeSelector field matching against node labels
Affinity and anti-affinity
nodeName field
Pod topology spread constraints
Node labels
Like many other Kubernetes objects, nodes have labels. You can attach labels manually. Kubernetes also populates a standard set of labels on all nodes in a cluster.
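You can view the labels on your nodes, including standard ones such as kubernetes.io/hostname, kubernetes.io/os and kubernetes.io/arch, with:
kubectl get nodes --show-labels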
Labels and Selectors
Labels and Selectors are the standard way to group things together in Kubernetes.
Labels are key-value properties attached to each item.
Selectors help you filter items based on those labels.
How do you specify labels?
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
  labels:
    app: App1
    function: Front-end
spec:
  containers:
  - name: simple-webapp
    image: simple-webapp
    ports:
    - containerPort: 8080
Once the pod is created, to select pods with specific labels, run the command below:
kubectl get pods --selector app=App1
Node isolation/restriction
The NodeRestriction admission plugin prevents the kubelet from setting or modifying labels with a node-restriction.kubernetes.io/ prefix.
To make use of that label prefix for node isolation:
Ensure you are using the Node authorizer and have enabled the NodeRestriction admission plugin.
Add labels with the node-restriction.kubernetes.io/ prefix to your nodes, and use those labels in your node selectors. For example, example.com.node-restriction.kubernetes.io/fips=true or example.com.node-restriction.kubernetes.io/pci-dss=true.
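For example (the node name is a placeholder), you could label a node with the restricted prefix and then reference that same label from a Pod's nodeSelector as example.com.node-restriction.kubernetes.io/fips: "true":
kubectl label nodes <node-name> example.com.node-restriction.kubernetes.io/fips=true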
nodeSelector
nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to your Pod specification and specify the node labels you want the target node to have.
We add a new property called nodeSelector to the spec section and specify the label. The scheduler uses these labels to match and identify the right node to place the pods on.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: data-processor
    image: data-processor
  nodeSelector:
    size: Large
To label a node
kubectl label nodes <node-name> <label-key>=<label-value>
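For example, to attach the size=Large label expected by the manifest above to a node called node01 (the node name is illustrative):
kubectl label nodes node01 size=Large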
To create a pod definition
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: data-processor
    image: data-processor
  nodeSelector:
    size: Large
kubectl create -f pod-definition.yml
Node affinity
Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes your Pod can be scheduled on based on node labels. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
Node affinity weight
You can specify a weight between 1 and 100 for each instance of the preferredDuringSchedulingIgnoredDuringExecution affinity type. When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum.
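As a sketch, a Pod that must land on a node labeled size=Large and prefers nodes labeled disktype=ssd (both label keys are illustrative) could combine the two affinity types like this:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: data-processor
    image: data-processor
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes with size=Large are feasible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - Large
      # Soft preference: among feasible nodes, favour disktype=ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd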
Taints and Tolerations
Taints and tolerations are used to set restrictions on which pods can be scheduled on a node. Only pods that tolerate a particular taint on a node will get scheduled on that node.
Taints
Use the kubectl taint nodes command to taint a node.
Syntax
$ kubectl taint nodes <node-name> key=value:taint-effect
Example
$ kubectl taint nodes node1 app=blue:NoSchedule
The taint effect defines what happens to pods that do not tolerate the taint.
There are 3 taint effects:
NoSchedule
PreferNoSchedule
NoExecute
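NoSchedule blocks new pods that do not tolerate the taint, PreferNoSchedule is a soft version of that rule, and NoExecute additionally evicts pods already running on the node that do not tolerate it. To remove a taint later, repeat the same command with a trailing minus sign:
$ kubectl taint nodes node1 app=blue:NoSchedule-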
Tolerations
Tolerations are added to pods by adding a tolerations section in the pod definition.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
Resource Limits
Let us take a look at a 3-node Kubernetes cluster.
Each node has a set of CPU, Memory and Disk resources available.
If sufficient resources are not available on any of the nodes, Kubernetes holds back scheduling the pod. You will see the pod in a Pending state, and if you look at its events, you will see the reason as insufficient CPU.
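To confirm this, list the pods and describe the one that is stuck (the pod name below is a placeholder):
kubectl get pods
kubectl describe pod <pod-name>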
We can specify resource requests and resource limits in the pod definition file.
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp-color
  labels:
    name: simple-webapp-color
spec:
  containers:
  - name: simple-webapp-color
    image: simple-webapp-color
    ports:
    - containerPort: 8080
    resources:
      requests:
        memory: "1Gi"
        cpu: "1"
      limits:
        memory: "2Gi"
        cpu: "2"
Daemonsets
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
For a DaemonSet, we start with apiVersion, kind set to DaemonSet instead of ReplicaSet, metadata, and spec.
To build it as a DaemonSet, execute the following code block:
cat daemonset-free.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ram-check
spec:
  selector:
    matchLabels:
      name: checkRam
  template:
    metadata:
      labels:
        name: checkRam
    spec:
      containers:
      - name: ubuntu-free
        image: ubuntu
        command: ["/bin/bash", "-c", "while true; do free; sleep 30; done"]
      restartPolicy: Always
To create a daemonset from a definition file
kubectl create -f daemonset-free.yaml
View DaemonSets
To list daemonsets
kubectl get daemonsets
For more details of the daemonset
kubectl describe daemonset ram-check
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Unlike Pods that are managed by the control plane (for example, a Deployment), the kubelet watches each static Pod (and restarts it if it fails).
Create a static pod
You can configure a static Pod with either a file system hosted configuration file or a web hosted configuration file.
Filesystem-hosted static Pod manifest
1. Choose a node where you want to run the static Pod. In this example, it's my-node1.
ssh my-node1
2. Choose a directory, say /etc/kubernetes/manifests, and place a web server Pod definition there, for example /etc/kubernetes/manifests/static-web.yaml:
## Run this command on the node where kubelet is running
mkdir -p /etc/kubernetes/manifests/
cat <<EOF >/etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
EOF
3. Configure the kubelet on that node to set a staticPodPath value in the kubelet configuration file.
4. Restart the kubelet. On Fedora, you would run:
## Run this command on the node where the kubelet is running
systemctl restart kubelet
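As a sketch, the staticPodPath setting in the kubelet configuration file (commonly /var/lib/kubelet/config.yaml, though the path varies by installation) would look like this:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests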
Web-hosted static pod manifest
Create a YAML file and store it on a web server so that you can pass the URL of that file to the kubelet.
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
Configure the kubelet on your selected node to use this web manifest by running it with --manifest-url=<manifest-url>. On Fedora, edit /etc/kubernetes/kubelet to include this line:
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --manifest-url=<manifest-url>"
Restart the kubelet. On Fedora, you would run:
## Run this command on the node where the kubelet is running
systemctl restart kubelet
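After the kubelet restarts, it creates a mirror Pod on the API server for each static Pod it runs, so the Pod becomes visible (though not controllable) via kubectl. For a node named my-node1, the mirror Pod would typically appear with the node name appended, e.g. static-web-my-node1:
kubectl get pods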