Day 26. Kubernetes Scheduling

Ashvini Mahajan

What is Scheduling in Kubernetes

A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described below.

kube-scheduler

kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane. kube-scheduler is designed so that, if you want and need to, you can write your own scheduling component and use that instead.

Kube-scheduler selects an optimal node to run newly created or not yet scheduled (unscheduled) pods. Since containers in pods - and pods themselves - can have different requirements, the scheduler filters out any nodes that don't meet a Pod's specific scheduling needs.

In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.

Node selection in kube-scheduler

kube-scheduler selects a node for the pod in a 2-step operation:

  1. Filtering

  2. Scoring

The filtering step finds the set of Nodes where it's feasible to schedule the Pod.

For example, the PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.

In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.

There are two supported ways to configure the filtering and scoring behavior of the scheduler:

  1. Scheduling Policies allow you to configure Predicates for filtering and Priorities for scoring.

  2. Scheduling Profiles allow you to configure Plugins that implement the different scheduling stages, including: QueueSort, Filter, Score, Bind, Reserve, Permit, and others. You can also configure the kube-scheduler to run different profiles; a minimal profile configuration is sketched below.
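
As a rough sketch of a scheduling profile, the configuration below would be passed to kube-scheduler via its --config flag. The plugin chosen (NodeResourcesBalancedAllocation) and the weight are only examples, and older clusters may need apiVersion v1beta3 instead of v1.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      # Disable and re-enable a default score plugin to change its weight.
      disabled:
      - name: NodeResourcesBalancedAllocation
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 2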

Assigning Pods to Nodes

You can constrain a Pod so that it is restricted to run on particular node(s), or to prefer to run on particular nodes. There are several ways to do this and the recommended approaches all use label selectors to facilitate the selection.

You can use any of the following methods to choose where Kubernetes schedules specific Pods:

  • nodeSelector field matching against node labels

  • Affinity and anti-affinity

  • nodeName field

  • Pod topology spread constraints

Node labels

Like many other Kubernetes objects, nodes have labels. You can attach labels manually. Kubernetes also populates a standard set of labels on all nodes in a cluster.
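
To see the labels that are already on your nodes, including well-known ones such as kubernetes.io/hostname and kubernetes.io/os, run:

kubectl get nodes --show-labels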

Labels and Selectors

Labels and Selectors are standard methods to group things together.

Labels are properties attached to each item.

Selectors help you to filter these items.

How do you specify labels?

 apiVersion: v1
 kind: Pod
 metadata:
  name: simple-webapp
  labels:
    app: App1
    function: Front-end
 spec:
  containers:
  - name: simple-webapp
    image: simple-webapp
    ports:
    - containerPort: 8080

Once the pod is created, run the command below to select pods by their labels:

kubectl get pods --selector app=App1
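
Selectors can also combine several labels (the comma means AND); for example, to match both labels from the definition above:

kubectl get pods --selector app=App1,function=Front-end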

Node isolation/restriction

The NodeRestriction admission plugin prevents the kubelet from setting or modifying labels with a node-restriction.kubernetes.io/ prefix.

To make use of that label prefix for node isolation:

  1. Ensure you are using the Node authorizer and have enabled the NodeRestriction admission plugin.

  2. Add labels with the node-restriction.kubernetes.io/ prefix to your nodes, and use those labels in your node selectors. For example, example.com.node-restriction.kubernetes.io/fips=true or example.com.node-restriction.kubernetes.io/pci-dss=true.
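
As a sketch of those two steps, you could label a node and then reference the same label from a Pod's nodeSelector. The node name my-node1 and the Pod name fips-only-pod here are placeholders:

kubectl label nodes my-node1 example.com.node-restriction.kubernetes.io/fips=true

apiVersion: v1
kind: Pod
metadata:
  name: fips-only-pod
spec:
  containers:
  - name: web
    image: nginx
  nodeSelector:
    example.com.node-restriction.kubernetes.io/fips: "true"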

nodeSelector

nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to your Pod specification and specify the node labels you want the target node to have.

We add a new property called nodeSelector to the spec section and specify the label.

The scheduler uses these labels to match and identify the right node to place the pods on.

apiVersion: v1
kind: Pod
metadata:
 name: myapp-pod
spec:
 containers:
 - name: data-processor
   image: data-processor
 nodeSelector:
  size: Large

Label nodes

kubectl label nodes <node-name> <label-key>=<label-value>

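For example, to put the size=Large label used by the Pod above on a node (worker-1 is a placeholder node name):

kubectl label nodes worker-1 size=Large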

To create the pod from the definition file above:

kubectl create -f pod-definition.yml

Node affinity

Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes your Pod can be scheduled on based on node labels. There are two types of node affinity:

  • requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.

  • preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.

Node affinity weight

You can specify a weight between 1 and 100 for each instance of the preferredDuringSchedulingIgnoredDuringExecution affinity type. When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum.
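
Putting the two affinity types together, a sketch of a Pod that requires a size label and prefers Large nodes might look like this. The size key and the values Large and Medium are carried over from the nodeSelector example above, not labels Kubernetes defines:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
  - name: data-processor
    image: data-processor
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - Large
            - Medium
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: size
            operator: In
            values:
            - Large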

Taints and Tolerations

Taints and Tolerations are used to set restrictions on what pods can be scheduled on a node.

  • Only pods which are tolerant to the particular taint on a node will get scheduled on that node.

Taints

Use the kubectl taint nodes command to taint a node.

Syntax

$ kubectl taint nodes <node-name> key=value:taint-effect

Example

$ kubectl taint nodes node1 app=blue:NoSchedule

  • The taint effect defines what would happen to the pods if they do not tolerate the taint.

  • There are 3 taint effects

    • NoSchedule

    • PreferNoSchedule

    • NoExecute
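
To remove a taint, run the same command with a trailing hyphen after the effect; for example, to remove the taint applied above:

kubectl taint nodes node1 app=blue:NoSchedule-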

Tolerations

Tolerations are added to pods by adding a tolerations section in the pod definition.

apiVersion: v1
kind: Pod
metadata:
 name: myapp-pod
spec:
 containers:
 - name: nginx-container
   image: nginx
 tolerations:
 - key: "app"
   operator: "Equal"
   value: "blue"
   effect: "NoSchedule"

Resource Limits

Let us take a look at a 3-node Kubernetes cluster.

  • Each node has a set of CPU, memory, and disk resources available.

  • If no node has sufficient resources available, Kubernetes holds off scheduling the pod. You will see the pod in a Pending state, and if you look at the events, you will see the reason as insufficient CPU.

  • We can specify resource requests and limits in the pod definition file.

apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp-color
  labels:
    name: simple-webapp-color
spec:
  containers:
  - name: simple-webapp-color
    image: simple-webapp-color
    ports:
    - containerPort: 8080
    resources:
      requests:
        memory: "1Gi"
        cpu: "1"
      limits:
        memory: "2Gi"
        cpu: "2"
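
If no node has enough unreserved CPU or memory to satisfy these requests, the Pod stays in Pending; one way to see the scheduler's reason in the events (using the pod name from the definition above) is:

kubectl describe pod simple-webapp-color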

DaemonSets

A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.

For a DaemonSet, the definition file starts with apiVersion, kind (DaemonSet instead of ReplicaSet), metadata, and spec.

To build it as a DaemonSet, create a definition file like the one below:

cat daemonset-free.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ram-check
spec:
  selector:
    matchLabels:
      name: checkRam
  template:
    metadata:
      labels:
        name: checkRam
    spec:
      containers:
      - name: ubuntu-free
        image: ubuntu
        command: ["/bin/bash", "-c", "while true; do free; sleep 30; done"]
      restartPolicy: Always

To create a DaemonSet from the definition file:

kubectl create -f daemonset-free.yaml

View DaemonSets

To list DaemonSets:

kubectl get daemonsets

For more details of a DaemonSet:

kubectl describe daemonset ram-check

Static Pods

Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Unlike Pods that are managed by the control plane (for example, by a Deployment), static Pods are watched by the kubelet itself, which restarts them if they fail.

Create a static pod

You can configure a static Pod with either a file-system-hosted configuration file or a web-hosted configuration file.

Filesystem-hosted static Pod manifest

  1. Choose a node where you want to run the static Pod. In this example, it's my-node1.

     ssh my-node1
    
  2. Choose a directory, say /etc/kubernetes/manifests and place a web server Pod definition there, for example /etc/kubernetes/manifests/static-web.yaml:

     ## Run this command on the node where kubelet is running
     mkdir -p /etc/kubernetes/manifests/
     cat <<EOF >/etc/kubernetes/manifests/static-web.yaml
     apiVersion: v1
     kind: Pod
     metadata:
       name: static-web
       labels:
         role: myrole
     spec:
       containers:
         - name: web
           image: nginx
           ports:
             - name: web
               containerPort: 80
               protocol: TCP
     EOF
    
  3. Configure the kubelet on that node to set a staticPodPath value in the kubelet configuration file (a minimal snippet is shown after these steps).

  4. Restart the kubelet. On Fedora, you would run:

     ## Run this command on the node where the kubelet is running
     systemctl restart kubelet
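
For reference, a minimal kubelet configuration file that sets staticPodPath might look like the snippet below; where the kubelet reads its configuration file from depends on how it was installed:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests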
    

Web-hosted static pod manifest

  1. Create a YAML file and store it on a web server so that you can pass the URL of that file to the kubelet.

     apiVersion: v1
     kind: Pod
     metadata:
       name: static-web
       labels:
         role: myrole
     spec:
       containers:
         - name: web
           image: nginx
           ports:
             - name: web
               containerPort: 80
               protocol: TCP
    
  2. Configure the kubelet on your selected node to use this web manifest by running it with --manifest-url=<manifest-url>. On Fedora, edit /etc/kubernetes/kubelet to include this line:

     KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --manifest-url=<manifest-url>"
    
  3. Restart the kubelet. On Fedora, you would run:

     ## Run this command on the node where the kubelet is running
     systemctl restart kubelet
    