Learning Kubernetes: Week 2 – Scheduling, Monitoring & Workload Management


Hi! This week was full of surprises, so I couldn't get through many core concepts. Instead, I covered topics that help manage Kubernetes resources.
Scheduling
Manual Scheduling
- Every pod definition has a `nodeName` field, which is unset by default; Kubernetes sets it automatically during scheduling, so you don't normally set it yourself
- The scheduler looks through all the pods for those that do not have this property set → those are the candidates for scheduling
- Once a candidate is identified, the scheduler schedules the pod on a node by setting the `nodeName` property to the name of that node, via a Binding object
What happens when there is no scheduler to monitor and schedule the pods?
- Pods remain in the Pending state
- You can manually set the `nodeName` field to schedule a pod on a node
- You can only specify `nodeName` at pod creation time
What if the pod is already created and you want to assign it to a different node?
- K8s won't allow you to modify the `nodeName` property of a running pod
- Instead, you can create a Binding object and send a request to set the `nodeName` field

pod-bind-definition.yml

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: <nodeName>
```
Create the JSON equivalent of this file and send the request like this:

```shell
curl --header "Content-Type:application/json" --request POST \
  --data '{"apiVersion":"v1",.....}' \
  http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
```
Labels and selectors
When creating objects in K8s, we might end up with hundreds of objects, so it becomes necessary to filter them by type, application, or functionality
You can group and select objects via Labels and Selectors
For each object, attach labels as per your needs in the metadata field
While selecting, specify a condition to filter specific objects
pod-definition.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
  labels:
    app: App1
    function: Front-end
spec:
  containers:
    - name: simple-webapp
      image: simple-webapp
      ports:
        - containerPort: 8080
```
```shell
# to select pods via labels
kubectl get pods --selector app=App1
```
K8s objects use labels and selectors internally to connect different objects
E.g.: In Replicaset
replicaset.yaml
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  replicas: 3
  selector:           # identifies which pods belong to this ReplicaSet, since it can also adopt pods not created by this file
    matchLabels:
      type: front-pod
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-pod
    spec:
      containers:
        - name: nginx-container
          image: nginx
```
Annotations
- These are used to record other details for informational purposes, e.g., build versions, contact details, or other metadata used by tools
Management of K8s Objects
- K8s objects should be managed using only one technique. Mixing and matching techniques for the same object results in undefined behavior.
| Management technique | Operates on | Recommended environment | Supported writers | Learning curve |
| --- | --- | --- | --- | --- |
| Imperative commands | Live objects | Development projects | 1+ | Lowest |
| Imperative object configuration | Individual files | Production projects | 1 | Moderate |
| Declarative object configuration | Directories of files | Production projects | 1+ | Highest |
Imperative Command
- A user operates directly on live objects in a cluster
- The user provides operations to the `kubectl` command as arguments or flags

Advantages
- Commands are expressed as a single action word
- Commands require only a single step to make changes to the cluster
Disadvantages
Commands do not integrate with the change review process
Commands do not provide an audit trail associated with changes
Commands do not provide a source of records except for what is live
Commands do not provide a template for creating new objects
Imperative Object Config
- Here the `kubectl` command specifies the operation, optional flags, and at least one file name
- The file specified must contain a full definition of the object in YAML or JSON format
- The `replace` command replaces the existing spec with the newly provided one, dropping all changes to the object that are missing from the configuration file. This approach should not be used with resource types whose specs are updated independently of the configuration file. Services of type `LoadBalancer`, for example, have their `externalIPs` field updated by the cluster, independently of the configuration.

Eg:

```shell
kubectl create -f nginx.yaml
```
Advantages
Object configuration can be stored in a source control system such as Git.
Object configuration can integrate with processes such as reviewing changes before push and audit trails.
Object configuration provides a template for creating new objects.
Disadvantages
Object configuration requires a basic understanding of the object schema.
Object configuration requires the additional step of writing a YAML file.
Declarative Object Config
- The user operates on object configuration files stored locally
- Create, update, and delete operations are automatically detected per object; the user doesn't specify the operation
- It uses the `patch` API operation to write only the observed differences, instead of using the `replace` API operation to replace the entire object configuration

Eg:

```shell
kubectl diff -R -f configs/
kubectl apply -R -f configs/   # -R → recursively process directories
```
Advantages
Changes made directly to live objects are retained, even if they are not merged back into the configuration files.
Declarative object configuration has better support for operating on directories and automatically detecting operation types (create, patch, delete) per-object.
Disadvantages
Declarative object configuration is harder to debug and understand results when they are unexpected.
Partial updates using diffs create complex merge and patch operations
Taints and Tolerations
- These are used as restrictions on what pods can be scheduled on a node
- When pods are created, the K8s scheduler tries to place them on the available worker nodes
- Say we want Node 1 (out of 3) reserved for specific pods → we first add a taint (e.g., `blue`) on that node
- This means that, unless specified otherwise, none of the pods can tolerate the taint → so none of them will be placed on Node 1
- To allow, say, pod D (out of A, B, C, D) to be placed on Node 1 → we apply a toleration on pod D → D is now tolerant to the taint `blue`
- Now, when the scheduler tries to place pod D on Node 1 → it goes through
How to set these?

Syntax:

```shell
kubectl taint nodes node-name key=value:taint-effect
```

The `taint-effect` defines what happens to pods that do not tolerate the taint. There are 3 taint effects:
- `NoSchedule` → Strict: no new pods are scheduled on the node unless they have a matching toleration. Existing pods are not affected.
- `PreferNoSchedule` → Soft: Kubernetes tries to avoid scheduling pods on the node, but it's not enforced.
- `NoExecute` → Evicts existing pods that don't tolerate the taint, and also prevents new pods from scheduling unless they tolerate it.
NoExecute behavior with `tolerationSeconds`:

| tolerationSeconds | Behavior |
| --- | --- |
| Not set | Pod stays on the node forever. |
| Set to a number | Pod is allowed to stay for that number of seconds, then it's evicted. |
| No matching toleration | Pod is evicted immediately. |
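As a sketch, here is what a toleration with `tolerationSeconds` could look like in a pod spec; the key, value, and duration are illustrative:

```yaml
# Hypothetical pod spec fragment: tolerate the NoExecute taint app=blue,
# but only for 300 seconds before the pod is evicted
tolerations:
  - key: app
    operator: "Equal"
    value: "blue"
    effect: "NoExecute"
    tolerationSeconds: 300
```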
eg:

```shell
kubectl taint nodes node1 app=blue:NoSchedule
```
Adding a toleration to a pod

pod-definition.yml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: nginx-container
      image: nginx
  tolerations:
    - key: app
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
```
- Of course, this doesn't guarantee that certain pods land on certain nodes; it only controls which pods a node accepts
- The scheduler doesn't schedule any pods on the master node → because of a taint applied to it by default

```shell
kubectl describe node <node-name> | grep Taint
```
When you define a toleration, you can use an operator. The two possible operators are `Equal` (the default) and `Exists`. If you don't specify the operator explicitly, it defaults to `Equal`.
A toleration matches a taint if:
- The `key` is the same
- The `effect` is the same (e.g., `NoSchedule`, `PreferNoSchedule`, or `NoExecute`)
- One of the following is true:
    - The operator is `Exists` → in this case, no `value` should be specified
    - The operator is `Equal` (the default) → and the values must match
If the toleration has an empty `key`:
- The operator must be `Exists`
- This means it can match taints with any key (wildcard behavior), but the effect still needs to match

If the toleration has an empty `effect`:
- It can match taints with any effect, but only those with the matching `key`
- The default Kubernetes scheduler takes taints and tolerations into account when selecting a node to run a particular Pod. However, if you manually specify `.spec.nodeName` for a Pod, that bypasses the scheduler; the Pod is bound onto the node you assigned, even if there are `NoSchedule` taints on that node.
- `NoExecute` taints still apply, though: the kubelet will evict the Pod unless an appropriate toleration is set.
- You can put multiple taints on the same node and multiple tolerations on the same pod.
- Kubernetes first processes all of a node's taints as filters, then ignores the ones for which the pod has a matching toleration.
The built-in taints
- `node.kubernetes.io/not-ready` → Node is not ready; corresponds to the NodeCondition `Ready` being `false`
- `node.kubernetes.io/unreachable` → Node is unreachable; NodeCondition `Ready` is `Unknown`
- `node.kubernetes.io/memory-pressure` → Node has memory pressure
- `node.kubernetes.io/disk-pressure` → Node has disk pressure
- `node.kubernetes.io/pid-pressure` → Node has PID pressure
- `node.kubernetes.io/network-unavailable` → Node's network is unavailable
- `node.kubernetes.io/unschedulable` → Node is unschedulable
- `node.cloudprovider.kubernetes.io/uninitialized` → When the kubelet is started with an "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.
DaemonSet pods are created with `NoExecute` tolerations for the following taints, with no `tolerationSeconds`:
- `node.kubernetes.io/unreachable`
- `node.kubernetes.io/not-ready`

This ensures that DaemonSet pods are never evicted due to these problems.

The DaemonSet controller also automatically adds the following `NoSchedule` tolerations to all daemons, to prevent DaemonSets from breaking:
- `node.kubernetes.io/memory-pressure`
- `node.kubernetes.io/disk-pressure`
- `node.kubernetes.io/pid-pressure` (1.14 or later)
- `node.kubernetes.io/unschedulable` (1.10 or later)
- `node.kubernetes.io/network-unavailable` (host network only)
The scheduler checks taints, not node conditions, when it makes scheduling decisions. This ensures that node conditions don't directly affect scheduling.
Sometimes, only one device on a node is faulty or under maintenance.
Tainting the whole node would block all pods, even those not using the bad device.
By tainting just the device, you:
Avoid disrupting other workloads.
Target only the pods that use the affected device.
NodeSelector
- To limit a pod to run on particular nodes based on their labels, we use `nodeSelector`
- The target node must first carry the matching label, which you can add with `kubectl label nodes <node-name> size=large`

pod-definition.yml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  nodeSelector:
    size: large
```
You cannot express advanced operations like `NOT` or `OR` with `nodeSelector`
NodeAffinity
- Primary function → ensure pods are hosted or scheduled on particular nodes
- It also provides advanced capabilities for matching expressions (e.g., `In`, `NotIn`, `Exists`)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: size
                operator: In
                values:
                  - Large
                  - Medium
```
NodeAffinity Types
The type of node affinity defines the scheduler's behavior with respect to node affinity at the two stages in the lifecycle of a Pod:
- `requiredDuringSchedulingIgnoredDuringExecution`
- `preferredDuringSchedulingIgnoredDuringExecution`

There are 2 stages in the lifecycle of a pod when considering affinity:
- During scheduling → the pod does not exist yet and is being created for the first time → the affinity rules determine which node the pod is placed on
    - required → if no node with the said label is found, the pod will not be scheduled
    - preferred → if no node with the said label is found → the scheduler ignores the affinity rules and places the pod on any available node
- During execution → the pod is already running and a change is made that affects node affinity (e.g., the label is removed from the node)
    - ignored → pods already running on the node continue to run; changes in node affinity do not impact them once scheduled
    - required → the pod would be evicted or terminated if the label is removed from the node
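As a sketch, the preferred variant carries a `weight` (1–100) that the scheduler adds to a node's score when the expression matches; the label key and values here are illustrative:

```yaml
# Hypothetical pod spec fragment: prefer (but don't require) nodes labeled size=Large
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80            # 1-100; added to the node's score when the rule matches
        preference:
          matchExpressions:
            - key: size
              operator: In
              values:
                - Large
```

If no node matches, the pod is still scheduled somewhere, which is what distinguishes `preferred` from `required`.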
Resource Limits
- In a three-node Kubernetes cluster, each node has a set amount of CPU and memory available
- Every pod requires some CPU and memory to run
- Whenever a pod is placed on a node, it consumes some of that node's resources
- When the scheduler places a pod, it compares the resources the pod requires with those available on each node to identify the best node for the pod
- If a node doesn't have sufficient resources, the scheduler avoids it and places the pod on a node with sufficient resources available
- If no node has sufficient resources, the pod stays in the Pending state (event: Insufficient cpu)
Resource Requests → the minimum amount of CPU and memory required by the pod; the scheduler uses these values to place the pod on a node
- To do this, add the following to your pod definition under `spec.containers`:

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: 2
```
- Here `cpu: 1` stands for 1 vCPU in AWS, 1 core in GCP or Azure, or 1 hyperthread; the lowest you can go is `1m` (0.001 CPU)
- For memory, note that `M` and `Mi` are different units: 256Mi is 268,435,456 bytes, i.e. about 268M
- 1 G → Gigabyte → 1,000,000,000 bytes
- 1 M → Megabyte → 1,000,000 bytes
- 1 K → Kilobyte → 1,000 bytes
- 1 Gi → Gibibyte → 1,073,741,824 bytes
- 1 Mi → Mebibyte → 1,048,576 bytes
- 1 Ki → Kibibyte → 1,024 bytes
Resource Limits → set an upper bound on resource consumption by a container
- To do this, add a `limits` section to the resources block:

```yaml
resources:
  limits:
    memory: "2Gi"
    cpu: 2
```
- Requests and limits are set per container in a pod
- The system throttles the CPU so that a container doesn't go beyond its limit
- This is not the case with memory → a container can use more memory than its limit, but if it does so persistently, the pod will be terminated with an OOM (Out Of Memory) error
Behaviour

CPU
- With no requests or limits set, one pod can consume all the CPU on a node and starve other pods of the resources they need
- If you specify limits but no requests → K8s automatically sets the requests equal to the limits
- If both requests and limits are set → each pod gets its requested amount and can burst up to its limit → but if a pod needs more than its limit while other pods are idle, it still cannot exceed its limit
- Ideal case → requests set with no limits, so spare CPU can be used when available → for this to work, make sure all pods have requests set
Memory
- No requests, no limits → one pod can eat up all the memory
- No requests, but limits set → requests = limits → pods get resources up to the limit and no more
- Requests and limits set → each pod gets its requested amount and can go up to its limit
- Requests but no limits → each pod gets a guaranteed amount with no upper bound → if one pod takes all the memory and another pod needs it, the only option is to kill the first pod (memory cannot be throttled like CPU)
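Putting requests and limits together, a full pod definition might look like the following sketch; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp              # illustrative name
spec:
  containers:
    - name: webapp
      image: nginx          # placeholder image
      resources:
        requests:           # guaranteed minimum, used by the scheduler for placement
          memory: "1Gi"
          cpu: 500m
        limits:             # upper bound: CPU is throttled, memory overuse leads to OOM kill
          memory: "2Gi"
          cpu: 1
```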
How do we ensure that every pod created has defaults set?
- Use a LimitRange → defines default values and allowed ranges for container requests and limits in a namespace
- If you create or change a LimitRange → it will not affect existing pods, only ones created afterwards
limit-range-cpu.yaml
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
    - default:            # default limit
        cpu: 500m
      defaultRequest:     # default request
        cpu: 500m
      max:
        cpu: "1"
      min:
        cpu: 100m
      type: Container
```
limit-range-memory.yaml
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-resource-constraint
spec:
  limits:
    - default:
        memory: 1Gi
      defaultRequest:
        memory: 1Gi
      max:
        memory: 1Gi
      min:
        memory: 500Mi
      type: Container
```
Is there any way to restrict the total amount of resources consumed by applications on a cluster?
- Use Resource Quotas at the namespace level
- They set hard limits on total requests and limits
resource-quota.yaml
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-resource-quota
spec:
  hard:
    requests.cpu: 4
    requests.memory: 4Gi
    limits.cpu: 10
    limits.memory: 10Gi
```
DaemonSets
- DaemonSets are like ReplicaSets in that they help deploy multiple instances of pods, but they run exactly one copy of your pod on each node in the cluster
- When a new node is added to the cluster, the DaemonSet creates a replica of the pod on that node; when the node is removed, the pod is removed automatically
- DaemonSets ensure that one copy of the pod is always present on each node in the cluster
- Use cases → monitoring agents, log collectors, kube-proxy
- The DaemonSet definition is exactly the same as a ReplicaSet's, except for `kind: DaemonSet`

```shell
kubectl get daemonsets
kubectl describe daemonsets <name>
```
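A minimal DaemonSet manifest might look like this sketch; the name and image are illustrative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-agent          # illustrative name
spec:
  selector:                       # must match the pod template's labels
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
        - name: monitoring-agent
          image: monitoring-agent # placeholder image
```

Note there is no `replicas` field: the number of pods is determined by the number of eligible nodes.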
How does it work?
- Since K8s v1.12, DaemonSets use the default scheduler and node affinity rules to schedule pods on nodes
- Before that, they set the `nodeName` property directly to bypass the scheduler and place pods on nodes
StaticPods
- The kubelet normally relies on the kube-apiserver for instructions on what pods to run on its node, based on decisions made by the kube-scheduler and stored in the etcd data store
- What if there are no other cluster components, just a standalone worker node that is not part of any cluster?
- The kubelet can manage a node independently (we have kubelet and Docker installed)
- The kubelet knows how to create pods, but there is no API server to provide pod details
- So how do you provide a pod-definition file to the kubelet without a kube-apiserver?
- You can configure the kubelet to read pod definition files from a directory on the server designated for this purpose, e.g. `/etc/kubernetes/manifests`
- The kubelet periodically checks this directory, reads the files, and creates the pods
- If the application crashes, the kubelet attempts to restart it
- If you remove a file from the directory, the pod is deleted automatically
- The kubelet works at the pod level and can only understand pods
- The directory can be any directory on the host; its location is passed to the kubelet either via the `--pod-manifest-path` option in the `kubelet.service` file, or via the `--config` option pointing to a kubelet config YAML file that sets `staticPodPath`
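As a sketch, the kubelet config file and a static pod manifest fit together like this; the directory shown is the conventional kubeadm default, and the pod is a placeholder:

```yaml
# Fragment of the kubelet config file (passed via --config):
#   staticPodPath: /etc/kubernetes/manifests
#
# A plain pod manifest dropped into that directory becomes a static pod:
apiVersion: v1
kind: Pod
metadata:
  name: static-nginx        # illustrative name
spec:
  containers:
    - name: nginx
      image: nginx
```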
- The `kubectl` utility won't work, as there is no API server for it to talk to
- You can check your pods via `docker ps`
Priority Classes
- These are non-namespaced objects; they are created outside of any namespace
- Once created, they can be attached to pods in any namespace
- We define priority using a number from −2,147,483,648 to 1,000,000,000 → a larger number indicates a higher priority
- This range is for applications or workloads deployed on the cluster
- There is a separate range for internal system-critical pods (1,000,000,000 to 2,000,000,000)
```shell
kubectl get priorityclass   # list existing priority classes
```

priority-class.yaml

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000000
description: "Priority class for mission critical pods"
```
- Once created, we can associate this priority class with a pod by adding `priorityClassName: high-priority` to its spec
- If we don't specify a `priorityClassName`, the pod's priority is assumed to be 0 (the default)
- To change the default, add `globalDefault: true` to your priority class definition
- `globalDefault` can only be set on a single priority class in your cluster

Effects of Pod Priority
- Higher-priority pods are given resources first; whatever is left goes to the lower-priority ones
- If no node has resources left and a new higher-priority pod arrives, where does it go? → that behaviour is defined in your priority class definition under `preemptionPolicy`
- Its default value is `PreemptLowerPriority` → evict an existing lower-priority pod and take its place
- `Never` → wait for resources to free up instead of evicting
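Sketching both pieces together (the names and value are illustrative): a non-preempting priority class, and a pod that references it:

```yaml
# A priority class that waits for capacity instead of evicting other pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 900000
preemptionPolicy: Never               # default is PreemptLowerPriority
globalDefault: false
description: "High priority, but never evicts other pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-job                     # illustrative pod
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
    - name: worker
      image: busybox                  # placeholder image
```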
Multiple Schedulers
- When creating a pod or a deployment, you can instruct the K8s cluster to have it scheduled by a specific scheduler
- All schedulers must have different names
- The default scheduler is named `default-scheduler`

scheduler-config.yaml

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler
# if multiple copies of the same scheduler run on different nodes, only one can be active at a time
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: lock-object-my-scheduler
```
Deploy an Additional Scheduler

```shell
wget <kube-scheduler-bin>
```

my-scheduler.service

```
ExecStart=/usr/local/bin/kube-scheduler \
  --config=/etc/kubernetes/config/my-scheduler-config.yaml
```
Deploy an Additional Scheduler as a Pod

custom-scheduler.yaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: k8s.gcr.io/kube-scheduler-amd64:v1.11.3
      command:
        - kube-scheduler
        - --address=127.0.0.1
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --config=/etc/kubernetes/config/my-scheduler-config.yaml
```
- To have a pod use this scheduler, mention `schedulerName: my-custom-scheduler` under `spec` in the pod definition
- If the scheduler is not configured correctly, the pod will remain in the Pending state
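A sketch of a pod that asks for the custom scheduler by name; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx                          # illustrative name
spec:
  schedulerName: my-custom-scheduler   # must match a running scheduler's profile name
  containers:
    - name: nginx
      image: nginx
```

If no scheduler with that name is running, this pod simply stays Pending, which is a handy way to verify the custom scheduler is actually picking up work.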
Configuring Scheduler Profiles
- When a pod is created, it enters a scheduling queue along with other pending pods
- If a pod requires 10 CPU, it will only be scheduled on a node with at least 10 CPU available
- Pods with higher priority are placed at the front of the queue
Scheduling Phases
After being queued, pods progress through several phases:
Filter Phase: Nodes that cannot meet the pod's resource requirements (e.g., nodes lacking 10 CPUs) are filtered out.
Scoring Phase: Remaining nodes are scored based on resource availability after reserving the required CPU. For example, a node with 6 CPUs left scores higher than one with only 2.
Binding Phase: The pod is assigned to the node with the highest score.
Scheduling Plugins
- PrioritySort plugin → sorts pods in the scheduling queue according to priority
- NodeResourcesFit plugin → filters out nodes that do not have the needed resources
- NodeName plugin → checks for a specific node name in the pod specification and filters nodes accordingly
- NodeUnschedulable plugin → excludes nodes marked as unschedulable (commands like `drain` or `cordon` set the unschedulable flag)
- Scoring plugins → during the scoring phase, plugins (such as NodeResourcesFit and ImageLocality) assess each node's suitability; they assign scores rather than outright rejecting nodes
- DefaultBinder plugin → finalizes the scheduling process by binding the pod to the selected node
Rather than running separate scheduler binaries for separate schedulers, Kubernetes 1.18 introduced support for multiple scheduling profiles within a single scheduler binary
Profile Config

```yaml
# my-scheduler-2-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler-2
  - schedulerName: my-scheduler-3
```

```yaml
# my-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler
```

```yaml
# scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
```
Each profile has many options for enabling and disabling plugins:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler-2
    plugins:
      score:
        disabled:
          - name: TaintToleration
        enabled:
          - name: MyCustomPluginA
          - name: MyCustomPluginB
  - schedulerName: my-scheduler-3
    plugins:
      preScore:
        disabled:
          - name: '*'
      score:
        disabled:
          - name: '*'
  - schedulerName: my-scheduler-4
```
Admission Controllers
- Every request we make via the `kubectl` utility goes through the API server
- Every time a request hits the API server, it performs authentication, usually through certificates → this checks that only authorized users are making requests
- The request then goes through the authorization process, which checks whether the current user has permission to perform the task, via RBAC
- You can place different kinds of restrictions this way, but they are mostly at the Kubernetes API level and no deeper
- E.g., in a pod config file you might want to check that the image comes from an approved registry, or that the `latest` tag is never used for any image → this can be done via admission controllers

To view enabled admission controllers:

```shell
kube-apiserver -h | grep enable-admission-plugins
```

- For a kubeadm setup, run this inside the kube-apiserver control-plane pod → you will see a list of admission controllers that are enabled by default
- To modify the list, add `--enable-admission-plugins=<your-plugins>` in `kube-apiserver.service`, or in the static pod YAML if it runs as a pod
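For a kubeadm cluster, that means editing the static pod manifest; the path below is the kubeadm default, and the plugin list is illustrative:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment)
spec:
  containers:
    - command:
        - kube-apiserver
        - --enable-admission-plugins=NodeRestriction,NamespaceAutoProvision  # illustrative list
        # ...the other existing flags stay as they are
```

Since this is a static pod, the kubelet restarts the API server automatically when the file changes.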
Validating and Mutating Admission Controllers
- `DefaultStorageClass` is enabled by default → if a PersistentVolumeClaim doesn't specify a `storageClassName`, it sets one automatically → this is known as a Mutating Admission Controller
- A mutating admission controller mutates (changes) the object itself before it is created
- A validating admission controller validates a request and allows or denies it
- Generally, mutating admission controllers are invoked first, followed by validating ones → so that any changes made by mutating controllers can be validated before the object is created
Logging & Monitoring
Monitor Cluster Components
tracking metrics at both the node and pod level
For node, monitor
total number of nodes in a cluster
Health status of each node
Performance metrics such as CPU , memory , network and disk utilization
for pods, monitor
number of running pods
CPU and memory consumption for every pod
K8s doesn't have a built-in monitoring solution, so external tools are used
Popular Open Source monitoring solution
Metrics Server
Prometheus
Elastic Stack
Metrics Server → deployed once per K8s cluster
- It collects metrics from nodes and pods, aggregates them, and keeps them in memory
- Since it stores data only in memory, it doesn't support historical performance data
- For long-term metrics → use a more advanced monitoring solution
- Within the kubelet, an integrated component called cAdvisor (Container Advisor) is responsible for collecting performance metrics from running pods
- These metrics are exposed through the kubelet API and retrieved by the Metrics Server
- Once the Metrics Server is active, you can check resource consumption via `kubectl top node` and `kubectl top pod`
Managing application logs
- Docker containers typically log events to standard output
- If a container runs in detached mode, use `docker logs -f <container_id>`
- In K8s, use `kubectl logs -f <pod-name>`
- Since K8s allows multiple containers within a pod, viewing logs without specifying a container will result in an error when the pod has more than one. Specify the container name to view its logs:

```shell
kubectl logs -f <pod-name> <container-name>
```
Written by MRIDUL TIWARI