Learning Kubernetes: Week 2 – Scheduling, Monitoring & Workload Management


Hi! This week was full of surprises, so I couldn't get through many core concepts. Instead, I covered topics that help manage Kubernetes resources.
Scheduling
Manual Scheduling
- Every pod definition has a `nodeName` field, which is unset by default; Kubernetes sets it automatically during scheduling, so you don't normally set it yourself
- The scheduler looks through all the pods for those that do not have this property set → those are the candidates for scheduling
- Once a candidate is identified, the scheduler schedules the pod on a node by setting the `nodeName` property to the name of that node, via a Binding object
What happens when there is no scheduler to monitor and schedule the pods?
- Pods remain in the Pending state
- You can manually set the `nodeName` field to schedule a pod on a node
- You can only specify `nodeName` at pod creation time
What if the pod is already created and you want to assign it to a different node?
- K8s won't allow you to modify the `nodeName` property of a running pod
- Instead, you can create a Binding object and send a request to set the `nodeName` field

pod-bind-definition.yml

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: <nodeName>
```
Create the JSON equivalent of this file and send the request like this:

```shell
curl --header "Content-Type:application/json" --request POST \
  --data '{"apiVersion":"v1",.....}' \
  http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
```
Labels and selectors
When creating objects in K8s, we might end up with hundreds of objects, so it becomes necessary to filter them by type, application, or functionality
You can group and select objects via Labels and Selectors
For each object, attach labels as per your needs in the metadata field
While selecting, specify a condition to filter specific objects
pod-definition.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
  labels:
    app: App1
    function: Front-end
spec:
  containers:
    - name: simple-webapp
      image: simple-webapp
      ports:
        - containerPort: 8080
```
```shell
# to select pods via labels
kubectl get pods --selector app=App1
```
K8s objects use labels and selectors internally to connect different objects
E.g.: In Replicaset
replicaset.yaml
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  replicas: 3
  selector:           # identifies which pods belong to this ReplicaSet, since it can also adopt pods not created by this file
    matchLabels:
      type: front-pod
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-pod
    spec:
      containers:
        - name: nginx-container
          image: nginx
```
Annotations
- These are used to record other details for informational purposes, e.g., build versions, contact details, or other metadata used by tools
Management of K8s Objects
- K8s objects should be managed using only one technique. Mixing and matching techniques for the same object results in undefined behavior.
| Management technique | Operates on | Recommended environment | Supported writers | Learning curve |
| --- | --- | --- | --- | --- |
| Imperative commands | Live objects | Development projects | 1+ | Lowest |
| Imperative object configuration | Individual files | Production projects | 1 | Moderate |
| Declarative object configuration | Directories of files | Production projects | 1+ | Highest |
Imperative Command
- A user operates directly on live objects in a cluster
- The user provides operations to the `kubectl` command as arguments or flags

Advantages
- Commands are expressed as a single action word
- Commands require only a single step to make changes to the cluster
Disadvantages
Commands do not integrate with the change review process
Commands do not provide an audit trail associated with changes
Commands do not provide a source of records except for what is live
Commands do not provide a template for creating new objects
Imperative Object Config
- Here the `kubectl` command specifies the operation, optional flags, and at least one file name
- The file specified must contain a full definition of the object in YAML or JSON format
- The `replace` command replaces the existing spec with the newly provided one, dropping all changes to the object that are missing from the configuration file. This approach should not be used with resource types whose specs are updated independently of the configuration file. Services of type `LoadBalancer`, for example, have their `externalIPs` field updated by the cluster, independently of the configuration.

Eg:

```shell
kubectl create -f nginx.yaml
```
Advantages
Object configuration can be stored in a source control system such as Git.
Object configuration can integrate with processes such as reviewing changes before push and audit trails.
Object configuration provides a template for creating new objects.
Disadvantages
Object configuration requires a basic understanding of the object schema.
Object configuration requires the additional step of writing a YAML file.
Declarative Object Config
- The user operates on object configuration files stored locally
- Create, update, and delete operations are automatically detected per object; the user doesn't specify the operation
- It uses the `patch` API operation to write only the observed differences, instead of using the `replace` API operation to replace the entire object configuration

Eg:

```shell
kubectl diff -R -f configs/
kubectl apply -R -f configs/   # -R → recursively process directories
```
Advantages
Changes made directly to live objects are retained, even if they are not merged back into the configuration files.
Declarative object configuration has better support for operating on directories and automatically detecting operation types (create, patch, delete) per-object.
Disadvantages
Declarative object configuration is harder to debug and understand results when they are unexpected.
Partial updates using diffs create complex merge and patch operations
Taints and Tolerations
- These are used as restrictions on what pods can be scheduled on a node
- When pods are created, the K8s scheduler tries to place them on the available worker nodes
- Say we want Node 1 (out of 3) reserved for specific pods → we first add a taint (e.g., `blue`) on that node
- This means that, unless specified otherwise, none of the pods can tolerate the taint → so none of them will be placed on Node 1
- To allow, say, pod D (out of A, B, C, D) to be placed on Node 1 → we apply a toleration on pod D → D is now tolerant to the taint `blue`
- Now, when the scheduler tries to place pod D on Node 1 → it goes through
How to set these?

Syntax:

```shell
kubectl taint nodes node-name key=value:taint-effect
```

The `taint-effect` defines what happens to pods that do not tolerate the taint. There are 3 taint effects:
- `NoSchedule` → Strict: no new pods are scheduled on the node unless they have a matching toleration. Existing pods are not affected.
- `PreferNoSchedule` → Soft: Kubernetes tries to avoid scheduling pods on the node, but it's not enforced.
- `NoExecute` → Evicts existing pods that don't tolerate the taint, and also prevents new pods from scheduling unless they tolerate it.
NoExecute behavior with `tolerationSeconds`:

| tolerationSeconds | Behavior |
| --- | --- |
| Not set | Pod stays on the node forever. |
| Set to a number | Pod is allowed to stay for that number of seconds, then it's evicted. |
| No matching toleration | Pod is evicted immediately. |
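As a sketch, here is what a toleration with `tolerationSeconds` could look like in a pod spec; the key, value, and duration are illustrative:

```yaml
# Hypothetical pod spec fragment: tolerate the NoExecute taint app=blue,
# but only for 300 seconds before the pod is evicted
tolerations:
  - key: app
    operator: "Equal"
    value: "blue"
    effect: "NoExecute"
    tolerationSeconds: 300
```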
eg:

```shell
kubectl taint nodes node1 app=blue:NoSchedule
```
Adding a toleration to a pod

pod-definition.yml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: nginx-container
      image: nginx
  tolerations:
    - key: app
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
```
- Of course, this doesn't guarantee that certain pods land on certain nodes; it only controls which pods a node accepts
- The scheduler doesn't schedule any pods on the master node → because of a taint applied to it by default

```shell
kubectl describe node <node-name> | grep Taint
```
When you define a toleration, you can use an operator. The two possible operators are `Equal` (the default) and `Exists`. If you don't specify the operator explicitly, it defaults to `Equal`.
A toleration matches a taint if:
- The `key` is the same
- The `effect` is the same (e.g., `NoSchedule`, `PreferNoSchedule`, or `NoExecute`)
- One of the following is true:
    - The operator is `Exists` → in this case, no `value` should be specified
    - The operator is `Equal` (the default) → and the values must match
If the toleration has an empty `key`:
- The operator must be `Exists`
- This means it can match taints with any key (wildcard behavior), but the effect still needs to match

If the toleration has an empty `effect`:
- It can match taints with any effect, but only those with the matching `key`
- The default Kubernetes scheduler takes taints and tolerations into account when selecting a node to run a particular Pod. However, if you manually specify `.spec.nodeName` for a Pod, that bypasses the scheduler; the Pod is bound onto the node you assigned, even if there are `NoSchedule` taints on that node.
- `NoExecute` taints still apply, though: the kubelet will evict the Pod unless an appropriate toleration is set.
- You can put multiple taints on the same node and multiple tolerations on the same pod.
- Kubernetes first processes all of a node's taints as filters, then ignores the ones for which the pod has a matching toleration.
The built-in taints
- `node.kubernetes.io/not-ready` → Node is not ready; corresponds to the NodeCondition `Ready` being `false`
- `node.kubernetes.io/unreachable` → Node is unreachable; NodeCondition `Ready` is `Unknown`
- `node.kubernetes.io/memory-pressure` → Node has memory pressure
- `node.kubernetes.io/disk-pressure` → Node has disk pressure
- `node.kubernetes.io/pid-pressure` → Node has PID pressure
- `node.kubernetes.io/network-unavailable` → Node's network is unavailable
- `node.kubernetes.io/unschedulable` → Node is unschedulable
- `node.cloudprovider.kubernetes.io/uninitialized` → When the kubelet is started with an "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.
DaemonSet pods are created with `NoExecute` tolerations for the following taints, with no `tolerationSeconds`:
- `node.kubernetes.io/unreachable`
- `node.kubernetes.io/not-ready`

This ensures that DaemonSet pods are never evicted due to these problems.

The DaemonSet controller also automatically adds the following `NoSchedule` tolerations to all daemons, to prevent DaemonSets from breaking:
- `node.kubernetes.io/memory-pressure`
- `node.kubernetes.io/disk-pressure`
- `node.kubernetes.io/pid-pressure` (1.14 or later)
- `node.kubernetes.io/unschedulable` (1.10 or later)
- `node.kubernetes.io/network-unavailable` (host network only)
The scheduler checks taints, not node conditions, when it makes scheduling decisions. This ensures that node conditions don't directly affect scheduling.
Sometimes, only one device on a node is faulty or under maintenance.
Tainting the whole node would block all pods, even those not using the bad device.
By tainting just the device, you:
Avoid disrupting other workloads.
Target only the pods that use the affected device.
NodeSelector
- To limit a pod to run on particular nodes based on their labels, we use `nodeSelector`
- The target node must first carry the matching label, which you can add with `kubectl label nodes <node-name> size=large`

pod-definition.yml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  nodeSelector:
    size: large
```
You cannot express advanced operations like `NOT` or `OR` with `nodeSelector`
NodeAffinity
- Primary function → ensure pods are hosted or scheduled on particular nodes
- It also provides advanced capabilities for matching expressions (e.g., `In`, `NotIn`, `Exists`)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: size
                operator: In
                values:
                  - Large
                  - Medium
```
NodeAffinity Types
The type of node affinity defines the scheduler's behavior with respect to node affinity at the two stages in the lifecycle of a Pod:
- `requiredDuringSchedulingIgnoredDuringExecution`
- `preferredDuringSchedulingIgnoredDuringExecution`

There are 2 stages in the lifecycle of a pod when considering affinity:
- During scheduling → the pod does not exist yet and is being created for the first time → the affinity rules determine which node the pod is placed on
    - required → if no node with the said label is found, the pod will not be scheduled
    - preferred → if no node with the said label is found → the scheduler ignores the affinity rules and places the pod on any available node
- During execution → the pod is already running and a change is made that affects node affinity (e.g., the label is removed from the node)
    - ignored → pods already running on the node continue to run; changes in node affinity do not impact them once scheduled
    - required → the pod would be evicted or terminated if the label is removed from the node
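As a sketch, the preferred variant carries a `weight` (1–100) that the scheduler adds to a node's score when the expression matches; the label key and values here are illustrative:

```yaml
# Hypothetical pod spec fragment: prefer (but don't require) nodes labeled size=Large
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80            # 1-100; added to the node's score when the rule matches
        preference:
          matchExpressions:
            - key: size
              operator: In
              values:
                - Large
```

If no node matches, the pod is still scheduled somewhere, which is what distinguishes `preferred` from `required`.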
Resource Limits
- In a three-node Kubernetes cluster, each node has a set amount of CPU and memory available
- Every pod requires some CPU and memory to run
- Whenever a pod is placed on a node, it consumes some of that node's resources
- When the scheduler places a pod, it compares the resources the pod requires with those available on each node to identify the best node for the pod
- If a node doesn't have sufficient resources, the scheduler avoids it and places the pod on a node with sufficient resources available
- If no node has sufficient resources, the pod stays in the Pending state (event: Insufficient cpu)
Resource Requests → the minimum amount of CPU and memory required by the pod; the scheduler uses these values to place the pod on a node
- To do this, add the following to your pod definition under `spec.containers`:

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: 2
```
- Here `cpu: 1` stands for 1 vCPU in AWS, 1 core in GCP or Azure, or 1 hyperthread; the lowest you can go is `1m` (0.001 CPU)
- For memory, note that `M` and `Mi` are different units: 256Mi is 268,435,456 bytes, i.e. about 268M
- 1 G → Gigabyte → 1,000,000,000 bytes
- 1 M → Megabyte → 1,000,000 bytes
- 1 K → Kilobyte → 1,000 bytes
- 1 Gi → Gibibyte → 1,073,741,824 bytes
- 1 Mi → Mebibyte → 1,048,576 bytes
- 1 Ki → Kibibyte → 1,024 bytes
Resource Limits → set an upper bound on resource consumption by a container
- To do this, add a `limits` section to the resources block:

```yaml
resources:
  limits:
    memory: "2Gi"
    cpu: 2
```
- Requests and limits are set per container in a pod
- The system throttles the CPU so that a container doesn't go beyond its limit
- This is not the case with memory → a container can use more memory than its limit, but if it does so persistently, the pod will be terminated with an OOM (Out Of Memory) error
Behaviour

CPU
- With no requests or limits set, one pod can consume all the CPU on a node and starve other pods of the resources they need
- If you specify limits but no requests → K8s automatically sets the requests equal to the limits
- If both requests and limits are set → each pod gets its requested amount and can burst up to its limit → but if a pod needs more than its limit while other pods are idle, it still cannot exceed its limit
- Ideal case → requests set with no limits, so spare CPU can be used when available → for this to work, make sure all pods have requests set
Memory
- No requests, no limits → one pod can eat up all the memory
- No requests, but limits set → requests = limits → pods get resources up to the limit and no more
- Requests and limits set → each pod gets its requested amount and can go up to its limit
- Requests but no limits → each pod gets a guaranteed amount with no upper bound → if one pod takes all the memory and another pod needs it, the only option is to kill the first pod (memory cannot be throttled like CPU)
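Putting requests and limits together, a full pod definition might look like the following sketch; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp              # illustrative name
spec:
  containers:
    - name: webapp
      image: nginx          # placeholder image
      resources:
        requests:           # guaranteed minimum, used by the scheduler for placement
          memory: "1Gi"
          cpu: 500m
        limits:             # upper bound: CPU is throttled, memory overuse leads to OOM kill
          memory: "2Gi"
          cpu: 1
```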
How do we ensure that every pod created has defaults set?
- Use a LimitRange → defines default values and allowed ranges for container requests and limits in a namespace
- If you create or change a LimitRange → it will not affect existing pods, only ones created afterwards
limit-range-cpu.yaml
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
    - default:            # default limit
        cpu: 500m
      defaultRequest:     # default request
        cpu: 500m
      max:
        cpu: "1"
      min:
        cpu: 100m
      type: Container
```
limit-range-memory.yaml
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-resource-constraint
spec:
  limits:
    - default:
        memory: 1Gi
      defaultRequest:
        memory: 1Gi
      max:
        memory: 1Gi
      min:
        memory: 500Mi
      type: Container
```
Is there any way to restrict the total amount of resources consumed by applications on a cluster?
- Use Resource Quotas at the namespace level
- They set hard limits on total requests and limits
resource-quota.yaml
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-resource-quota
spec:
  hard:
    requests.cpu: 4
    requests.memory: 4Gi
    limits.cpu: 10
    limits.memory: 10Gi
```
DaemonSets
- DaemonSets are like ReplicaSets in that they help deploy multiple instances of pods, but they run exactly one copy of your pod on each node in the cluster
- When a new node is added to the cluster, the DaemonSet creates a replica of the pod on that node; when the node is removed, the pod is removed automatically
- DaemonSets ensure that one copy of the pod is always present on each node in the cluster
- Use cases → monitoring agents, log collectors, kube-proxy
- The DaemonSet definition is exactly the same as a ReplicaSet's, except for `kind: DaemonSet`

```shell
kubectl get daemonsets
kubectl describe daemonsets <name>
```
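A minimal DaemonSet manifest might look like this sketch; the name and image are illustrative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-agent          # illustrative name
spec:
  selector:                       # must match the pod template's labels
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
        - name: monitoring-agent
          image: monitoring-agent # placeholder image
```

Note there is no `replicas` field: the number of pods is determined by the number of eligible nodes.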
How does it work?
- Since K8s v1.12, DaemonSets use the default scheduler and node affinity rules to schedule pods on nodes
- Before that, they set the `nodeName` property directly to bypass the scheduler and place pods on nodes
StaticPods
- The kubelet normally relies on the kube-apiserver for instructions on what pods to run on its node, based on decisions made by the kube-scheduler and stored in the etcd data store
- What if there are no other cluster components, just a standalone worker node that is not part of any cluster?
- The kubelet can manage a node independently (we have kubelet and Docker installed)
- The kubelet knows how to create pods, but there is no API server to provide pod details
- So how do you provide a pod-definition file to the kubelet without a kube-apiserver?
- You can configure the kubelet to read pod definition files from a directory on the server designated for this purpose, e.g. `/etc/kubernetes/manifests`
- The kubelet periodically checks this directory, reads the files, and creates the pods
- If the application crashes, the kubelet attempts to restart it
- If you remove a file from the directory, the pod is deleted automatically
- The kubelet works at the pod level and can only understand pods
- The directory can be any directory on the host; its location is passed to the kubelet either via the `--pod-manifest-path` option in the `kubelet.service` file, or via the `--config` option pointing to a kubelet config YAML file that sets `staticPodPath`
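As a sketch, the kubelet config file and a static pod manifest fit together like this; the directory shown is the conventional kubeadm default, and the pod is a placeholder:

```yaml
# Fragment of the kubelet config file (passed via --config):
#   staticPodPath: /etc/kubernetes/manifests
#
# A plain pod manifest dropped into that directory becomes a static pod:
apiVersion: v1
kind: Pod
metadata:
  name: static-nginx        # illustrative name
spec:
  containers:
    - name: nginx
      image: nginx
```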
- The `kubectl` utility won't work, as there is no API server for it to talk to
- You can check your pods via `docker ps`
Priority Classes
- These are non-namespaced objects; they are created outside of any namespace
- Once created, they can be attached to pods in any namespace
- We define priority using a number from −2,147,483,648 to 1,000,000,000 → a larger number indicates a higher priority
- This range is for applications or workloads deployed on the cluster
- There is a separate range for internal system-critical pods (1,000,000,000 to 2,000,000,000)
```shell
kubectl get priorityclass   # list existing priority classes
```

priority-class.yaml

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000000
description: "Priority class for mission critical pods"
```
- Once created, we can associate this priority class with a pod by adding `priorityClassName: high-priority` to its spec
- If we don't specify a `priorityClassName`, the pod's priority is assumed to be 0 (the default)
- To change the default, add `globalDefault: true` to your priority class definition
- `globalDefault` can only be set on a single priority class in your cluster

Effects of Pod Priority
- Higher-priority pods are given resources first; whatever is left goes to the lower-priority ones
- If no node has resources left and a new higher-priority pod arrives, where does it go? → that behaviour is defined in your priority class definition under `preemptionPolicy`
- Its default value is `PreemptLowerPriority` → evict an existing lower-priority pod and take its place
- `Never` → wait for resources to free up instead of evicting
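Sketching both pieces together (the names and value are illustrative): a non-preempting priority class, and a pod that references it:

```yaml
# A priority class that waits for capacity instead of evicting other pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 900000
preemptionPolicy: Never               # default is PreemptLowerPriority
globalDefault: false
description: "High priority, but never evicts other pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-job                     # illustrative pod
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
    - name: worker
      image: busybox                  # placeholder image
```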
Multiple Schedulers
- When creating a pod or a deployment, you can instruct the K8s cluster to have it scheduled by a specific scheduler
- All schedulers must have different names
- The default scheduler is named `default-scheduler`

scheduler-config.yaml

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler
# if multiple copies of the same scheduler run on different nodes, only one can be active at a time
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: lock-object-my-scheduler
```
Deploy an Additional Scheduler

```shell
wget <kube-scheduler-bin>
```

my-scheduler.service

```
ExecStart=/usr/local/bin/kube-scheduler \
  --config=/etc/kubernetes/config/my-scheduler-config.yaml
```
Deploy an Additional Scheduler as a Pod

custom-scheduler.yaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: k8s.gcr.io/kube-scheduler-amd64:v1.11.3
      command:
        - kube-scheduler
        - --address=127.0.0.1
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --config=/etc/kubernetes/config/my-scheduler-config.yaml
```
- To have a pod use this scheduler, mention `schedulerName: my-custom-scheduler` under `spec` in the pod definition
- If the scheduler is not configured correctly, the pod will remain in the Pending state
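A sketch of a pod that asks for the custom scheduler by name; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx                          # illustrative name
spec:
  schedulerName: my-custom-scheduler   # must match a running scheduler's profile name
  containers:
    - name: nginx
      image: nginx
```

If no scheduler with that name is running, this pod simply stays Pending, which is a handy way to verify the custom scheduler is actually picking up work.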
Configuring Scheduler Profiles
- When a pod is created, it enters a scheduling queue along with other pending pods
- If a pod requires 10 CPU, it will only be scheduled on a node with at least 10 CPU available
- Pods with higher priority are placed at the front of the queue
Scheduling Phases
After being queued, pods progress through several phases:
Filter Phase: Nodes that cannot meet the pod's resource requirements (e.g., nodes lacking 10 CPUs) are filtered out.
Scoring Phase: Remaining nodes are scored based on resource availability after reserving the required CPU. For example, a node with 6 CPUs left scores higher than one with only 2.
Binding Phase: The pod is assigned to the node with the highest score.
Scheduling Plugins
- PrioritySort plugin → sorts pods in the scheduling queue according to priority
- NodeResourcesFit plugin → filters out nodes that do not have the needed resources
- NodeName plugin → checks for a specific node name in the pod specification and filters nodes accordingly
- NodeUnschedulable plugin → excludes nodes marked as unschedulable (commands like `drain` or `cordon` set the unschedulable flag)
- Scoring plugins → during the scoring phase, plugins (such as NodeResourcesFit and ImageLocality) assess each node's suitability; they assign scores rather than outright rejecting nodes
- DefaultBinder plugin → finalizes the scheduling process by binding the pod to the selected node
Rather than running separate scheduler binaries for separate schedulers, Kubernetes 1.18 introduced support for multiple scheduling profiles within a single scheduler binary
Profile Config

```yaml
# my-scheduler-2-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler-2
  - schedulerName: my-scheduler-3
```

```yaml
# my-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler
```

```yaml
# scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
```
Each profile has many options for enabling and disabling plugins:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler-2
    plugins:
      score:
        disabled:
          - name: TaintToleration
        enabled:
          - name: MyCustomPluginA
          - name: MyCustomPluginB
  - schedulerName: my-scheduler-3
    plugins:
      preScore:
        disabled:
          - name: '*'
      score:
        disabled:
          - name: '*'
  - schedulerName: my-scheduler-4
```
Admission Controllers
- Every request we make via the `kubectl` utility goes through the API server
- Every time a request hits the API server, it performs authentication, usually through certificates → this checks that only authorized users are making requests
- The request then goes through the authorization process, which checks whether the current user has permission to perform the task, via RBAC
- You can place different kinds of restrictions this way, but they are mostly at the Kubernetes API level and no deeper
- E.g., in a pod config file you might want to check that the image comes from an approved registry, or that the `latest` tag is never used for any image → this can be done via admission controllers

To view enabled admission controllers:

```shell
kube-apiserver -h | grep enable-admission-plugins
```

- For a kubeadm setup, run this inside the kube-apiserver control-plane pod → you will see a list of admission controllers that are enabled by default
- To modify the list, add `--enable-admission-plugins=<your-plugins>` in `kube-apiserver.service`, or in the static pod YAML if it runs as a pod
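For a kubeadm cluster, that means editing the static pod manifest; the path below is the kubeadm default, and the plugin list is illustrative:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment)
spec:
  containers:
    - command:
        - kube-apiserver
        - --enable-admission-plugins=NodeRestriction,NamespaceAutoProvision  # illustrative list
        # ...the other existing flags stay as they are
```

Since this is a static pod, the kubelet restarts the API server automatically when the file changes.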
Validating and Mutating Admission Controllers
- `DefaultStorageClass` is enabled by default → if a PersistentVolumeClaim doesn't specify a `storageClassName`, it sets one automatically → this is known as a Mutating Admission Controller
- A mutating admission controller mutates (changes) the object itself before it is created
- A validating admission controller validates a request and allows or denies it
- Generally, mutating admission controllers are invoked first, followed by validating ones → so that any changes made by mutating controllers can be validated before the object is created
Logging & Monitoring
Monitor Cluster Components
tracking metrics at both the node and pod level
For node, monitor
total number of nodes in a cluster
Health status of each node
Performance metrics such as CPU , memory , network and disk utilization
for pods, monitor
number of running pods
CPU and memory consumption for every pod
K8s doesn't have a built-in monitoring solution, so external tools are used
Popular Open Source monitoring solution
Metrics Server
Prometheus
Elastic Stack
Metrics Server → deployed once per K8s cluster
- It collects metrics from nodes and pods, aggregates them, and keeps them in memory
- Since it stores data only in memory, it doesn't support historical performance data
- For long-term metrics → use a more advanced monitoring solution
- Within the kubelet, an integrated component called cAdvisor (Container Advisor) is responsible for collecting performance metrics from running pods
- These metrics are exposed through the kubelet API and retrieved by the Metrics Server
- Once the Metrics Server is active, you can check resource consumption via `kubectl top node` and `kubectl top pod`
Managing application logs
- Docker containers typically log events to standard output
- If a container runs in detached mode, use `docker logs -f <container_id>`
- In K8s, use `kubectl logs -f <pod-name>`
- Since K8s allows multiple containers within a pod, viewing logs without specifying a container will result in an error when the pod has more than one. Specify the container name to view its logs:

```shell
kubectl logs -f <pod-name> <container-name>
```
Written by MRIDUL TIWARI