Demystifying Kubernetes Scheduling and Pod Placement


As Kubernetes becomes the de facto standard for container orchestration, understanding how the scheduler makes placement decisions is essential. In this article, we walk through the key components that influence pod scheduling and placement, including labels, selectors, quotas, taints, topology rules, and more. Whether you're just getting started or brushing up on your fundamentals, this guide breaks down the concepts the way a Kubernetes expert would explain them to a beginner.
1. Kubernetes Scheduling Fundamentals
The scheduler is a control plane component responsible for selecting the most suitable node for a pod. Its decision-making involves a multi-step pipeline:
- Queue: Pods awaiting placement sit in a scheduling queue.
- Filter: Nodes that don't meet basic requirements (insufficient resources, taint conflicts) are filtered out.
- Score: Remaining nodes are scored based on policies (like resource balance).
- Bind: The scheduler records the chosen node in the pod's spec.
This pipeline is foundational, and understanding it is key to advanced scheduling decisions.
Note: The scheduler operates independently of namespaces. That is, while namespaces logically separate resources, the scheduler doesn't consider namespace as part of its decision-making.
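To see the end result of this pipeline, you can submit a minimal pod with no placement hints and inspect it after binding; the scheduler fills in `spec.nodeName`. This is a sketch (the pod name and image are illustrative):

```yaml
# A minimal pod with no placement hints; the default scheduler
# runs it through the queue -> filter -> score -> bind pipeline.
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: nginx:1.25   # example image
```

After binding, `kubectl get pod demo -o jsonpath='{.spec.nodeName}'` shows which node the scheduler chose.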
2. Labels, Selectors, and Annotations
Labels and Selectors:
Labels are key-value pairs attached to objects like pods or nodes. They're used to organize and select subsets of objects.
- Equality-based selectors:

```yaml
matchLabels:
  app: frontend
```

This selects objects where `app=frontend`.
- Set-based selectors:

```yaml
matchExpressions:
- key: env
  operator: In
  values:
  - prod
  - staging
```

This selects objects where `env` is either `prod` or `staging`.
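To show where these selectors live in practice, here is a sketch of a Deployment whose selector must match its pod template's labels (the names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend        # equality-based selector
  template:
    metadata:
      labels:
        app: frontend      # must match the selector above
        env: prod
    spec:
      containers:
      - name: web
        image: nginx:1.25  # example image
```

If the selector and the template labels disagree, the API server rejects the Deployment.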
Annotations:
Labels are used to identify and select objects. Annotations, in contrast, are not used for selection: they attach arbitrary non-identifying metadata to objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted in labels. The same object can carry both labels and annotations. Here are some common real-world uses of annotations:
1. Tracking the Last Applied Configuration

```yaml
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: <json blob>
```

Used by `kubectl apply` to record the last applied configuration, which enables intelligent merges and diffs.
2. Adding External Documentation Links

```yaml
metadata:
  annotations:
    documentation-url: "https://internal.docs.company.com/my-app"
```

Useful in enterprises where each service has a Confluence or documentation page.
3. Telling the Ingress Controller How to Handle Requests

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
```

Ingress controllers like NGINX or Traefik rely heavily on annotations to define behaviors such as SSL redirects and path rewrites.
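For context, here is a sketch of a complete Ingress object carrying that annotation (the host and service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: app.example.com      # illustrative host
    http:
      paths:
      - path: /my-app
        pathType: Prefix
        backend:
          service:
            name: my-app       # illustrative backend service
            port:
              number: 80
```

With the rewrite annotation, a request for `/my-app` reaches the backend as `/`.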
4. Service Mesh Integration (e.g., Istio)

```yaml
metadata:
  annotations:
    sidecar.istio.io/inject: "true"
```

Tells Istio to automatically inject its Envoy sidecar into the pod.
5. Custom Monitoring Tags

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
```

Prometheus (with the appropriate scrape configuration) uses these to decide whether and where to scrape metrics from a pod or service.
6. Backup Instructions for Velero

```yaml
metadata:
  annotations:
    backup.velero.io/backup-volumes: "data-volume"
```

Velero, a backup/restore tool, uses this to identify which volumes to snapshot.
7. Adding Owner or Team Info (Internal Tracking)

```yaml
metadata:
  annotations:
    owner: "devops-team"
    contact-email: "devops@company.com"
```

Helps with ownership traceability, which is especially useful for internal policies.
8. Assigning AppArmor Profiles (beta annotation, deprecated in newer versions but still seen on legacy clusters)

```yaml
metadata:
  annotations:
    apparmor.security.beta.kubernetes.io/nginx: localhost/nginx-apparmor-profile
```

Used to assign an AppArmor profile to a container (here, the container named `nginx`).
3. Namespaces: Logical Resource Boundaries
Namespaces are a way to logically group resources in a cluster. They're useful for multi-tenancy, separating dev/staging/prod, and applying policies.
- Namespaces span nodes (VMs): they are a logical boundary, not tied to any physical machine.
- You can set resource quotas on namespaces to limit how much CPU/memory they can consume.
Pro Tip: Scheduler is namespace-agnostic, meaning it schedules based on node resources and policies, not namespace boundaries.
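Creating a namespace is a one-liner; as a quick illustration (the name and label are arbitrary):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    env: staging   # labels on namespaces are handy for policies
```

The imperative equivalent is `kubectl create namespace staging`.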
4. Resource Quotas and Pod Overhead
Resource Quotas:
You can apply a `ResourceQuota` to a namespace to limit total CPU/memory usage:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "4"
    limits.memory: "8Gi"
```
Pod Overhead:
Introduced to account for resources consumed by the pod infrastructure itself, beyond what the containers request (for example, the sandbox or pause container set up by the container runtime). The overhead is added on top of a pod's resource requests when the scheduler sizes the pod against a node.
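Pod overhead is declared on a RuntimeClass. A sketch, assuming a Kata Containers setup (the name and handler depend on your runtime configuration, and the overhead figures are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers   # illustrative name
handler: kata             # must match a handler configured in the container runtime
overhead:
  podFixed:
    cpu: 250m             # illustrative per-pod CPU overhead
    memory: 120Mi         # illustrative per-pod memory overhead
```

Pods that set `runtimeClassName: kata-containers` get this overhead added to their requests at admission time and factored into scheduling.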
5. Advanced Scheduling Rules
Taints and Tolerations:
- Taint: Applied to nodes to repel pods that don't tolerate them.

```shell
kubectl taint nodes node1 key=value:NoSchedule
```

Effects:
- `NoSchedule`: New pods won't be scheduled on the node.
- `PreferNoSchedule`: The scheduler tries to avoid the node but may still schedule onto it.
- `NoExecute`: New pods aren't scheduled, and existing pods without a matching toleration are evicted.

- Toleration: Pods declare which taints they can tolerate.
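A pod that tolerates the taint applied above would declare the following (a sketch matching the `key=value:NoSchedule` example):

```yaml
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
```

A toleration doesn't force the pod onto the tainted node; it only removes the taint as a reason to filter that node out.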
Node Affinity:
Used to express node preferences using labels. There are two flavors:
- `requiredDuringSchedulingIgnoredDuringExecution`: a hard requirement at scheduling time.
- `preferredDuringSchedulingIgnoredDuringExecution`: a soft preference the scheduler tries to honor.

Both are "IgnoredDuringExecution": if node labels change after the pod lands, it isn't evicted. Example:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
```
6. Topology Spread Constraints
Used to ensure pods are evenly distributed across zones, nodes, or other topologies.
- Max Skew: The maximum allowed difference in pod counts across topology domains.
- Topology Key: The node label that defines the domain (e.g., `topology.kubernetes.io/zone`).
- When Unsatisfiable:
  - `DoNotSchedule`: Reject the pod (hard constraint).
  - `ScheduleAnyway`: Schedule anyway, but prefer to honor the constraint.
```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: frontend
```
7. Priority Classes and Preemption
Used to influence which pods get scheduled first during resource scarcity.
- Higher priority pods can preempt (evict) lower priority ones.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
```
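A pod opts into a priority class by name; a minimal sketch (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority   # references the PriorityClass by name
  containers:
  - name: app
    image: nginx:1.25                # example image
```

If no node has room, the scheduler may preempt lower-priority pods to make space for this one.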
8. Node Lease (Heartbeat Mechanism)
Each node's kubelet periodically renews a Lease object in the `kube-node-lease` namespace. This lightweight heartbeat reduces API server load compared to sending full node status updates, while still keeping node health current.
If the lease isn't renewed in time, Kubernetes treats the node as unhealthy and eventually evicts its pods.
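These leases can be inspected directly with `kubectl get leases -n kube-node-lease` (one per node). A trimmed sketch of what a lease object looks like, with illustrative values:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node1                                  # matches the node name
  namespace: kube-node-lease
spec:
  holderIdentity: node1
  leaseDurationSeconds: 40
  renewTime: "2024-05-01T12:00:00.000000Z"     # illustrative timestamp
```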
Final Thoughts
Kubernetes scheduling is both powerful and flexible. By understanding how labels, namespaces, taints, and priorities influence pod placement, you're well on your way to designing production-ready workloads. Whether you're building resilient systems or fine-tuning resource usage, the key lies in mastering these building blocks.
Ready to level up your workloads? Start small, experiment with taints, priorities, and quotas — and you'll soon think like the scheduler!