Kubernetes Cluster Maintenance
Kubernetes has emerged as the de facto standard for container orchestration, enabling organizations to deploy, scale, and manage containerized applications efficiently. As with any critical infrastructure component, regular maintenance is vital to ensure optimal performance, security, and reliability.
In this comprehensive guide, we'll delve into essential maintenance tasks for Kubernetes clusters:
Upgrading the cluster: Keeping your Kubernetes version up-to-date to leverage new features, improvements, and security patches.
Backing up and restoring data: Ensuring data integrity and availability in case of failures or disasters.
Scaling the cluster: Adjusting resources to meet the dynamic demands of your applications.
Whether you're a seasoned Kubernetes administrator or just starting out, this guide will provide practical insights and step-by-step instructions on how to maintain a robust and efficient Kubernetes environment.
Upgrading the Kubernetes Cluster
Upgrading your Kubernetes cluster is a critical maintenance task that ensures you benefit from the latest features, performance improvements, and security patches. However, it requires careful planning and execution to prevent application disruptions.
Why Upgrading is Essential
Kubernetes releases frequent updates that include new features, bug fixes, and security enhancements. Running outdated versions can expose your cluster to vulnerabilities and limit your ability to leverage improvements in performance and functionality.
Benefits of regular upgrades include:
Security: Patches for known vulnerabilities protect your cluster from potential attacks.
Performance: Optimizations and enhancements improve cluster efficiency.
Compatibility: Staying current ensures compatibility with the latest tools and integrations.
Support: Older versions eventually lose community and vendor support.
Pre-Upgrade Considerations
Before initiating an upgrade, it's crucial to prepare adequately to minimize risks and ensure a smooth transition.
1. Review Release Notes
Understanding the changes in the new version is vital. Review the official Kubernetes release notes for:
Deprecated features: Identify any features that are no longer supported and plan for alternatives.
API changes: Ensure your applications and configurations are compatible with API updates.
Bug fixes and enhancements: Understand improvements that could impact your workloads.
Example:
You can find release notes on the official Kubernetes GitHub repository:
https://github.com/kubernetes/kubernetes/releases
2. Backup Your Cluster
Always perform a complete backup before upgrading. This includes:
etcd database: Contains the cluster's state and configuration data.
Persistent volumes: Stores application data.
Cluster configurations: Includes manifests and deployment files.
We'll cover backup procedures in detail in the next section.
3. Test in a Staging Environment
Before upgrading your production cluster, replicate the process in a staging or testing environment to identify and resolve potential issues.
Steps:
Set up a staging cluster: Mirror your production environment as closely as possible.
Perform the upgrade: Follow the planned upgrade steps.
Validate applications: Ensure all applications function correctly post-upgrade.
Document issues: Note any problems encountered and their resolutions.
4. Plan for Downtime
While the goal is to achieve zero-downtime upgrades, it's prudent to plan for potential interruptions:
Notify stakeholders: Inform users and teams about the maintenance window.
Schedule during low-traffic periods: Reduce the impact on users by choosing optimal times for upgrades.
Prepare rollback plans: In case of critical issues, have a strategy to revert to the previous version.
Step-by-Step Upgrade Process
The upgrade process varies depending on your Kubernetes installation method and the components involved. We'll focus on upgrading clusters managed by kubeadm, a popular and officially supported tool.
1. Verify Current Cluster Version
Start by checking the current version of your cluster components.
Command:
kubectl version --short
Example Output:
Client Version: v1.21.0
Server Version: v1.21.0
2. Check Available Versions
Identify the available Kubernetes versions you can upgrade to.
Command:
apt-cache madison kubeadm
Example Output:
kubeadm | 1.22.0-00 | https://apt.kubernetes.io/ kubernetes-xenial/main amd64 Packages
kubeadm | 1.21.3-00 | https://apt.kubernetes.io/ kubernetes-xenial/main amd64 Packages
Note: Ensure that you upgrade one minor version at a time (e.g., from 1.21.x to 1.22.x).
3. Drain the Master Node
Before upgrading the control plane components, cordon and drain the master node to prevent new pods from being scheduled and to safely evict existing ones. (kubectl drain cordons the node automatically; running cordon first simply makes the intent explicit.)
Commands:
kubectl cordon <master-node-name>
kubectl drain <master-node-name> --ignore-daemonsets
Example:
kubectl cordon master-node
kubectl drain master-node --ignore-daemonsets
4. Upgrade kubeadm
Upgrade the kubeadm tool on the master node.
Commands:
sudo apt-get update && sudo apt-get install -y kubeadm=1.22.0-00
Note: If the packages were pinned with apt-mark hold (as the official installation guides recommend), unhold kubeadm first and re-hold it after the upgrade.
Verify the version:
kubeadm version
Example Output:
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", ...}
5. Apply the Upgrade Plan
Let kubeadm prepare the upgrade plan.
Command:
sudo kubeadm upgrade plan
This command will display the available versions and component upgrades.
6. Execute the Upgrade
Proceed with the upgrade using kubeadm.
Command:
sudo kubeadm upgrade apply v1.22.0
This will upgrade the control plane components (kube-apiserver, kube-controller-manager, kube-scheduler, and etcd on stacked clusters), along with addons such as CoreDNS and kube-proxy.
7. Upgrade kubelet and kubectl
After upgrading the control plane, update kubelet and kubectl.
Commands:
sudo apt-get install -y kubelet=1.22.0-00 kubectl=1.22.0-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet
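To confirm the control plane node picked up the new version, check the node and kubelet status:
kubectl get nodes
sudo systemctl status kubelet --no-pager
The node should report the upgraded kubelet version in the VERSION column.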
8. Uncordon the Master Node
Allow the master node to schedule pods again.
Command:
kubectl uncordon master-node
9. Upgrade Worker Nodes
Repeat the following steps for each worker node:
a. Drain the Node:
kubectl drain <worker-node-name> --ignore-daemonsets
b. Upgrade kubeadm:
sudo apt-get update && sudo apt-get install -y kubeadm=1.22.0-00
sudo kubeadm upgrade node
c. Upgrade kubelet and kubectl:
sudo apt-get install -y kubelet=1.22.0-00 kubectl=1.22.0-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet
d. Uncordon the Node:
kubectl uncordon <worker-node-name>
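Once every worker has been upgraded and uncordoned, a quick sanity check confirms that all nodes report the new version:
kubectl get nodes -o wide
Any node still showing the old version was likely skipped or failed to restart its kubelet.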
Handling Zero-Downtime Upgrades
Achieving zero downtime during upgrades is critical for high-availability applications.
Strategies include:
Rolling upgrades: Upgrade nodes one at a time while the rest of the cluster continues to serve traffic.
Pod disruption budgets (PDBs): Define the minimum number of pods that must remain available during voluntary disruptions such as node drains.
Readiness probes: Ensure that services only receive traffic when they are ready to handle it.
Example of a Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Applying the PDB:
kubectl apply -f pdb.yaml
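A readiness probe complements the PDB by keeping traffic away from pods that are not yet ready to serve. A minimal sketch for the container spec (the /healthz path and port 8080 are placeholders for your application's health endpoint):
containers:
- name: my-app-container
  image: my-app-image
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10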
Troubleshooting Upgrade Issues
Upgrades may encounter issues. Common problems and solutions include:
1. Failed to Drain Node
Draining fails when a node hosts pods that are not managed by a controller (no ReplicaSet, Job, or similar) or pods using emptyDir volumes.
Solution:
- Identify the offending pods:
kubectl get pods --all-namespaces -o wide | grep <node-name>
- Delete the pod if appropriate:
kubectl delete pod <pod-name> -n <namespace>
- Alternatively, run kubectl drain with --force to evict unmanaged pods and --delete-emptydir-data to evict pods that use emptyDir volumes.
2. Version Skew
Components running different Kubernetes versions.
Solution:
Ensure all components are upgraded sequentially.
Verify versions:
kubectl get nodes
kubectl version --short
3. etcd Upgrade Failures
Issues with etcd during the control plane upgrade.
Solution:
- Check etcd health (on a kubeadm cluster the client certificates typically live under /etc/kubernetes/pki/etcd):
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
- Restore etcd from backup if necessary.
Backing Up and Restoring Data
Data backup and recovery are critical components of Kubernetes cluster maintenance. Proper backups ensure that you can recover from failures, corruption, or disasters with minimal downtime and data loss.
Understanding Backup Necessities
Backups provide a safety net against:
Hardware failures
Human errors
Cyber-attacks
Natural disasters
Regular backups and tested restoration procedures ensure business continuity and data integrity.
Components to Back Up
In Kubernetes, the primary components to back up include:
1. etcd Database
Description: A key-value store that holds the cluster's entire state and configuration.
Importance: Losing etcd data can render your cluster non-functional.
2. Persistent Volumes
Description: Storage volumes attached to pods for storing application data.
Importance: Contains critical application data that needs to be preserved.
3. Cluster Configurations and Manifests
Description: YAML files defining deployments, services, and other Kubernetes objects.
Importance: Necessary to recreate or restore cluster resources accurately.
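Velero (covered next) captures cluster resources and volumes, but on self-managed clusters the etcd database is usually snapshotted directly with etcdctl. A sketch for a kubeadm-based control plane, assuming the default certificate paths under /etc/kubernetes/pki/etcd (the output path is illustrative):
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Verify the snapshot afterwards:
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db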
Using Velero for Backups
Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes.
Features:
Backup and restore of cluster resources and volumes
Scheduling of backups
Migration of resources between clusters
1. Installing Velero
Prerequisites:
Access to object storage: Velero supports various storage providers like AWS S3, Azure Blob Storage, and Google Cloud Storage.
Kubernetes cluster admin access
a. Install Velero CLI
Download the latest release from the Velero GitHub repository.
Example for Linux:
wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.0/velero-v1.8.0-linux-amd64.tar.gz
tar -xvf velero-v1.8.0-linux-amd64.tar.gz
sudo mv velero-v1.8.0-linux-amd64/velero /usr/local/bin/
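Before proceeding, confirm the CLI is on your PATH:
velero version --client-only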
b. Configure Storage Provider
Create a credentials file for your storage provider.
Example for AWS:
[default]
aws_access_key_id=<YOUR_ACCESS_KEY>
aws_secret_access_key=<YOUR_SECRET_KEY>
c. Install Velero Server Components
Use the Velero CLI to install server components.
Command:
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.2.0 \
--bucket <YOUR_BUCKET_NAME> \
--backup-location-config region=<YOUR_REGION> \
--snapshot-location-config region=<YOUR_REGION> \
--secret-file ./credentials-velero
Verify Installation:
kubectl get pods -n velero
Example Output:
NAME READY STATUS RESTARTS AGE
velero-6c6f5b7b5d-x9qzf 1/1 Running 0 5m
Performing a Backup
With Velero installed, you can create backups of your cluster resources.
1. Backup All Resources
Command:
velero backup create full-backup
Verify Backup Status:
velero backup get
Example Output:
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION
full-backup Completed 0 0 2023-10-01 10:00:00 +0000 UTC 29d default
2. Backup Specific Namespace
Command:
velero backup create nginx-backup --include-namespaces=nginx
3. Schedule Regular Backups
Command:
velero schedule create daily-backup --schedule="0 2 * * *"
This schedules a backup every day at 2 AM.
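The schedule can also carry a retention period. For example, keeping each daily backup for one week (the --ttl flag accepts a Go-style duration):
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 168h0m0s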
Restoring from a Backup
Restoration is as important as backing up. Velero makes it straightforward to restore data.
1. List Available Backups
Command:
velero backup get
2. Perform a Restore
Command:
velero restore create --from-backup full-backup
Verify Restore Status:
velero restore get
Example Output:
NAME STATUS ERRORS WARNINGS CREATED SELECTOR
restore-full-backup Completed 0 0 2023-10-01 12:00:00 +0000 UTC <none>
3. Restore Specific Namespace
Command:
velero restore create --from-backup nginx-backup --include-namespaces=nginx
Validating and Testing Backups
Regularly test your backups to ensure they can be restored successfully.
Steps:
Create a Test Cluster: Set up a non-production cluster.
Perform a Restore: Use Velero to restore backups to the test cluster.
Validate Data: Check if applications are running correctly and data integrity is maintained.
Document Results: Record the outcome and any issues encountered.
Automated Testing:
Implement automated scripts to perform and validate backups regularly.
Example Script:
#!/bin/bash
set -euo pipefail

# Create a backup and wait for it to complete
velero backup create test-backup --wait

# Simulate a disaster by deleting a namespace
kubectl delete namespace nginx

# Restore from the backup and wait for completion
velero restore create --from-backup test-backup --wait

# Verify the namespace was restored
kubectl get namespaces | grep nginx
Scaling the Kubernetes Cluster
Scaling is essential to meet the varying demands of applications. Kubernetes provides robust mechanisms to scale both applications and the cluster itself.
The Importance of Scaling
Proper scaling ensures:
Performance: Applications can handle increased load without degradation.
Cost-efficiency: Resources are utilized optimally, reducing unnecessary expenses.
Reliability: High availability is maintained even during peak usage.
Vertical vs. Horizontal Scaling
Understanding the types of scaling is crucial for effective resource management.
1. Vertical Scaling
Definition: Increasing or decreasing the resource limits (CPU, memory) of individual pods.
Use Cases:
Applications that cannot be replicated easily.
When a single instance needs more resources to perform efficiently.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
To vertically scale, modify the resources section and re-apply the manifest, or use the imperative command sketched below.
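For a quick change without editing the manifest, kubectl set resources updates the same fields imperatively (container name as defined above; the new values are illustrative):
kubectl set resources deployment my-app \
  --containers=my-app-container \
  --requests=cpu=750m,memory=768Mi \
  --limits=cpu=2,memory=2Gi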
2. Horizontal Scaling
Definition: Increasing or decreasing the number of pod replicas.
Use Cases:
Stateless applications that can run multiple instances in parallel.
Handling fluctuating loads effectively.
Example:
kubectl scale deployment my-app --replicas=5
Scaling Applications
1. Manual Scaling
Manually adjust the number of replicas.
Command:
kubectl scale deployment my-app --replicas=3
2. Automated Scaling with HPA
The Horizontal Pod Autoscaler (HPA) automatically scales the number of pods based on observed CPU utilization or custom metrics.
a. Enable Metrics Server
Ensure that the Metrics Server is deployed in your cluster.
Installation:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
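Once the Metrics Server is running, resource metrics should be queryable; if these commands return numbers rather than errors, the HPA has the data it needs:
kubectl top nodes
kubectl top pods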
b. Create HPA
Command:
kubectl autoscale deployment my-app --cpu-percent=50 --min=2 --max=10
This command creates an HPA that maintains CPU utilization at 50%, scaling between 2 and 10 replicas.
c. View HPA Status
kubectl get hpa
Example Output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-app Deployment/my-app 60%/50% 2 10 5 10m
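The same autoscaler can also be written declaratively, which is easier to version-control. A sketch using the autoscaling/v2 API (on clusters older than v1.23 the group is autoscaling/v2beta2):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50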
3. Vertical Pod Autoscaler (VPA)
VPA adjusts the CPU and memory requests/limits of pods based on usage.
a. Install VPA
The VPA components ship in the kubernetes/autoscaler repository and are installed with its vpa-up.sh script:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
b. Configure VPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
Apply Configuration:
kubectl apply -f vpa.yaml
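The VPA needs some time to gather usage data; its current recommendations can be inspected on the object itself:
kubectl describe vpa my-app-vpa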
Scaling Cluster Nodes
In addition to scaling applications, you may need to scale the cluster nodes to provide sufficient resources.
1. Cluster Autoscaler
The Cluster Autoscaler automatically adjusts the number of nodes in your cluster based on the resource requests of pods.
a. Install Cluster Autoscaler
For AWS EKS:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
b. Configure Autoscaler
Edit the deployment to set the appropriate parameters.
Example:
spec:
  containers:
  - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
    name: cluster-autoscaler
    command:
    - ./cluster-autoscaler
    - --v=4
    - --cluster-name=<your-cluster-name>
    - --aws-use-static-instance-list=true
    - --nodes=1:10:<node-group-name>
c. Set Auto-Scaling Groups
Ensure your cloud provider's node groups are configured to allow scaling within the specified range.
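When using the autodiscovery manifest, the autoscaler finds node groups by Auto Scaling group tags rather than the --nodes flag. A sketch of tagging an ASG so it is discovered (resource and cluster names are placeholders):
aws autoscaling create-or-update-tags --tags \
  ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
  ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<your-cluster-name>,Value=owned,PropagateAtLaunch=true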
2. Manual Node Scaling
Manually add or remove nodes based on demand.
For AWS EKS:
aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <node-group-name> --scaling-config minSize=2,maxSize=5,desiredSize=3
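You can confirm the change took effect by querying the node group's scaling configuration:
aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <node-group-name> --query "nodegroup.scalingConfig"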
Best Practices for Scaling
Monitor Metrics: Use monitoring tools like Prometheus and Grafana to track resource utilization.
Define Resource Requests and Limits: Properly set requests and limits for containers to enable effective scheduling and autoscaling.
Use Readiness and Liveness Probes: Ensure that scaling operations do not affect application availability.
Test Scaling Policies: Regularly test and validate your scaling configurations under different load scenarios.
Cost Management: Balance performance and cost by optimizing resource utilization.
Conclusion
Maintaining a Kubernetes cluster involves a combination of regular upgrades, reliable backup and restore strategies, and efficient scaling practices. By following the guidelines outlined in this guide, you can ensure that your Kubernetes environment remains secure, resilient, and capable of meeting the dynamic needs of your applications.
Key Takeaways:
Upgrading: Regularly update your cluster to benefit from the latest features and security patches while carefully planning and testing the process to avoid disruptions.
Backing Up and Restoring: Implement robust backup solutions like Velero and routinely test restoration procedures to safeguard against data loss.
Scaling: Utilize Kubernetes' scaling features, including HPA, VPA, and Cluster Autoscaler, to maintain optimal performance and resource utilization.
By integrating these maintenance practices into your operational workflows, you can achieve a high-performing, reliable, and scalable Kubernetes infrastructure that supports your organization's objectives.
Thank you for taking the time to read! 💚