Kubernetes Cluster Maintenance


Kubernetes has emerged as the de facto standard for container orchestration, enabling organizations to deploy, scale, and manage containerized applications efficiently. As with any critical infrastructure component, regular maintenance is vital to ensure optimal performance, security, and reliability.

In this comprehensive guide, we'll delve into essential maintenance tasks for Kubernetes clusters:

  • Upgrading the cluster: Keeping your Kubernetes version up-to-date to leverage new features, improvements, and security patches.

  • Backing up and restoring data: Ensuring data integrity and availability in case of failures or disasters.

  • Scaling the cluster: Adjusting resources to meet the dynamic demands of your applications.

Whether you're a seasoned Kubernetes administrator or just starting out, this guide will provide practical insights and step-by-step instructions on how to maintain a robust and efficient Kubernetes environment.


Upgrading the Kubernetes Cluster

Upgrading your Kubernetes cluster is a critical maintenance task that ensures you benefit from the latest features, performance improvements, and security patches. However, it requires careful planning and execution to prevent application disruptions.

Why Upgrading is Essential

Kubernetes releases frequent updates that include new features, bug fixes, and security enhancements. Running outdated versions can expose your cluster to vulnerabilities and limit your ability to leverage improvements in performance and functionality.

Benefits of regular upgrades include:

  • Security: Patches for known vulnerabilities protect your cluster from potential attacks.

  • Performance: Optimizations and enhancements improve cluster efficiency.

  • Compatibility: Staying current ensures compatibility with the latest tools and integrations.

  • Support: Older versions eventually lose community and vendor support.

Pre-Upgrade Considerations

Before initiating an upgrade, it's crucial to prepare adequately to minimize risks and ensure a smooth transition.

1. Review Release Notes

Understanding the changes in the new version is vital. Review the official Kubernetes release notes for:

  • Deprecated features: Identify any features that are no longer supported and plan for alternatives.

  • API changes: Ensure your applications and configurations are compatible with API updates.

  • Bug fixes and enhancements: Understand improvements that could impact your workloads.

Example:

You can find release notes on the official Kubernetes GitHub repository:

https://github.com/kubernetes/kubernetes/releases

2. Backup Your Cluster

Always perform a complete backup before upgrading. This includes:

  • etcd database: Contains the cluster's state and configuration data.

  • Persistent volumes: Stores application data.

  • Cluster configurations: Includes manifests and deployment files.

We'll cover backup procedures in detail in the next section.
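For the etcd component specifically, a point-in-time snapshot can be taken before the upgrade. The commands below are a minimal sketch assuming a kubeadm-managed cluster with the default certificate paths; adjust the paths to your installation:

ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Confirm the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-pre-upgrade.db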

3. Test in a Staging Environment

Before upgrading your production cluster, replicate the process in a staging or testing environment to identify and resolve potential issues.

Steps:

  • Set up a staging cluster: Mirror your production environment as closely as possible.

  • Perform the upgrade: Follow the planned upgrade steps.

  • Validate applications: Ensure all applications function correctly post-upgrade.

  • Document issues: Note any problems encountered and their resolutions.

4. Plan for Downtime

While the goal is to achieve zero-downtime upgrades, it's prudent to plan for potential interruptions:

  • Notify stakeholders: Inform users and teams about the maintenance window.

  • Schedule during low-traffic periods: Reduce the impact on users by choosing optimal times for upgrades.

  • Prepare rollback plans: In case of critical issues, have a strategy to revert to the previous version.

Step-by-Step Upgrade Process

The upgrade process varies depending on your Kubernetes installation method and the components involved. We'll focus on upgrading clusters managed by Kubeadm, which is a popular and officially supported tool.

1. Verify Current Cluster Version

Start by checking the current version of your cluster components.

Command:

kubectl version --short

Example Output:

Client Version: v1.21.0
Server Version: v1.21.0

2. Check Available Versions

Identify the available Kubernetes versions you can upgrade to.

Command:

apt-cache madison kubeadm

Example Output:

kubeadm | 1.22.0-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages
kubeadm | 1.21.3-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages

Note: Ensure that you upgrade one minor version at a time (e.g., from 1.21.x to 1.22.x).

3. Drain the Master Node

Before upgrading the control plane components, cordon and drain the master node to prevent new pods from being scheduled and to safely evict existing pods.

Commands:

kubectl cordon <master-node-name>
kubectl drain <master-node-name> --ignore-daemonsets

Example:

kubectl cordon master-node
kubectl drain master-node --ignore-daemonsets

4. Upgrade kubeadm

Upgrade the kubeadm tool on the master node.

Commands:

sudo apt-get update && sudo apt-get install -y kubeadm=1.22.0-00

Verify the version:

kubeadm version

Example Output:

kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", ...}

5. Apply the Upgrade Plan

Let Kubeadm prepare the upgrade plan.

Command:

sudo kubeadm upgrade plan

This command will display the available versions and component upgrades.

6. Execute the Upgrade

Proceed with the upgrade using Kubeadm.

Command:

sudo kubeadm upgrade apply v1.22.0

This will upgrade the control plane components: kube-apiserver, kube-controller-manager, and kube-scheduler.

7. Upgrade kubelet and kubectl

After upgrading the control plane, update kubelet and kubectl.

Commands:

sudo apt-get install -y kubelet=1.22.0-00 kubectl=1.22.0-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet

8. Uncordon the Master Node

Allow the master node to schedule pods again.

Command:

kubectl uncordon master-node

9. Upgrade Worker Nodes

Repeat the following steps for each worker node:

a. Drain the Node:

kubectl drain <worker-node-name> --ignore-daemonsets

b. Upgrade kubeadm:

sudo apt-get update && sudo apt-get install -y kubeadm=1.22.0-00
sudo kubeadm upgrade node

c. Upgrade kubelet and kubectl:

sudo apt-get install -y kubelet=1.22.0-00 kubectl=1.22.0-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet

d. Uncordon the Node:

kubectl uncordon <worker-node-name>
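Once every node has been upgraded and uncordoned, confirm that the whole cluster reports the target version:

kubectl get nodes

The VERSION column should show v1.22.0 for every node before you consider the upgrade complete.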

Handling Zero-Downtime Upgrades

Achieving zero downtime during upgrades is critical for high-availability applications.

Strategies include:

  • Rolling upgrades: Upgrade nodes one at a time while the rest of the cluster continues to serve traffic.

  • Pod disruption budgets (PDBs): Define the minimum number of pods that should be running during disruptions.

  • Readiness probes: Ensure that services only receive traffic when they are ready to handle it.
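For the readiness probe strategy, here is a minimal sketch of a container that only receives traffic once its health endpoint responds; the /healthz path and port 8080 are assumptions to adapt to your application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10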

Example of a Pod Disruption Budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Applying the PDB:

kubectl apply -f pdb.yaml

Troubleshooting Upgrade Issues

Upgrades may encounter issues. Common problems and solutions include:

1. Failed to Drain Node

If pods that are not managed by a controller (for example, bare pods without a ReplicaSet) prevent draining:

Solution:

  • Identify the pod:
kubectl get pods --all-namespaces -o wide | grep <node-name>
  • Delete the pod if appropriate:
kubectl delete pod <pod-name> -n <namespace>

2. Version Skew

Components running different Kubernetes versions.

Solution:

  • Ensure all components are upgraded sequentially.

  • Verify versions:

kubectl get nodes
kubectl version --short

3. etcd Upgrade Failures

Issues with etcd during control plane upgrade.

Solution:

  • Check etcd status:
ETCDCTL_API=3 etcdctl endpoint health
  • Restore etcd from backup if necessary.
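If a recent snapshot exists, restoring etcd looks roughly like the sketch below. The paths are kubeadm defaults and the exact procedure varies by installation, so treat this as an outline rather than a drop-in recipe:

# Temporarily stop the API server by moving its static pod manifest aside
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-pre-upgrade.db \
    --data-dir=/var/lib/etcd-restored

# Point the etcd static pod at the restored data directory by editing
# /etc/kubernetes/manifests/etcd.yaml, then move the API server manifest back
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/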

Backing Up and Restoring Data

Data backup and recovery are critical components of Kubernetes cluster maintenance. Reliable backups and tested restores ensure that you can recover from data loss, corruption, or disasters with minimal downtime.

Understanding Backup Necessities

Backups provide a safety net against:

  • Hardware failures

  • Human errors

  • Cyber-attacks

  • Natural disasters

Regular backups and tested restoration procedures ensure business continuity and data integrity.

Components to Back Up

In Kubernetes, the primary components to back up include:

1. etcd Database

  • Description: A key-value store that holds the cluster's entire state and configuration.

  • Importance: Losing etcd data can render your cluster non-functional.

2. Persistent Volumes

  • Description: Storage volumes attached to pods for storing application data.

  • Importance: Contains critical application data that needs to be preserved.

3. Cluster Configurations and Manifests

  • Description: YAML files defining deployments, services, and other Kubernetes objects.

  • Importance: Necessary to recreate or restore cluster resources accurately.
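Ideally these manifests live in version control. As a quick, partial export you can also dump resources to YAML; note that kubectl get all does not cover every object type, so ConfigMaps, Secrets, and CRD-based resources need to be listed explicitly:

kubectl get all --all-namespaces -o yaml > cluster-resources.yaml
kubectl get configmaps,secrets,ingresses --all-namespaces -o yaml > cluster-extras.yaml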


Using Velero for Backups

Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes.

Features:

  • Backup and restore of cluster resources and volumes

  • Scheduling of backups

  • Migration of resources between clusters

1. Installing Velero

Prerequisites:

  • Access to object storage: Velero supports various storage providers like AWS S3, Azure Blob Storage, and Google Cloud Storage.

  • Kubernetes cluster admin access

a. Install Velero CLI

Download the latest release from the Velero GitHub repository.

Example for Linux:

wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.0/velero-v1.8.0-linux-amd64.tar.gz
tar -xvf velero-v1.8.0-linux-amd64.tar.gz
sudo mv velero-v1.8.0-linux-amd64/velero /usr/local/bin/

b. Configure Storage Provider

Create a credentials file for your storage provider.

Example for AWS:

[default]
aws_access_key_id=<YOUR_ACCESS_KEY>
aws_secret_access_key=<YOUR_SECRET_KEY>

c. Install Velero Server Components

Use the Velero CLI to install server components.

Command:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.2.0 \
    --bucket <YOUR_BUCKET_NAME> \
    --backup-location-config region=<YOUR_REGION> \
    --snapshot-location-config region=<YOUR_REGION> \
    --secret-file ./credentials-velero

Verify Installation:

kubectl get pods -n velero

Example Output:

NAME                      READY   STATUS    RESTARTS   AGE
velero-6c6f5b7b5d-x9qzf   1/1     Running   0          5m

Performing a Backup

With Velero installed, you can create backups of your cluster resources.

1. Backup All Resources

Command:

velero backup create full-backup

Verify Backup Status:

velero backup get

Example Output:

NAME          STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION
full-backup   Completed   0        0          2023-10-01 10:00:00 +0000 UTC   29d       default

2. Backup Specific Namespace

Command:

velero backup create nginx-backup --include-namespaces=nginx

3. Schedule Regular Backups

Command:

velero schedule create daily-backup --schedule="0 2 * * *"

This schedules a backup every day at 2 AM.
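To confirm the schedule was created:

velero schedule get

Backups created by a schedule expire after Velero's default retention period of 30 days; you can override this with the --ttl flag when creating the schedule, for example --ttl 72h0m0s.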

Restoring from a Backup

Restoration is as important as backing up. Velero makes it straightforward to restore data.

1. List Available Backups

Command:

velero backup get

2. Perform a Restore

Command:

velero restore create --from-backup full-backup

Verify Restore Status:

velero restore get

Example Output:

NAME                    STATUS      ERRORS   WARNINGS   CREATED                         SELECTOR
restore-full-backup     Completed   0        0          2023-10-01 12:00:00 +0000 UTC   <none>

3. Restore Specific Namespace

Command:

velero restore create --from-backup nginx-backup --include-namespaces=nginx
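Velero can also restore a backup into a differently named namespace, which is handy for validating a backup without touching the original. A sketch using the --namespace-mappings flag, with nginx-restored as an illustrative target namespace:

velero restore create --from-backup nginx-backup --namespace-mappings nginx:nginx-restored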

Validating and Testing Backups

Regularly test your backups to ensure they can be restored successfully.

Steps:

  1. Create a Test Cluster: Set up a non-production cluster.

  2. Perform a Restore: Use Velero to restore backups to the test cluster.

  3. Validate Data: Check if applications are running correctly and data integrity is maintained.

  4. Document Results: Record the outcome and any issues encountered.

Automated Testing:

Implement automated scripts to perform and validate backups regularly.

Example Script:

#!/bin/bash

# Create a backup and wait for it to complete
velero backup create test-backup --wait

# Simulate a disaster by deleting a namespace and waiting for it to terminate
kubectl delete namespace nginx
kubectl wait --for=delete namespace/nginx --timeout=120s

# Restore from the backup and wait for completion
velero restore create --from-backup test-backup --wait

# Verify the namespace was restored
kubectl get namespaces | grep nginx

Scaling the Kubernetes Cluster

Scaling is essential to meet the varying demands of applications. Kubernetes provides robust mechanisms to scale both applications and the cluster itself.

The Importance of Scaling

Proper scaling ensures:

  • Performance: Applications can handle increased load without degradation.

  • Cost-efficiency: Resources are utilized optimally, reducing unnecessary expenses.

  • Reliability: High availability is maintained even during peak usage.

Vertical vs. Horizontal Scaling

Understanding the types of scaling is crucial for effective resource management.

1. Vertical Scaling

Definition: Increasing or decreasing the resource limits (CPU, memory) of individual pods.

Use Cases:

  • Applications that cannot be replicated easily.

  • When a single instance needs more resources to perform efficiently.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"

To vertically scale, modify the resources section.
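You can also adjust these values on a live Deployment with kubectl, which triggers a rolling restart of its pods. A sketch using the container and deployment names from the example above:

kubectl set resources deployment my-app -c=my-app-container --requests=cpu=1,memory=1Gi --limits=cpu=2,memory=2Gi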

2. Horizontal Scaling

Definition: Increasing or decreasing the number of pod replicas.

Use Cases:

  • Stateless applications that can run multiple instances in parallel.

  • Handling fluctuating loads effectively.

Example:

kubectl scale deployment my-app --replicas=5

Scaling Applications

1. Manual Scaling

Manually adjust the number of replicas.

Command:

kubectl scale deployment my-app --replicas=3

2. Automated Scaling with HPA

The Horizontal Pod Autoscaler (HPA) automatically scales the number of pods based on observed CPU utilization or custom metrics.

a. Enable Metrics Server

Ensure that the Metrics Server is deployed in your cluster.

Installation:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

b. Create HPA

Command:

kubectl autoscale deployment my-app --cpu-percent=50 --min=2 --max=10

This command creates an HPA that maintains CPU utilization at 50%, scaling between 2 and 10 replicas.

c. View HPA Status

kubectl get hpa

Example Output:

NAME     REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-app   Deployment/my-app   60%/50%   2         10        5          10m
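The same autoscaler can also be defined declaratively, which is easier to keep in version control. A sketch using the autoscaling/v2 API (clusters older than 1.23 use autoscaling/v2beta2) for the my-app Deployment:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Apply it with:

kubectl apply -f hpa.yaml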

3. Vertical Pod Autoscaler (VPA)

VPA adjusts the CPU and memory requests/limits of pods based on usage.

a. Install VPA

The VPA is installed from the kubernetes/autoscaler repository rather than a standalone release manifest:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

b. Configure VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       my-app
  updatePolicy:
    updateMode: "Auto"

Apply Configuration:

kubectl apply -f vpa.yaml

Scaling Cluster Nodes

In addition to scaling applications, you may need to scale the cluster nodes to provide sufficient resources.

1. Cluster Autoscaler

The Cluster Autoscaler automatically adjusts the number of nodes in your cluster based on the resource requests of pods.

a. Install Cluster Autoscaler

For AWS EKS:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

b. Configure Autoscaler

Edit the deployment to set the appropriate parameters.

Example:

spec:
  containers:
  - image: k8s.gcr.io/cluster-autoscaler:v1.21.0
    name: cluster-autoscaler
    command:
    - ./cluster-autoscaler
    - --v=4
    - --cluster-name=<your-cluster-name>
    - --aws-use-static-instance-list=true
    - --nodes=1:10:<node-group-name>

c. Set Auto-Scaling Groups

Ensure your cloud provider's node groups are configured to allow scaling within the specified range.
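If you rely on auto-discovery (as in the manifest applied above) rather than listing node groups explicitly with --nodes, the Cluster Autoscaler finds AWS Auto Scaling groups through tags. A sketch of tagging an existing ASG, with the ASG and cluster names as placeholders:

aws autoscaling create-or-update-tags --tags \
    ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
    ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<your-cluster-name>,Value=owned,PropagateAtLaunch=true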

2. Manual Node Scaling

Manually add or remove nodes based on demand.

For AWS EKS:

aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <node-group-name> --scaling-config minSize=2,maxSize=5,desiredSize=3

Best Practices for Scaling

  • Monitor Metrics: Use monitoring tools like Prometheus and Grafana to track resource utilization.

  • Define Resource Requests and Limits: Properly set requests and limits for containers to enable effective scheduling and autoscaling.

  • Use Readiness and Liveness Probes: Ensure that scaling operations do not affect application availability.

  • Test Scaling Policies: Regularly test and validate your scaling configurations under different load scenarios.

  • Cost Management: Balance performance and cost by optimizing resource utilization.


Conclusion

Maintaining a Kubernetes cluster involves a combination of regular upgrades, reliable backup and restore strategies, and efficient scaling practices. By following the guidelines outlined in this guide, you can ensure that your Kubernetes environment remains secure, resilient, and capable of meeting the dynamic needs of your applications.

Key Takeaways:

  • Upgrading: Regularly update your cluster to benefit from the latest features and security patches while carefully planning and testing the process to avoid disruptions.

  • Backing Up and Restoring: Implement robust backup solutions like Velero and routinely test restoration procedures to safeguard against data loss.

  • Scaling: Utilize Kubernetes' scaling features, including HPA, VPA, and Cluster Autoscaler, to maintain optimal performance and resource utilization.

By integrating these maintenance practices into your operational workflows, you can achieve a high-performing, reliable, and scalable Kubernetes infrastructure that supports your organization's objectives.


💡 If you need help or have any questions, just leave them in the comments! 📝 I would be happy to answer them!

💡 If you found this post useful, please give it a thumbs up 👍 and consider following for more helpful content. 😊

Thank you for taking the time to read! 💚
