Introduction

In any Kubernetes cluster, etcd plays a vital role as it stores all the cluster's critical data, including configuration, state, and secrets. Since it acts as the "source of truth" for the entire cluster, ensuring the safety and integrity of etcd data is crucial for maintaining the health and continuity of your environment. A proper backup and restore strategy is essential for disaster recovery, cluster migrations, and maintaining data consistency.

Let’s dive into why and how we can back up and restore etcd in a Kubernetes cluster!

What is ETCD in Kubernetes?

etcd is a distributed key-value store that holds the critical configuration data and state of a Kubernetes cluster. This includes information such as:

Cluster state
API objects (nodes, pods, services, secrets, etc.)
Configuration data required to run and manage the cluster.

It is the "source of truth" for the entire Kubernetes cluster, making it one of the most important components.

Why Do We Need to Back Up and Restore ETCD?

Disaster Recovery: In case of hardware failures, accidental deletions, or corruption, you can restore the cluster to a previous state.
Cluster Migration: When moving the Kubernetes cluster to another environment or upgrading to a new version, a snapshot lets you carry the cluster data across or roll back if the change fails.
Audit and Compliance: Regular backups give you point-in-time copies of the cluster state that can be restored if an audit requires them.
Maintain State: To prevent loss of the cluster’s entire state (configurations, deployments, etc.), backups are crucial.

How to Back Up and Restore ETCD

Prerequisites

Access to the Master Node: We should be able to SSH into our Kubernetes master (control plane) node, where etcd runs as a static pod.
ETCD CLI tool: Ensure etcdctl (the etcd command-line client) is installed on the node where etcd runs.

To install it on Debian or Ubuntu, run:
```
 sudo apt install etcd-client
```
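
Before going further, it can help to confirm that the client is actually on the PATH. The following is a minimal sketch, assuming a POSIX shell; the variable name ETCDCTL_STATUS is only illustrative and not part of etcd:

```shell
# Check whether etcdctl is on the PATH and record the result.
if command -v etcdctl >/dev/null 2>&1; then
  ETCDCTL_STATUS="installed"
  # Print the client and API versions (ETCDCTL_API=3 selects the v3 API).
  ETCDCTL_API=3 etcdctl version
else
  ETCDCTL_STATUS="missing"
  echo "etcdctl not found; install it before continuing" >&2
fi
echo "etcdctl: $ETCDCTL_STATUS"
```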

Backup Steps

Set the Environment Variables
First, set ETCDCTL_API=3 so that etcdctl uses the v3 API (etcdctl 3.4 and later already default to v3, so this mainly matters for older clients):
```
 export ETCDCTL_API=3
```
Provide the Required Certificates and Endpoints
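Read the endpoint and the paths of the CA certificate, server certificate, and server key from the /etc/kubernetes/manifests/etcd.yaml file. They map to the etcdctl flags --endpoints, --cacert, --cert, and --key.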
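
The values can also be pulled out of the manifest with a small helper instead of reading the file by eye. This is only a sketch: the function name etcd_flag is hypothetical, and the flag names (listen-client-urls, trusted-ca-file, cert-file, key-file) assume a kubeadm-style manifest in which each etcd flag sits on its own "- --flag=value" line.

```shell
# Hypothetical helper: print the value of one etcd flag from the manifest.
# Usage: etcd_flag <flag-name> [manifest-path]
etcd_flag() {
  flag="$1"
  manifest="${2:-/etc/kubernetes/manifests/etcd.yaml}"
  # Matches lines such as "    - --cert-file=/etc/kubernetes/pki/etcd/server.crt"
  sed -n "s/^[[:space:]]*-[[:space:]]*--${flag}=//p" "$manifest" | head -n 1
}

# Only query the real manifest if it exists and is readable on this machine.
if [ -r /etc/kubernetes/manifests/etcd.yaml ]; then
  echo "endpoints: $(etcd_flag listen-client-urls)"
  echo "cacert:    $(etcd_flag trusted-ca-file)"
  echo "cert:      $(etcd_flag cert-file)"
  echo "key:       $(etcd_flag key-file)"
fi
```

If the manifest lists more than one client URL, any one of them (typically https://127.0.0.1:2379) can be passed to --endpoints.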

Run the Backup Command
Use etcdctl to back up etcd:

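```
# Run as root: the key file and /opt are usually only accessible to root.
# sudo does not pass exported variables by default, so ETCDCTL_API is set inline.
sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /opt/etcd-backup.db
```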
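
Because a fixed file name is overwritten on every run, regular backups are often written to timestamped files. The sketch below wraps the backup command shown above; the function names snapshot_path and backup_etcd and the /opt/etcd-backups directory are hypothetical, and the certificate paths are the same ones used in this post.

```shell
# Hypothetical helper: build a timestamped snapshot path inside a directory.
snapshot_path() {
  dir="${1:-/opt/etcd-backups}"
  echo "$dir/etcd-$(date +%Y%m%d-%H%M%S).db"
}

# Hypothetical wrapper around the backup command shown above.
# It is only defined here; call "backup_etcd" as root on the node running etcd.
backup_etcd() {
  target="$(snapshot_path "$1")"
  mkdir -p "$(dirname "$target")" || return 1
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$target" && echo "saved $target"
}
```

Calling backup_etcd with no argument writes to the default directory; passing a directory, for example backup_etcd /mnt/backups, writes there instead.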

Verify the Backup
Check the size of the backup file:
```
 du -sh /opt/etcd-backup.db
```
To see the snapshot's hash, revision, total keys, and total size, use:
```
sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status /opt/etcd-backup.db
```
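
In a script, it is useful to stop early when the snapshot file is missing or empty. The helper below is a minimal sketch; check_snapshot is a hypothetical name, and the check only covers existence and size, not the integrity of the snapshot contents.

```shell
# Hypothetical helper: fail early if the snapshot is missing or empty.
# Usage: check_snapshot <snapshot-file>
check_snapshot() {
  file="$1"
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$file" ]; then
    echo "snapshot $file is missing or empty" >&2
    return 1
  fi
  echo "snapshot $file exists and is non-empty"
}
```

For example, check_snapshot /opt/etcd-backup.db returns a non-zero status if the backup step did not produce a usable file, so later steps can be skipped.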

Restore Steps

Simulate a Failure
For demonstration, delete some resources such as deployments or services.

Run the Restore Command
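Restore the snapshot into a new data directory. The restore works locally on the snapshot file and does not contact the running etcd server, so the endpoint and certificate flags are not needed. On etcd 3.5 and later, etcdutl snapshot restore is the preferred equivalent.

```
# Writes a fresh data directory; the target directory must not already exist.
sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
  --data-dir=/var/lib/etcd-restore-from-backup
```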

Update the ETCD Configuration
After restoring, update the etcd.yaml manifest file so that the --data-dir flag, the volume mountPath, and the hostPath path all point to the restored data directory.
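
The edit can be made by hand or scripted. The following is a sketch under the assumption that the manifest is kubeadm-style, so the old data directory /var/lib/etcd appears at the end of exactly those three lines; the function name set_etcd_data_dir is hypothetical.

```shell
# Hypothetical helper: point an etcd manifest at a new data directory.
# Rewrites every line that ends in the old path (the --data-dir flag,
# the mountPath, and the hostPath path) and leaves other lines untouched.
# Usage: set_etcd_data_dir <manifest> <old-dir> <new-dir>
set_etcd_data_dir() {
  manifest="$1"
  old_dir="$2"
  new_dir="$3"
  tmp="$(mktemp)" || return 1
  # "#" is the sed delimiter because the paths contain "/".
  if sed "s#${old_dir}\$#${new_dir}#" "$manifest" > "$tmp"; then
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$manifest"
  fi
  rm -f "$tmp"
}
```

On the node this would be run as root, for example set_etcd_data_dir /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd /var/lib/etcd-restore-from-backup. If you keep a backup copy of the manifest, store it outside /etc/kubernetes/manifests, since the kubelet may treat extra files in that directory as manifests.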

[Figure: etcd.yaml before and after the change]

Restart the kubelet
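Stop the kubelet, move all manifests temporarily to a directory under /tmp, then back to their original location, and start the kubelet again so that the components are recreated from the updated manifests:

```
sudo systemctl stop kubelet
# Use a dedicated directory so unrelated files in /tmp are not moved back.
sudo mkdir -p /tmp/k8s-manifests
sudo mv /etc/kubernetes/manifests/* /tmp/k8s-manifests/
sudo mv /tmp/k8s-manifests/* /etc/kubernetes/manifests/
# Reload systemd units before starting the kubelet, not after.
sudo systemctl daemon-reload
sudo systemctl start kubelet
```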

Verify the Restoration
Once the kubelet has restarted and the control plane pods are back, the resources deleted earlier should be up and running again, confirming that the restoration was successful.
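
A quick way to check this is to list the nodes, the control plane pods, and the workloads that were deleted earlier. The helper below is a sketch; check_cluster is a hypothetical name, and it assumes kubectl is installed and configured for this cluster.

```shell
# Hypothetical helper: list core resources to confirm the restored state.
check_cluster() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found" >&2
    return 1
  fi
  # Each command must succeed before the next one runs.
  kubectl get nodes &&
    kubectl get pods --namespace kube-system &&
    kubectl get deployments,services --all-namespaces
}
```

Running check_cluster a minute or two after the kubelet starts gives the API server and etcd time to come back up.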

Conclusion

Regular backups of etcd are crucial for ensuring the health, recoverability, and continuity of a Kubernetes cluster. Whether for disaster recovery, migration, or maintaining consistency, the backup and restore process helps safeguard the critical state of your cluster.

Resources I used

https://www.youtube.com/watch?v=R2wuFCYgnm4&list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC&index=36

ETCD Backup and Restore Explained: Day 35 of 40daysofkubernetes
