Day 35/40 Kubernetes ETCD Backup and Restore Explained

In this post, we will dive into Kubernetes' etcd component, explaining how to back it up and restore it. The health of a Kubernetes cluster is highly dependent on its etcd data store. Losing this data means losing the entire cluster's state and configuration. Therefore, it's essential to understand how to properly back up and restore etcd to ensure you can recover from a failure.

This guide will explain what etcd is, why it's crucial, and how to perform backup and restore operations on it. This information is vital for DevOps engineers, especially when managing production Kubernetes clusters.

What is etcd?

Etcd is a distributed, reliable key-value store used by Kubernetes to store all its cluster data. It holds critical information such as:

  • Cluster configuration

  • Pod, service, and deployment states

  • Role-based access control (RBAC) settings

  • Secrets, ConfigMaps, and other Kubernetes objects

In essence, etcd is the brain of a Kubernetes cluster. If etcd data is lost, the cluster will lose track of all running workloads, services, and configurations. Therefore, taking regular backups is a fundamental part of managing a Kubernetes cluster.

Why is etcd so important?

Because Kubernetes stores all its configuration data and state in etcd, losing this data could result in losing the entire cluster. This includes pod states, deployments, and services that make up your applications. In a worst-case scenario, if the etcd data becomes corrupt or is lost entirely without a backup, the cluster would need to be recreated from scratch.

Hence, regular etcd backups are crucial to ensure that in the case of data corruption, hardware failure, or even human error, the cluster can be restored and continue functioning as expected.

How to Take a Backup of etcd

Backups in Kubernetes are typically done using the etcdctl command-line tool. This tool allows you to interact directly with the etcd database to create backups, check the health of the cluster, and restore it in the event of a failure.

Prerequisites

Before you can perform an etcd backup, you need to have:

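  1. Access to a node that runs an etcd member (on most clusters, a control plane node). A snapshot can be taken from any healthy member, not only the leader.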

  2. The etcdctl command-line tool installed on the node.

To ensure that etcdctl communicates with your cluster correctly, you also need the connection settings for your etcd cluster. These can be passed as flags or set as environment variables such as ETCDCTL_API, ETCDCTL_ENDPOINTS, and ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY for TLS authentication.
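etcdctl reads each of its connection flags from an environment variable named ETCDCTL_ plus the upper-cased flag name, so the settings can be exported once per shell session. The following is a minimal sketch, assuming the kubeadm default certificate paths used in the examples in this post; adjust them for your cluster.

```shell
# Assumption: kubeadm-style certificate paths; adjust for your cluster.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With these exported, etcdctl picks up the connection settings itself:
#   etcdctl endpoint health
```

Flags given on the command line are the more explicit option in scripts; the exported variables are mainly a convenience for interactive work.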

Command to Take a Backup

Run the following command to take a backup of etcd:

ETCDCTL_API=3 etcdctl snapshot save <backup-file-path> \
--endpoints=https://127.0.0.1:2379 \
--cacert=<ca-cert-path> \
--cert=<cert-path> \
--key=<key-path>

  • ETCDCTL_API=3: Selects the etcd v3 API. etcdctl 3.4 and later use v3 by default, so this mainly matters on older versions.

  • etcdctl snapshot save <backup-file-path>: Creates a snapshot of the etcd database at the specified file path.

  • --endpoints: Defines the etcd endpoint, usually https://127.0.0.1:2379 for local etcd instances.

  • --cacert, --cert, and --key: Paths to the CA certificate, client certificate, and client key that etcdctl uses for TLS communication with the etcd cluster.

Example:

ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

This will create a backup of your etcd database and store it as /tmp/etcd-backup.db.
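Before relying on the file, it can help to inspect it. The sketch below wraps etcdctl snapshot status, which prints the snapshot's hash, revision, key count, and size. The function name check_snapshot is hypothetical, and the sketch assumes etcdctl is on the PATH (newer etcd releases also ship this subcommand in the separate etcdutl binary).

```shell
# Print metadata (hash, revision, total keys, size) for a snapshot file.
# check_snapshot is a hypothetical helper name, not part of etcdctl.
check_snapshot() {
  local snapshot="$1"
  # Fail early if the file is missing or empty.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}
```

For the example above, this would be called as check_snapshot /tmp/etcd-backup.db.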

Additional Tips for Backup

  • Automate the process: Schedule periodic backups using a cron job or other job-scheduling mechanisms.

  • Store backups securely: Store your backups in a safe, reliable, and off-site location, like cloud storage (e.g., AWS S3, GCS, etc.).

  • Monitor backup health: Regularly check the integrity of your backups and ensure that they can be restored.
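The automation tip above can be sketched as a small script that writes timestamped snapshots and keeps only the most recent ones. BACKUP_DIR, KEEP, and the function names are hypothetical, and the certificate paths are the kubeadm defaults from the earlier example; treat this as a starting point, not a finished backup tool.

```shell
#!/usr/bin/env bash
# Sketch of a cron-friendly etcd backup helper.
# Assumptions: BACKUP_DIR and KEEP are hypothetical settings;
# certificate paths are kubeadm defaults.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-7}"

# Build a timestamped snapshot path inside BACKUP_DIR.
snapshot_path() {
  echo "${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M%S).db"
}

# Delete all but the newest $KEEP snapshots (sorted by modification time).
# Parsing ls output is acceptable here because the file names are generated
# by this script and contain no spaces or newlines.
prune_snapshots() {
  ls -1t "${BACKUP_DIR}"/etcd-*.db 2>/dev/null \
    | tail -n +"$((KEEP + 1))" \
    | while read -r old; do
        rm -f -- "$old"
      done
}

# Take a snapshot, then prune old ones only if the snapshot succeeded.
backup_etcd() {
  mkdir -p "$BACKUP_DIR" || return 1
  local target
  target="$(snapshot_path)"
  ETCDCTL_API=3 etcdctl snapshot save "$target" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1
  prune_snapshots
}
```

Saved as a script that ends with a call to backup_etcd, this could be scheduled with a crontab entry such as 0 2 * * * /usr/local/bin/etcd-backup.sh (a hypothetical path). Copying the resulting file to off-site storage would be a separate step.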

How to Restore etcd from a Backup

If you encounter a scenario where the etcd database is corrupted or lost, restoring from a backup will be necessary. The restore process is also done using etcdctl.

Prerequisites for Restore

  • The backup file you created earlier.

  • Access to the etcd node where the restore will happen.

Command to Restore etcd

Use the following command to restore etcd from a backup file:

ETCDCTL_API=3 etcdctl snapshot restore <backup-file-path> \
--data-dir=<new-data-dir>

  • etcdctl snapshot restore <backup-file-path>: Restores the etcd data from the specified backup file.

  • --data-dir: Specifies a new directory for the restored etcd data. Note: You must specify a different directory than the original etcd data directory.

Example:

ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-restored

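Once you have restored the backup, point etcd at the restored data directory (/var/lib/etcd-restored in the above example) and restart it. If etcd runs as a systemd service, update the data directory in its configuration. If etcd runs as a static pod (the kubeadm default), update the hostPath of the data volume in the pod manifest.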
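For the static pod case, the manifest change can be sketched as follows. This assumes the kubeadm layout, where /etc/kubernetes/manifests/etcd.yaml mounts the data directory from a hostPath line reading path: /var/lib/etcd; the function name is hypothetical, and the manifest should be checked by hand before and after.

```shell
# Rewrite the etcd data hostPath in a static pod manifest.
# Assumptions: kubeadm layout, where the manifest has a line ending in
# "path: <old-data-dir>"; repoint_etcd_data_dir is a hypothetical helper.
repoint_etcd_data_dir() {
  local manifest="$1" old_dir="$2" new_dir="$3" tmp
  tmp="$(mktemp)" || return 1
  # Only lines ending in "path: <old_dir>" change; "mountPath:" lines do not
  # match, so the path seen inside the container stays the same.
  sed "s|path: ${old_dir}\$|path: ${new_dir}|" "$manifest" > "$tmp" || return 1
  # Write back through the original file to keep its owner and permissions.
  # The temporary copy lives outside the manifests directory on purpose,
  # since the kubelet may treat extra files there as pod definitions.
  cat "$tmp" > "$manifest" && rm -f "$tmp"
}
```

With the paths from this post, the call would be repoint_etcd_data_dir /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd /var/lib/etcd-restored. The kubelet watches the manifests directory and recreates the etcd pod when the file changes.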

Final Steps After Restoration

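  1. Restart the etcd service (this applies when etcd runs under systemd; a static pod is recreated by the kubelet when its manifest changes):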

     systemctl restart etcd
    
  2. Verify the health of your etcd cluster:

     ETCDCTL_API=3 etcdctl endpoint health \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key
    

If the etcd cluster is healthy, your Kubernetes control plane should reconnect to etcd, and the cluster state will match what was captured in the snapshot. Changes made after the snapshot was taken are not recovered.
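etcd and the API server can take a little while to come back after a restore, so a single health check may fail even when the restore worked. The sketch below is a generic retry helper; its name and arguments are hypothetical and it is not part of etcdctl or kubectl.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# retry is a hypothetical helper name.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the last allowed attempt has failed.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, retry 12 5 kubectl get nodes would poll the API server for up to about a minute, and the same helper can wrap the etcdctl endpoint health command shown above.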

Additional Tips for Restoring etcd

  • Test your backups: Regularly test backup restoration processes in non-production environments to ensure that your recovery plan works smoothly.

  • Prepare for downtime: Restoring from a backup will require cluster downtime, so plan maintenance windows accordingly.
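The "test your backups" tip can be practiced without touching a live cluster, because etcdctl snapshot restore works on the snapshot file alone. The following sketch restores into a throwaway directory and checks that the member directory was created. The function name restore_drill is hypothetical, and a successful drill shows only that the file is restorable, not that the data is recent.

```shell
# Restore a snapshot into a scratch directory to confirm it is usable.
# restore_drill is a hypothetical helper name.
restore_drill() {
  local snapshot="$1" scratch
  scratch="$(mktemp -d)" || return 1
  # Restore into a subdirectory that does not exist yet, since the restore
  # expects a fresh data directory.
  if ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" \
       --data-dir="$scratch/restore" \
     && [ -d "$scratch/restore/member" ]; then
    echo "restore drill OK: $snapshot"
    rm -rf "$scratch"
    return 0
  fi
  echo "restore drill FAILED: $snapshot" >&2
  rm -rf "$scratch"
  return 1
}
```

Running restore_drill /tmp/etcd-backup.db on a non-production machine after each backup is one way to catch unusable snapshots early.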

Conclusion

Backing up and restoring etcd is an essential part of managing Kubernetes clusters. Regular backups safeguard your cluster's critical data and configurations, allowing you to recover quickly in case of failure.

By automating the backup process and securely storing these backups, you'll minimize the risk of losing critical data. In the event of an issue, the restore process can bring your Kubernetes cluster back to life using the snapshots you've created.

For DevOps engineers managing production environments, following these steps and best practices will ensure your cluster remains resilient and recoverable, even in the face of disaster.

Reference

https://www.youtube.com/watch?v=R2wuFCYgnm4&list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC&index=37

Written by

Rahul Vadakkiniyil