Automated RDS Snapshots and Cross-Region DR

Beka

Automated RDS Snapshots and Cross-Region DR: A Comprehensive Guide

In today's data-driven world, keeping your databases safe and available is crucial. This guide walks you through setting up automated snapshots and cross-region disaster recovery (DR) for an Amazon RDS database using AWS Lambda and Amazon EventBridge, with Amazon SNS notifications and automatic deletion of old snapshots.

Architecture Overview

Before we dive into the implementation details, here is the overall architecture of the automated RDS snapshot and cross-region DR setup.

This architecture leverages AWS Lambda functions triggered by EventBridge rules to create regular snapshots of our RDS instance in the US East (N. Virginia) region. These snapshots are then automatically copied to the US West (Oregon) region for disaster recovery purposes. The system also includes a mechanism to clean up old snapshots, ensuring efficient resource management.
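The functions in the steps below each hard-code the same regions, instance name, and retention period. One way to keep those values in a single place is a small configuration object, sketched here. The `DRConfig` class and the environment variable names are hypothetical and not part of any AWS API; the defaults simply mirror the values used in this guide.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DRConfig:
    """Settings shared by the snapshot, copy, cleanup, and recovery functions."""
    source_region: str = "us-east-1"        # US East (N. Virginia)
    destination_region: str = "us-west-2"   # US West (Oregon)
    db_instance_identifier: str = "database-1"
    retention_days: int = 10

    @classmethod
    def from_env(cls, env=None):
        """Build a config from environment variables, falling back to the defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            source_region=env.get("SOURCE_REGION", defaults.source_region),
            destination_region=env.get("DESTINATION_REGION", defaults.destination_region),
            db_instance_identifier=env.get(
                "DB_INSTANCE_IDENTIFIER", defaults.db_instance_identifier
            ),
            retention_days=int(env.get("RETENTION_DAYS", defaults.retention_days)),
        )
```

Each Lambda function could call `DRConfig.from_env()` once at import time and read its settings from the function's environment variables instead of editing the code per environment.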

Implementation Steps

Step 1: Create an AWS Lambda Function for Automated Snapshots

  1. Navigate to AWS Lambda in the AWS Management Console.

  2. Create a new function named RDS_Snapshot_Automation using the Python 3.9 runtime (or a newer supported Python runtime).

  3. Ensure the function's execution role has the following managed policies attached:

    • AmazonRDSFullAccess

    • AmazonSNSFullAccess (for notifications)

    • AWSLambdaBasicExecutionRole

  4. Add the following Python code:

import boto3
from datetime import datetime

rds_client = boto3.client('rds')
sns_client = boto3.client('sns')

# Update this ARN to match your actual SNS topic ARN
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'

def lambda_handler(event, context):
    db_instance_identifier = 'database-1'
    current_time = datetime.now().strftime("%Y-%m-%d-%H-%M")
    snapshot_identifier = f"{db_instance_identifier}-snapshot-{current_time}"

    try:
        # Take a snapshot
        response = rds_client.create_db_snapshot(
            DBInstanceIdentifier=db_instance_identifier,
            DBSnapshotIdentifier=snapshot_identifier
        )
        message = f"Snapshot {snapshot_identifier} created successfully."
        print(message)

        # Publish success message to SNS
        sns_response = sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=message,
            Subject='RDS Snapshot Creation Success'
        )
        print(f"SNS publish response: {sns_response}")

    except Exception as e:
        error_message = f"Error: {str(e)}"
        print(error_message)

        # Publish failure message to SNS
        try:
            sns_response = sns_client.publish(
                TopicArn=SNS_TOPIC_ARN,
                Message=error_message,
                Subject='RDS Snapshot Creation Failed'
            )
            print(f"SNS publish response for error: {sns_response}")
        except Exception as sns_error:
            print(f"Failed to publish to SNS: {str(sns_error)}")

        # Report the failure instead of falling through to the success response
        return {
            'statusCode': 500,
            'body': error_message
        }

    return {
        'statusCode': 200,
        'body': 'Lambda function executed successfully'
    }
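The managed policies listed in step 3 grant far more than this function actually calls. As a rough sketch, a narrower inline policy could be generated as shown here. `build_snapshot_policy` is a hypothetical helper, the ARNs assume the instance and topic names used in this guide, and your setup may need further actions (for example for tagging or KMS). CloudWatch Logs access would still come from AWSLambdaBasicExecutionRole.

```python
import json


def build_snapshot_policy(account_id, region, db_instance, topic_name):
    """Return an IAM policy document (as a dict) scoped to one instance and one topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateSnapshots",
                "Effect": "Allow",
                "Action": ["rds:CreateDBSnapshot"],
                "Resource": [
                    # The source instance and the snapshots created from it
                    f"arn:aws:rds:{region}:{account_id}:db:{db_instance}",
                    f"arn:aws:rds:{region}:{account_id}:snapshot:{db_instance}-snapshot-*",
                ],
            },
            {
                "Sid": "PublishNotifications",
                "Effect": "Allow",
                "Action": ["sns:Publish"],
                "Resource": f"arn:aws:sns:{region}:{account_id}:{topic_name}",
            },
        ],
    }


# Example: render the policy as JSON for pasting into the IAM console
policy_json = json.dumps(
    build_snapshot_policy(
        "123456789012", "us-east-1", "database-1", "RDS-Snapshot-Notifications"
    ),
    indent=2,
)
```

The account ID above is a placeholder; substitute your own before attaching the policy to the function's role.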

Step 2: Create an EventBridge Rule for Automated Snapshots

  1. Go to Amazon EventBridge in the AWS Console.

  2. Create a new rule named RDS_Snapshot_6Hr_Trigger.

  3. Set the event source as Schedule and use the cron expression cron(0 */6 * * ? *) to trigger every 6 hours.

  4. Set the target as the RDS_Snapshot_Automation Lambda function.
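The same rule can also be created through the API instead of the console. The sketch below assumes you pass in boto3 `events` and `lambda` clients; `schedule_lambda` is a hypothetical helper name, while `put_rule`, `add_permission`, and `put_targets` are the boto3 calls it wraps.

```python
def schedule_lambda(events_client, lambda_client, rule_name, schedule, function_arn):
    """Create (or update) a scheduled rule and point it at a Lambda function."""
    # put_rule is an upsert: calling it again with the same name updates the rule
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function on behalf of this rule
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Attach the function as the rule's only target
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule_arn
```

For this guide it would be called with `boto3.client("events")`, `boto3.client("lambda")`, the rule name RDS_Snapshot_6Hr_Trigger, the schedule `cron(0 */6 * * ? *)`, and the ARN of RDS_Snapshot_Automation. Note that `add_permission` fails if a statement with the same ID already exists, so a rerun would need to handle that case.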

Step 3: Implement Cross-Region Snapshot Copying

  1. Create another Lambda function named RDS_Snapshot_Copy.

  2. Add the following Python code:

import boto3
from botocore.exceptions import ClientError

# Set up clients
rds_client = boto3.client('rds', region_name='us-east-1')
destination_region_client = boto3.client('rds', region_name='us-west-2')
sns_client = boto3.client('sns', region_name='us-east-1')

# Constants
SOURCE_REGION = 'us-east-1'
DESTINATION_REGION = 'us-west-2'
DB_INSTANCE_IDENTIFIER = 'database-1'
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'
ACCOUNT_ID = 'YOUR_ACCOUNT_ID'
DESTINATION_KMS_KEY_ID = 'YOUR_KMS_KEY_ID'  # Replace with your KMS key ID in us-west-2

def lambda_handler(event, context):
    try:
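        # Get the latest completed manual snapshot. Automated snapshots have
        # names starting with "rds:", which is not valid in the copy identifier,
        # and snapshots that are still being created cannot be copied yet.
        snapshots = rds_client.describe_db_snapshots(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
            SnapshotType='manual'
        )['DBSnapshots']
        snapshots = [s for s in snapshots if s['Status'] == 'available']
        if not snapshots:
            raise ValueError(f"No available manual snapshots found for {DB_INSTANCE_IDENTIFIER}")

        latest_snapshot = max(snapshots, key=lambda x: x['SnapshotCreateTime'])
        snapshot_identifier = latest_snapshot['DBSnapshotIdentifier']
        copy_identifier = f"{snapshot_identifier}-copy-{DESTINATION_REGION}"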

        # Construct the source ARN
        source_arn = f'arn:aws:rds:{SOURCE_REGION}:{ACCOUNT_ID}:snapshot:{snapshot_identifier}'

        # Copy snapshot to another region
        response = destination_region_client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=source_arn,
            TargetDBSnapshotIdentifier=copy_identifier,
            SourceRegion=SOURCE_REGION,
            KmsKeyId=DESTINATION_KMS_KEY_ID,
            CopyTags=True
        )

        message = f"Snapshot {snapshot_identifier} copy initiated to {DESTINATION_REGION} as {copy_identifier}."
        print(message)

        # Publish success message to SNS
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=message,
            Subject=f'RDS Snapshot Copy Initiated to {DESTINATION_REGION}'
        )

        return {
            'statusCode': 200,
            'body': message
        }
    except ClientError as e:
        error_message = f"AWS Error copying snapshot: {e.response['Error']['Message']}"
        print(error_message)
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=error_message,
            Subject=f'RDS Snapshot Copy Failed to {DESTINATION_REGION}'
        )
        return {
            'statusCode': 500,
            'body': error_message
        }
    except Exception as e:
        error_message = f"Unexpected error: {str(e)}"
        print(error_message)
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=error_message,
            Subject=f'RDS Snapshot Copy Failed to {DESTINATION_REGION}'
        )
        return {
            'statusCode': 500,
            'body': error_message
        }
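
  3. Create an EventBridge rule named RDS_Snapshot_Copy_6Hr_Trigger to trigger this function every 6 hours. Offsetting it from the snapshot schedule, for example with cron(30 */6 * * ? *), gives the newest snapshot time to become available before the copy runs.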
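The copy function above only looks at the newest snapshot, so a run that fails leaves a gap in the destination region. One possible refinement is to compare both regions and copy whatever is missing. The helper below is a sketch that works on the lists returned by `describe_db_snapshots` in each region; the name `snapshots_missing_copy` is hypothetical, and it assumes the `-copy-<region>` naming scheme used in this guide.

```python
def snapshots_missing_copy(source_snapshots, destination_snapshots, destination_region):
    """Return available manual source snapshots that have no copy yet, oldest first."""
    existing = {s["DBSnapshotIdentifier"] for s in destination_snapshots}
    pending = [
        s for s in source_snapshots
        if s.get("SnapshotType") == "manual"
        and s.get("Status") == "available"
        # A copy is named "<source identifier>-copy-<destination region>"
        and f"{s['DBSnapshotIdentifier']}-copy-{destination_region}" not in existing
    ]
    return sorted(pending, key=lambda s: s["SnapshotCreateTime"])
```

The Lambda handler could then loop over the result and call `copy_db_snapshot` for each entry, bearing in mind that RDS limits how many cross-region copies can be in progress at once.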

Step 4: Automate Deletion of Old Snapshots

  1. Create a Lambda function named RDS_Delete_Old_Snapshots.

  2. Add the following Python code:

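import boto3
from datetime import datetime, timedelta, timezone

rds_client = boto3.client('rds')
sns_client = boto3.client('sns')

SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'

def lambda_handler(event, context):
    retention_days = 10
    db_instance_identifier = 'database-1'
    # Only manual snapshots can be deleted with delete_db_snapshot
    snapshots = rds_client.describe_db_snapshots(
        DBInstanceIdentifier=db_instance_identifier,
        SnapshotType='manual'
    )['DBSnapshots']
    # Skip snapshots that are still being created
    snapshots = [s for s in snapshots if s['Status'] == 'available']
    # SnapshotCreateTime is timezone-aware (UTC), so the cutoff must be as well;
    # comparing it with a naive datetime raises a TypeError
    deletion_time = datetime.now(timezone.utc) - timedelta(days=retention_days)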

    for snapshot in snapshots:
        snapshot_creation_time = snapshot['SnapshotCreateTime']
        snapshot_identifier = snapshot['DBSnapshotIdentifier']

        if snapshot_creation_time < deletion_time:
            try:
                # Delete old snapshot
                rds_client.delete_db_snapshot(DBSnapshotIdentifier=snapshot_identifier)
                message = f"Deleted snapshot: {snapshot_identifier}"
                print(message)

                # Publish success message to SNS
                sns_client.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Message=message,
                    Subject='RDS Snapshot Deletion Success'
                )
            except Exception as e:
                error_message = f"Error deleting snapshot {snapshot_identifier}: {e}"
                print(error_message)

                # Publish failure message to SNS
                sns_client.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Message=error_message,
                    Subject='RDS Snapshot Deletion Failed'
                )

  3. Create an EventBridge rule to run this function daily using the cron expression cron(0 0 * * ? *).
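The cleanup function above uses the default-region client, so the copies in US West (Oregon) are not touched by it. One way to reuse the retention rule for both regions is to separate the selection logic from the AWS calls. The sketch below is an assumption about how that could look: `expired_snapshots` is a hypothetical helper that takes the `DBSnapshots` list from `describe_db_snapshots` and returns the identifiers to delete.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, retention_days, now=None):
    """Return identifiers of manual, available snapshots older than the retention window."""
    # Use a timezone-aware "now" so it compares cleanly with SnapshotCreateTime
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [
        s["DBSnapshotIdentifier"]
        for s in snapshots
        if s.get("SnapshotType") == "manual"
        and s.get("Status") == "available"
        and s["SnapshotCreateTime"] < cutoff
    ]
```

A handler could call this once with the snapshots from a us-east-1 client and once with those from a us-west-2 client, passing each identifier to `delete_db_snapshot` on the matching client.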

Disaster Recovery Process

In the event of a failure in the primary region, US East (N. Virginia), follow these steps to recover in the US West (Oregon) region:

  1. Trigger the Recovery Lambda function in US West (Oregon).

  2. The Recovery Lambda will:

    a. Identify the latest copied snapshot in US West (Oregon).

    b. Initiate the restore process from this snapshot to create a new RDS instance.

    c. Update Route 53 DNS records to point to the new RDS instance.

Here's sample code for the Recovery Lambda function:

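import boto3

rds_client = boto3.client('rds', region_name='us-west-2')
route53_client = boto3.client('route53')

def lambda_handler(event, context):
    # Find the latest completed manual snapshot in this region
    # (cross-region copies are stored as manual snapshots)
    snapshots = rds_client.describe_db_snapshots(
        SnapshotType='manual',
        IncludeShared=False,
        IncludePublic=False
    )['DBSnapshots']
    snapshots = [s for s in snapshots if s['Status'] == 'available']
    if not snapshots:
        raise ValueError("No available snapshots found in us-west-2")

    latest_snapshot = max(snapshots, key=lambda x: x['SnapshotCreateTime'])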

    # Restore the RDS instance from the snapshot
    new_instance_identifier = f"restored-{latest_snapshot['DBSnapshotIdentifier']}"
    response = rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_identifier,
        DBSnapshotIdentifier=latest_snapshot['DBSnapshotIdentifier'],
        PubliclyAccessible=False
    )

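    # Wait for the instance to be available. A restore can take longer than
    # Lambda's 15-minute maximum timeout, so raise the function timeout and,
    # for large databases, consider polling from a separate invocation instead.
    waiter = rds_client.get_waiter('db_instance_available')
    waiter.wait(DBInstanceIdentifier=new_instance_identifier)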

    # Get the new endpoint
    instance_info = rds_client.describe_db_instances(DBInstanceIdentifier=new_instance_identifier)
    new_endpoint = instance_info['DBInstances'][0]['Endpoint']['Address']

    # Update Route 53
    route53_client.change_resource_record_sets(
        HostedZoneId='YOUR_HOSTED_ZONE_ID',
        ChangeBatch={
            'Changes': [
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': 'your-db-domain.com',
                        'Type': 'CNAME',
                        'TTL': 300,
                        'ResourceRecords': [{'Value': new_endpoint}]
                    }
                }
            ]
        }
    )

    return {
        'statusCode': 200,
        'body': f'Recovery completed. New instance: {new_instance_identifier}, Endpoint: {new_endpoint}'
    }

This Recovery Lambda function should be set up in the US West (Oregon) region and can be triggered manually or automatically, depending on your disaster recovery plan.
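Because the waiter in the sample blocks until the instance is ready, a long restore can run past Lambda's 15-minute limit. One alternative, sketched here under the assumption that something else (a scheduled rule or a Step Functions loop, for instance) re-invokes the check, is a non-blocking status probe. `restored_endpoint` is a hypothetical helper name; `describe_db_instances` is the same boto3 call used in the sample above.

```python
def restored_endpoint(rds_client, instance_identifier):
    """Return the endpoint address once the instance is available, otherwise None."""
    instances = rds_client.describe_db_instances(
        DBInstanceIdentifier=instance_identifier
    )["DBInstances"]
    instance = instances[0]
    # The Endpoint block is only populated once the instance has finished creating
    if instance["DBInstanceStatus"] != "available":
        return None
    return instance["Endpoint"]["Address"]
```

The first invocation would start the restore and return immediately. Later invocations would call this helper and update Route 53 only once it returns an address.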

Conclusion

By implementing this automated RDS snapshot and cross-region DR solution, you've significantly enhanced your database's resilience and recovery capabilities. Regular testing of the recovery process is crucial to ensure its effectiveness in real-world scenarios.

Remember to adjust the IAM roles, KMS keys, and other AWS resource identifiers to match your specific setup. Happy coding, and may your databases always be safe and available!
