Automated RDS Snapshots and Cross-Region DR: A Comprehensive Guide
Losing a database, or the region it runs in, should not mean losing your data. This guide walks through setting up automated snapshots, cross-region disaster recovery (DR), and automatic deletion of old snapshots for an Amazon RDS database using AWS Lambda and EventBridge.
Architecture Overview
Before diving into the implementation details, here is the overall architecture of the automated RDS snapshot and cross-region DR setup.
This architecture leverages AWS Lambda functions triggered by EventBridge rules to create regular snapshots of our RDS instance in the US East (N. Virginia) region. These snapshots are then automatically copied to the US West (Oregon) region for disaster recovery purposes. The system also includes a mechanism to clean up old snapshots, ensuring efficient resource management.
Implementation Steps
Step 1: Create an AWS Lambda Function for Automated Snapshots
Navigate to AWS Lambda in the AWS Management Console.
Create a new function named RDS_Snapshot_Automation using the Python 3.9 runtime.
Ensure the function has the necessary permissions:
- AmazonRDSFullAccess
- AmazonSNSFullAccess (for notifications)
- AWSLambdaBasicExecutionRole
Add the following Python code:
import boto3
from datetime import datetime

rds_client = boto3.client('rds')
sns_client = boto3.client('sns')

# Update this ARN to match your actual SNS topic ARN
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'

def lambda_handler(event, context):
    db_instance_identifier = 'database-1'
    current_time = datetime.now().strftime("%Y-%m-%d-%H-%M")
    snapshot_identifier = f"{db_instance_identifier}-snapshot-{current_time}"

    try:
        # Take a snapshot
        response = rds_client.create_db_snapshot(
            DBInstanceIdentifier=db_instance_identifier,
            DBSnapshotIdentifier=snapshot_identifier
        )
        message = f"Snapshot {snapshot_identifier} created successfully."
        print(message)

        # Publish success message to SNS
        sns_response = sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=message,
            Subject='RDS Snapshot Creation Success'
        )
        print(f"SNS publish response: {sns_response}")
    except Exception as e:
        error_message = f"Error: {str(e)}"
        print(error_message)

        # Publish failure message to SNS
        try:
            sns_response = sns_client.publish(
                TopicArn=SNS_TOPIC_ARN,
                Message=error_message,
                Subject='RDS Snapshot Creation Failed'
            )
            print(f"SNS publish response for error: {sns_response}")
        except Exception as sns_error:
            print(f"Failed to publish to SNS: {str(sns_error)}")

    return {
        'statusCode': 200,
        'body': 'Lambda function executed successfully'
    }
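The managed policies listed for this function grant much more than the code above uses, which only calls create_db_snapshot and publish. As a rough sketch of a narrower alternative, the helper below builds an inline policy document for those two calls. The function name, the 123456789012 account ID, and the snapshot name pattern are illustrative assumptions, not part of the original setup; logging permissions would still come from AWSLambdaBasicExecutionRole.

```python
import json


def snapshot_lambda_policy(region, account_id, db_instance_id, topic_name):
    """Build an IAM policy document limited to the calls the snapshot function makes.

    Hypothetical helper for illustration; adjust the ARNs to your own resources.
    """
    rds_prefix = f"arn:aws:rds:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateSnapshotsOfOneInstance",
                "Effect": "Allow",
                "Action": ["rds:CreateDBSnapshot"],
                "Resource": [
                    # The instance being snapshotted
                    f"{rds_prefix}:db:{db_instance_id}",
                    # Snapshots named the way the function names them
                    f"{rds_prefix}:snapshot:{db_instance_id}-snapshot-*",
                ],
            },
            {
                "Sid": "PublishNotifications",
                "Effect": "Allow",
                "Action": ["sns:Publish"],
                "Resource": [f"arn:aws:sns:{region}:{account_id}:{topic_name}"],
            },
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID
    policy = snapshot_lambda_policy(
        "us-east-1", "123456789012", "database-1", "RDS-Snapshot-Notifications"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into an inline policy on the function's execution role. Treat it as a starting point and verify it against your own account before dropping the broader managed policies.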
Step 2: Create an EventBridge Rule for Automated Snapshots
Go to Amazon EventBridge in the AWS Console.
Create a new rule named RDS_Snapshot_6Hr_Trigger.
Set the event source as Schedule and use the cron expression cron(0 */6 * * ? *) to trigger every 6 hours.
Set the target as the RDS_Snapshot_Automation Lambda function.
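If you would rather script this step than click through the console, the same rule can be created with boto3. The sketch below is a minimal version under a few assumptions: the helper name schedule_lambda and the statement ID format are made up for illustration, and the clients are passed in so the function can be exercised without AWS credentials.

```python
def schedule_lambda(events_client, lambda_client, rule_name, schedule, function_arn):
    """Create (or update) a scheduled EventBridge rule that invokes a Lambda function.

    events_client and lambda_client are expected to be boto3 clients, e.g.
    boto3.client("events") and boto3.client("lambda").
    """
    # put_rule creates the rule, or updates it if the name already exists
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function on behalf of this rule
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",  # hypothetical naming scheme
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Attach the function as the rule's target
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule_arn


# Example call (requires boto3 and AWS credentials):
# import boto3
# schedule_lambda(
#     boto3.client("events"), boto3.client("lambda"),
#     "RDS_Snapshot_6Hr_Trigger", "cron(0 */6 * * ? *)",
#     "arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:RDS_Snapshot_Automation",
# )
```

Note that add_permission fails if a statement with the same ID already exists, so a second run would need to catch that error or use a different statement ID.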
Step 3: Implement Cross-Region Snapshot Copying
Create another Lambda function named RDS_Snapshot_Copy.
Add the following Python code:
import boto3
from botocore.exceptions import ClientError

# Set up clients
rds_client = boto3.client('rds', region_name='us-east-1')
destination_region_client = boto3.client('rds', region_name='us-west-2')
sns_client = boto3.client('sns', region_name='us-east-1')

# Constants
SOURCE_REGION = 'us-east-1'
DESTINATION_REGION = 'us-west-2'
DB_INSTANCE_IDENTIFIER = 'database-1'
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'
ACCOUNT_ID = 'YOUR_ACCOUNT_ID'
DESTINATION_KMS_KEY_ID = 'YOUR_KMS_KEY_ID'  # Replace with your KMS key ID in us-west-2

def lambda_handler(event, context):
    try:
        # Get the latest manual snapshot. Automated snapshots are excluded
        # because their identifiers start with "rds:", which is not valid
        # inside the target identifier built below.
        snapshots = rds_client.describe_db_snapshots(
            DBInstanceIdentifier=DB_INSTANCE_IDENTIFIER,
            SnapshotType='manual'
        )['DBSnapshots']
        # Skip snapshots that are still being created
        snapshots = [s for s in snapshots if s['Status'] == 'available']
        if not snapshots:
            raise ValueError(f"No available snapshots found for {DB_INSTANCE_IDENTIFIER}")

        latest_snapshot = max(snapshots, key=lambda x: x['SnapshotCreateTime'])
        snapshot_identifier = latest_snapshot['DBSnapshotIdentifier']
        copy_identifier = f"{snapshot_identifier}-copy-{DESTINATION_REGION}"

        # Construct the source ARN
        source_arn = f'arn:aws:rds:{SOURCE_REGION}:{ACCOUNT_ID}:snapshot:{snapshot_identifier}'

        # Copy snapshot to another region
        response = destination_region_client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=source_arn,
            TargetDBSnapshotIdentifier=copy_identifier,
            SourceRegion=SOURCE_REGION,
            KmsKeyId=DESTINATION_KMS_KEY_ID,
            CopyTags=True
        )
        message = f"Snapshot {snapshot_identifier} copy initiated to {DESTINATION_REGION} as {copy_identifier}."
        print(message)

        # Publish success message to SNS
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=message,
            Subject=f'RDS Snapshot Copy Initiated to {DESTINATION_REGION}'
        )
        return {
            'statusCode': 200,
            'body': message
        }
    except ClientError as e:
        error_message = f"AWS Error copying snapshot: {e.response['Error']['Message']}"
        print(error_message)
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=error_message,
            Subject=f'RDS Snapshot Copy Failed to {DESTINATION_REGION}'
        )
        return {
            'statusCode': 500,
            'body': error_message
        }
    except Exception as e:
        error_message = f"Unexpected error: {str(e)}"
        print(error_message)
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=error_message,
            Subject=f'RDS Snapshot Copy Failed to {DESTINATION_REGION}'
        )
        return {
            'statusCode': 500,
            'body': error_message
        }
- Create an EventBridge rule named RDS_Snapshot_Copy_6Hr_Trigger to trigger this function every 6 hours.
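Because the snapshot rule and the copy rule both fire every 6 hours, a copy run can start while the newest snapshot is still being created, and a snapshot that a previous run missed is never picked up later. One possible way to make the copy step more forgiving is to compare the two regions and copy whatever is missing. The helper below is a sketch of that comparison only; its name is hypothetical, and it assumes copies follow the "-copy-{region}" naming used in the function above.

```python
def snapshots_needing_copy(source_snapshots, destination_snapshots, destination_region):
    """Return identifiers of source snapshots that are ready but not yet copied.

    Both arguments are lists shaped like the 'DBSnapshots' entries returned by
    describe_db_snapshots in the respective region.
    """
    existing = {s["DBSnapshotIdentifier"] for s in destination_snapshots}
    pending = []
    for snap in source_snapshots:
        if snap.get("Status") != "available":
            continue  # not ready to copy yet; a later run can pick it up
        copy_id = f"{snap['DBSnapshotIdentifier']}-copy-{destination_region}"
        if copy_id not in existing:
            pending.append(snap["DBSnapshotIdentifier"])
    return pending
```

The copy function could then loop over the returned identifiers instead of taking only the latest snapshot, which also makes repeated runs harmless.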
Step 4: Automate Deletion of Old Snapshots
Create a Lambda function named RDS_Delete_Old_Snapshots.
Add the following Python code:
import boto3
from datetime import datetime, timedelta, timezone

rds_client = boto3.client('rds')
sns_client = boto3.client('sns')

SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:RDS-Snapshot-Notifications'

def lambda_handler(event, context):
    retention_days = 10
    db_instance_identifier = 'database-1'

    # Only manual snapshots can be deleted with delete_db_snapshot
    snapshots = rds_client.describe_db_snapshots(
        DBInstanceIdentifier=db_instance_identifier,
        SnapshotType='manual'
    )['DBSnapshots']

    # SnapshotCreateTime is timezone-aware, so the cutoff must be as well;
    # comparing it with a naive datetime raises a TypeError
    deletion_time = datetime.now(timezone.utc) - timedelta(days=retention_days)

    for snapshot in snapshots:
        snapshot_creation_time = snapshot.get('SnapshotCreateTime')
        snapshot_identifier = snapshot['DBSnapshotIdentifier']
        if snapshot_creation_time is None:
            continue  # snapshot is still being created

        if snapshot_creation_time < deletion_time:
            try:
                # Delete old snapshot
                rds_client.delete_db_snapshot(DBSnapshotIdentifier=snapshot_identifier)
                message = f"Deleted snapshot: {snapshot_identifier}"
                print(message)

                # Publish success message to SNS
                sns_client.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Message=message,
                    Subject='RDS Snapshot Deletion Success'
                )
            except Exception as e:
                error_message = f"Error deleting snapshot {snapshot_identifier}: {e}"
                print(error_message)

                # Publish failure message to SNS
                sns_client.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Message=error_message,
                    Subject='RDS Snapshot Deletion Failed'
                )
- Create an EventBridge rule to run this function daily using the cron expression cron(0 0 * * ? *).
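As written, the deletion function only looks at the region it runs in, so the copies in US West (Oregon) would keep accumulating. One way to cover both regions is to separate the "which snapshots are expired" decision from the API calls and apply it to each region's snapshot list. The helper below is a sketch under that assumption; its name and the optional now parameter are illustrative.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshot_ids(snapshots, retention_days, now=None):
    """Return identifiers of available snapshots older than the retention period.

    snapshots is a list shaped like the 'DBSnapshots' entries returned by
    describe_db_snapshots. now defaults to the current UTC time and exists
    mainly so the function can be tested with a fixed clock.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [
        s["DBSnapshotIdentifier"]
        for s in snapshots
        if s.get("Status") == "available"
        and "SnapshotCreateTime" in s  # absent while a snapshot is still creating
        and s["SnapshotCreateTime"] < cutoff
    ]


# Possible use inside the Lambda (requires boto3 and AWS credentials):
# for region in ("us-east-1", "us-west-2"):
#     client = boto3.client("rds", region_name=region)
#     snaps = client.describe_db_snapshots(SnapshotType="manual")["DBSnapshots"]
#     for snapshot_id in expired_snapshot_ids(snaps, retention_days=10):
#         client.delete_db_snapshot(DBSnapshotIdentifier=snapshot_id)
```

You may want a longer retention period in the DR region than in the primary one; passing retention_days per region keeps that choice explicit.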
Disaster Recovery Process
In the event of a failure in the primary region, US East (N. Virginia), follow these steps to recover in the US West (Oregon) region:
Trigger the Recovery Lambda function in US West (Oregon).
The Recovery Lambda will:
a. Identify the latest copied snapshot in US West (Oregon).
b. Initiate the restore process from this snapshot to create a new RDS instance.
c. Update Route 53 DNS records to point to the new RDS instance.
Here is sample code for the Recovery Lambda function:
import boto3

rds_client = boto3.client('rds', region_name='us-west-2')
route53_client = boto3.client('route53')

def lambda_handler(event, context):
    # Find the latest snapshot
    snapshots = rds_client.describe_db_snapshots(
        SnapshotType='manual',
        IncludeShared=False,
        IncludePublic=False
    )['DBSnapshots']
    # Only fully copied snapshots can be restored
    snapshots = [s for s in snapshots if s['Status'] == 'available']
    if not snapshots:
        raise ValueError("No available snapshots found in us-west-2")
    latest_snapshot = max(snapshots, key=lambda x: x['SnapshotCreateTime'])

    # Restore the RDS instance from the snapshot
    new_instance_identifier = f"restored-{latest_snapshot['DBSnapshotIdentifier']}"
    response = rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_identifier,
        DBSnapshotIdentifier=latest_snapshot['DBSnapshotIdentifier'],
        PubliclyAccessible=False
    )

    # Wait for the instance to be available.
    # Note: a restore can take longer than Lambda's 15-minute maximum timeout,
    # so set the function timeout accordingly and test with your own data size.
    waiter = rds_client.get_waiter('db_instance_available')
    waiter.wait(DBInstanceIdentifier=new_instance_identifier)

    # Get the new endpoint
    instance_info = rds_client.describe_db_instances(DBInstanceIdentifier=new_instance_identifier)
    new_endpoint = instance_info['DBInstances'][0]['Endpoint']['Address']

    # Update Route 53
    route53_client.change_resource_record_sets(
        HostedZoneId='YOUR_HOSTED_ZONE_ID',
        ChangeBatch={
            'Changes': [
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': 'your-db-domain.com',
                        'Type': 'CNAME',
                        'TTL': 300,
                        'ResourceRecords': [{'Value': new_endpoint}]
                    }
                }
            ]
        }
    )

    return {
        'statusCode': 200,
        'body': f'Recovery completed. New instance: {new_instance_identifier}, Endpoint: {new_endpoint}'
    }
This Recovery Lambda function should be set up in the US West (Oregon) region and can be triggered manually or automatically, depending on your disaster recovery plan.
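A single Lambda invocation can run for at most 15 minutes, and restoring a large snapshot may take longer than that. If waiting inside one invocation turns out to be too tight, one option is to split the recovery in two: one invocation starts the restore, and a later one (for example on a short schedule) checks whether the instance is ready before updating DNS. The helper below sketches only that readiness check; its name is hypothetical, and the client is passed in so it can be tried without AWS access.

```python
def endpoint_if_available(rds_client, instance_identifier):
    """Return the instance endpoint address once it is available, else None.

    rds_client is expected to be a boto3 RDS client for the DR region,
    e.g. boto3.client("rds", region_name="us-west-2").
    """
    instance = rds_client.describe_db_instances(
        DBInstanceIdentifier=instance_identifier
    )["DBInstances"][0]
    # The Endpoint field may be missing until the instance has finished creating,
    # so check the status before reading it
    if instance["DBInstanceStatus"] != "available":
        return None
    return instance["Endpoint"]["Address"]
```

A follow-up invocation would call this helper and, once it returns an address, perform the same Route 53 UPSERT shown in the recovery function.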
Conclusion
By implementing this automated RDS snapshot and cross-region DR solution, you've significantly enhanced your database's resilience and recovery capabilities. Regular testing of the recovery process is crucial to ensure its effectiveness in real-world scenarios.
Remember to adjust the IAM roles, KMS keys, and other AWS resource identifiers to match your specific setup. Happy coding, and may your databases always be safe and available!