Stop Babysitting Your Storage: My 15-Minute AWS Auto-Scaling Solution
Hey there! 👋 Let me walk you through something incredibly cool - automatic EBS volume resizing for your AWS Auto Scaling Groups. You know that panic when your disk space starts running low? We're going to make that a thing of the past by building a completely automated solution.
Think of this as giving your AWS infrastructure a smart brain that:
Watches your disk space like a hawk 🦅
Spots when things are getting tight
Automatically grows your storage before it becomes a problem
The best part? Once we set this up, it just works. Forever. Let's dive in!
Step 1: Setting Up Our Disk Space Monitor 📊
First, we need to give our EC2 instances the power to report their disk usage. It's like installing a smart meter in your house - it needs to be there before we can do anything clever with the readings.
SSH into one of your ASG instances:
```bash
ssh -i <your-key> user@address
```
Now, let's install our monitoring agent. Copy these commands exactly:
```bash
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
sudo systemctl start amazon-cloudwatch-agent
```
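By the way, the wizard just writes a JSON config file. If you ever need to reproduce the setup without the interactive prompts, here's a minimal sketch (in Python for consistency; the schema follows the CloudWatch agent docs, and the file path is the agent's default config location) that collects disk_used_percent for the root filesystem:

```python
import json

# Minimal agent config: collect disk_used_percent for "/" every 60 seconds.
# The agent tags disk metrics with host, path, device, and fstype
# dimensions by default, which is what our Lambda queries later.
config = {
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            "disk": {
                "measurement": ["used_percent"],
                "resources": ["/"],
                "metrics_collection_interval": 60
            }
        }
    }
}

with open("/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json", "w") as f:
    json.dump(config, f, indent=2)

# Then load it with:
#   sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
#     -a fetch-config -m ec2 \
#     -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s
```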
Next, you can create a new image (AMI) from the EC2 instance and add it to your ASG Launch Template as a new version, or you can use AWS Systems Manager (SSM) to install and configure the CloudWatch agent across your entire ASG!
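If you go the SSM route, here's a minimal sketch of pushing the agent to every instance in the group at once, using the tag that Auto Scaling automatically applies to its instances (it assumes your instances already run the SSM agent with a suitable instance profile):

```python
import boto3

ssm = boto3.client("ssm")

# Install the CloudWatch agent on every instance carrying the ASG's tag.
ssm.send_command(
    Targets=[{"Key": "tag:aws:autoscaling:groupName",
              "Values": ["EndowdAutoScalingGroup"]}],
    DocumentName="AWS-ConfigureAWSPackage",
    Parameters={"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
)
```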
🚨 Pro Tip: Don't forget to attach the CloudWatchAgentServerPolicy to your instance's IAM role! This is what grants it permission to send readings back to AWS.
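If you're scripting your setup, attaching that policy is a one-liner with boto3 (the role name here is a placeholder for whatever your instance profile's role is called):

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS-managed policy that lets the agent publish metrics.
iam.attach_role_policy(
    RoleName="my-asg-instance-role",  # placeholder: your instance role's name
    PolicyArn="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy",
)
```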
Here's a gotcha that could save you hours of debugging: if you're in a corporate environment and hit permission errors while creating IAM roles, start fresh with a new role. Even if an error-prone role appears to work in the console, it might be silently broken.
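One way to check a suspect role without trusting the console: ask IAM to simulate it. A quick sketch (the role ARN is a placeholder):

```python
import boto3

iam = boto3.client("iam")

# Simulate whether the role is actually allowed to publish metrics.
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::<account-id>:role/my-asg-instance-role",
    ActionNames=["cloudwatch:PutMetricData"],
)
for r in result["EvaluationResults"]:
    print(r["EvalActionName"], "->", r["EvalDecision"])  # e.g. "allowed"
```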
Step 2: Creating Our Data Collector Function 🤖
Let's create a Lambda function that will be our central data collector. Think of it as the brain that:
Keeps track of all your EC2 instances
Gathers their disk usage data
Calculates averages to make smart decisions
Here's where we create listFetchCalAvgPub, a Lambda function that's way smarter than its somewhat awkward name suggests. This is essentially your disk space surveillance system.
Head to AWS Lambda and create a new function with Python 3.9. Here's the code that makes the magic happen:
```python
import boto3
from datetime import datetime, timedelta
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Initialize our AWS toolset
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    # Replace this with your ASG name
    asg_name = 'EndowdAutoScalingGroup'
```
This initializes our AWS toolset and ASG name.
```python
    response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    if not response['AutoScalingGroups']:
        logger.warning(f"No ASG found with name: {asg_name}")
        return {
            'statusCode': 404,
            'body': f'ASG {asg_name} not found'
        }
```
This code is like a safety net - it makes sure your ASG actually exists before trying to do anything fancy.
Here's where it gets interesting:
```python
    metrics = []
    for instance_id, private_ip in instance_ips.items():
        response = cw.get_metric_statistics(
            Namespace='CWAgent',
            MetricName='disk_used_percent',
            Dimensions=[
                {'Name': 'host', 'Value': f'ip-{private_ip.replace(".", "-")}'},
                {'Name': 'path', 'Value': '/'},
                {'Name': 'device', 'Value': 'nvme0n1p1'},
                {'Name': 'fstype', 'Value': 'ext4'}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=['Average']
        )
```
This part is crucial - it's checking each instance's disk usage with incredible precision. It's like having a microscope for your storage usage! 🔬
And finally, we calculate the average disk usage metric for the whole ASG! But wait, there's more: we also publish the result as a custom metric to CloudWatch.
```python
    if metrics:
        avg_disk_usage = round(sum(metrics) / len(metrics), 2)
        logger.info(f"Average disk usage calculated: {avg_disk_usage}%")
        cw.put_metric_data(
            Namespace='CustomASGMetrics',
            MetricData=[
                {
                    'MetricName': 'AverageDiskUsagePercent',
                    'Dimensions': [
                        {'Name': 'AutoScalingGroupName', 'Value': asg_name},
                    ],
                    'Value': avg_disk_usage,
                    'Unit': 'Percent'
                },
            ]
        )
```
This function is like having a super-smart assistant that:
Knows about every instance in your ASG
Checks their disk usage
Does the math to figure out if you're heading for trouble
Full code:
```python
import boto3
from datetime import datetime, timedelta
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    asg_name = 'EndowdAutoScalingGroup'

    # Safety net: make sure the ASG exists before doing anything else
    response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    if not response['AutoScalingGroups']:
        logger.warning(f"No ASG found with name: {asg_name}")
        return {
            'statusCode': 404,
            'body': f'ASG {asg_name} not found'
        }

    instances = response['AutoScalingGroups'][0]['Instances']
    if not instances:
        logger.info(f"ASG {asg_name} exists but has no instances")
        return {
            'statusCode': 200,
            'body': f'ASG {asg_name} has no instances'
        }

    instance_ids = [i['InstanceId'] for i in instances]
    logger.info(f"Found {len(instance_ids)} instances in ASG {asg_name}")

    # Map each instance to its private IP; the agent reports 'host' as ip-x-x-x-x
    ec2_info = ec2.describe_instances(InstanceIds=instance_ids)
    instance_ips = {instance['InstanceId']: instance['PrivateIpAddress']
                    for reservation in ec2_info['Reservations']
                    for instance in reservation['Instances']}

    # Look at the last 10 minutes of data
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(minutes=10)

    metrics = []
    for instance_id, private_ip in instance_ips.items():
        response = cw.get_metric_statistics(
            Namespace='CWAgent',
            MetricName='disk_used_percent',
            Dimensions=[
                {'Name': 'host', 'Value': f'ip-{private_ip.replace(".", "-")}'},
                {'Name': 'path', 'Value': '/'},
                {'Name': 'device', 'Value': 'nvme0n1p1'},
                {'Name': 'fstype', 'Value': 'ext4'}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=['Average']
        )
        logger.info(f"Metric response for instance {instance_id} (IP: {private_ip}): {response}")
        if response['Datapoints']:
            metrics.append(response['Datapoints'][0]['Average'])
        else:
            logger.warning(f"No datapoints found for instance {instance_id} (IP: {private_ip})")

    if metrics:
        # Average across all instances, then publish as a single custom metric
        avg_disk_usage = round(sum(metrics) / len(metrics), 2)
        logger.info(f"Average disk usage calculated: {avg_disk_usage}%")
        cw.put_metric_data(
            Namespace='CustomASGMetrics',
            MetricData=[
                {
                    'MetricName': 'AverageDiskUsagePercent',
                    'Dimensions': [
                        {'Name': 'AutoScalingGroupName', 'Value': asg_name},
                    ],
                    'Value': avg_disk_usage,
                    'Unit': 'Percent'
                },
            ]
        )
        return {
            'statusCode': 200,
            'body': f'Published average disk usage: {avg_disk_usage}%'
        }
    else:
        logger.warning("No metrics data found for any instances")
        return {
            'statusCode': 200,
            'body': 'No disk usage data available for any instances'
        }
```
After we've created the function, we need to give it the right permissions. Think of this as giving your assistant the right security badges:
Go to IAM
Find your listFetchCalAvgPub-xxxxxx role
Add these crucial permissions:
AmazonEC2ReadOnlyAccess
AutoScalingReadOnlyAccess
CloudWatchFullAccessV2
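If you prefer to script this instead of clicking through IAM, the same attach_role_policy call from Step 1 works here too; a quick sketch (the generated role name suffix will differ in your account):

```python
import boto3

iam = boto3.client("iam")

# Attach the three managed policies to the Lambda's execution role.
for policy in ("AmazonEC2ReadOnlyAccess",
               "AutoScalingReadOnlyAccess",
               "CloudWatchFullAccessV2"):
    iam.attach_role_policy(
        RoleName="listFetchCalAvgPub-xxxxxx",  # placeholder: your role's name
        PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
    )
```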
Step 3: Setting Up Our Schedule ⏰
Now we need to make sure our function runs regularly. We'll use EventBridge (formerly CloudWatch Events) to run our function every 5 minutes. It's like setting up an extremely reliable alarm clock.
Ready to set up the schedule? I'll walk you through creating the EventBridge trigger next.
Pro tip: The 5-minute interval is a sweet spot between getting timely data and not overwhelming your CloudWatch metrics. In production, you might want to adjust this based on how quickly your disk space typically fills up.
Okay, let's set up that reliable scheduling system and then move on to the really exciting part - the automatic scaling trigger!
Here's how to set up your schedule (it's actually pretty straightforward):
Go to the EventBridge console
Find Scheduler: Schedules in the navigation
Click Create schedule and follow this exact setup:
Name: "trigger-disk-usage-lambda" (or something equally descriptive)
Schedule pattern: "Recurring schedule"
Schedule type: "Rate-based schedule"
Rate: 5 minutes
Flexible time window: OFF (we want precision!)
Start time: Set it 1 minute in the future
🎯 Target setup:
Choose "AWS Lambda Invoke"
Select your listFetchCalAvgPub function
Leave payload empty
Let it create a new execution role
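Prefer to script the schedule? Here's a minimal boto3 sketch of the same setup (both ARNs are placeholders, and the execution role must allow scheduler.amazonaws.com to invoke your function):

```python
import boto3

scheduler = boto3.client("scheduler")

# Recreate the console setup above: rate(5 minutes), no flex window,
# invoking the collector Lambda.
scheduler.create_schedule(
    Name="trigger-disk-usage-lambda",
    ScheduleExpression="rate(5 minutes)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:us-east-1:<account-id>:function:listFetchCalAvgPub",
        "RoleArn": "arn:aws:iam::<account-id>:role/scheduler-invoke-role",  # placeholder
    },
)
```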
Step 4: Building Our Scaling Trigger 🚀
Now for the exciting part: the ebs-scaling-trigger-asg Lambda function. This is like the commander that orders your EBS volumes to grow when needed. Here's the intelligent code that makes it happen:
This function is super smart - it does several critical things:
Identifies instances that are running out of space
Makes intelligent decisions about when to scale
Triggers the actual scaling process through Step Functions
```python
import json
import boto3
from datetime import datetime, timedelta

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, timedelta)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

def get_instances_exceeding_threshold(asg_name, threshold):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')
```
This initializes our AWS toolset.
Here's where it gets really clever:
```python
    high_usage_instances = []
    for instance in instances:
        # ... resolve the instance's private IP and build query_params
        # (see the full code below) ...
        response = cw.get_metric_statistics(**query_params)
        print(f"Metric response for {instance_id}: {json.dumps(response, default=json_serial)}")
        if response['Datapoints']:
            latest_datapoint = max(response['Datapoints'], key=lambda x: x['Timestamp'])
            if latest_datapoint['Average'] >= threshold:
                high_usage_instances.append(instance_id)
                print(f"Instance {instance_id} exceeds threshold: {latest_datapoint['Average']}%")
            else:
                print(f"Instance {instance_id} below threshold: {latest_datapoint['Average']}%")
        else:
            print(f"No datapoints found for instance {instance_id}")
    return high_usage_instances
```
```python
def lambda_handler(event, context):
    print("Received event:", json.dumps(event))
    asg_name = 'EndowdAutoScalingGroup'
    threshold = 80
    high_usage_instances = get_instances_exceeding_threshold(asg_name, threshold)

    sf_client = boto3.client('stepfunctions')
    sf_arn = 'arn:aws:states:us-east-1:<account-id>:stateMachine:ebs-auto-scaling-asg'

    for instance_id in high_usage_instances:
        try:
            response = sf_client.start_execution(
                stateMachineArn=sf_arn,
                input=json.dumps({'instance_id': instance_id})
            )
            print(f"Started scaling process for instance {instance_id}: {json.dumps(response, default=json_serial)}")
        except Exception as e:
            print(f"Error starting step function for instance {instance_id}: {str(e)}")
```
This code is like having a smart assistant (lots of assistants, right? 😄) that:
Checks each instance individually
Gets detailed information about its current state
Decides if it needs attention
Runs the Step Function for each instance exceeding the threshold
Full code:
```python
import json
import boto3
import traceback
from datetime import datetime, timedelta

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, timedelta)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

def get_instances_exceeding_threshold(asg_name, threshold):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    print(f"Checking ASG: {asg_name} for instances exceeding {threshold}%")
    try:
        response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        instances = response['AutoScalingGroups'][0]['Instances']
        print(f"Found {len(instances)} instances in the ASG")
    except Exception as e:
        print(f"Error getting ASG instances: {str(e)}")
        return []

    high_usage_instances = []
    for instance in instances:
        instance_id = instance['InstanceId']
        try:
            # Resolve the private IP so we can match the agent's 'host' dimension
            ec2_info = ec2.describe_instances(InstanceIds=[instance_id])
            private_ip = ec2_info['Reservations'][0]['Instances'][0]['PrivateIpAddress']
            print(f"Checking instance {instance_id} with IP {private_ip}")

            query_params = {
                'Namespace': 'CWAgent',
                'MetricName': 'disk_used_percent',
                'Dimensions': [
                    {'Name': 'host', 'Value': f"ip-{private_ip.replace('.', '-')}"},
                    {'Name': 'path', 'Value': '/'},
                    {'Name': 'device', 'Value': 'nvme0n1p1'},
                    {'Name': 'fstype', 'Value': 'ext4'}
                ],
                'StartTime': datetime.utcnow() - timedelta(minutes=30),
                'EndTime': datetime.utcnow(),
                'Period': 300,
                'Statistics': ['Average']
            }
            print(f"CloudWatch query params: {json.dumps(query_params, default=json_serial)}")

            response = cw.get_metric_statistics(**query_params)
            print(f"Metric response for {instance_id}: {json.dumps(response, default=json_serial)}")

            if response['Datapoints']:
                # Use the most recent datapoint for the threshold check
                latest_datapoint = max(response['Datapoints'], key=lambda x: x['Timestamp'])
                if latest_datapoint['Average'] >= threshold:
                    high_usage_instances.append(instance_id)
                    print(f"Instance {instance_id} exceeds threshold: {latest_datapoint['Average']}%")
                else:
                    print(f"Instance {instance_id} below threshold: {latest_datapoint['Average']}%")
            else:
                print(f"No datapoints found for instance {instance_id}")
        except Exception as e:
            print(f"Error processing instance {instance_id}: {str(e)}")
            print(traceback.format_exc())

    return high_usage_instances

def lambda_handler(event, context):
    print("Received event:", json.dumps(event))
    asg_name = 'EndowdAutoScalingGroup'
    threshold = 80

    high_usage_instances = get_instances_exceeding_threshold(asg_name, threshold)

    sf_client = boto3.client('stepfunctions')
    sf_arn = 'arn:aws:states:us-east-1:<account-id>:stateMachine:ebs-auto-scaling-asg'

    # Kick off one Step Function execution per instance that needs more space
    for instance_id in high_usage_instances:
        try:
            response = sf_client.start_execution(
                stateMachineArn=sf_arn,
                input=json.dumps({'instance_id': instance_id})
            )
            print(f"Started scaling process for instance {instance_id}: {json.dumps(response, default=json_serial)}")
        except Exception as e:
            print(f"Error starting step function for instance {instance_id}: {str(e)}")
            print(traceback.format_exc())

    return {
        'statusCode': 200,
        'body': json.dumps(f'Started scaling process for {len(high_usage_instances)} instances')
    }
```
🔐 Don't forget the permissions! Add these to your function's IAM role:
AmazonEC2ReadOnlyAccess
AutoScalingReadOnlyAccess
AWSStepFunctionsFullAccess
Step 5: Setting Up the Alarm System 🚨
Now we need to create a CloudWatch Alarm that will be our early warning system. This is crucial - it's what triggers our scaling function before things get critical.
Let me guide you through the CloudWatch Alarm configuration and then move on to the final step - the Step Function that orchestrates the actual volume resizing!
This CloudWatch Alarm is like having a super-attentive assistant that never sleeps. Here's how to set it up:
Head to CloudWatch -> All alarms
Click "Create alarm" and follow these crucial steps:
Select the CustomASGMetrics namespace
Drill down to AutoScalingGroupName
Pick your ASG name
Now for the smart part - configure these settings:
Statistic: Average (smooths out spikes)
Period: 1 minute (quick reaction time)
Threshold type: Static
Condition: greater than/equal to
Threshold: Pick between 50-90% (I recommend 80% - gives you breathing room while avoiding false alarms)
Finally, add an alarm action that invokes your ebs-scaling-trigger-asg function (either directly as a Lambda alarm action, or via an SNS topic), so the alarm actually kicks off the scaling trigger.
🧠 Pro tip: The threshold choice is crucial. Too low (50%) triggers unnecessary scaling; too high (90%) might not give you enough reaction time. 80% is usually the sweet spot.
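And if you'd rather define the alarm in code, here's a sketch of the equivalent put_metric_alarm call (the trigger function's ARN is a placeholder; if direct Lambda alarm actions aren't available in your setup, point the action at an SNS topic that invokes the function instead):

```python
import boto3

cw = boto3.client("cloudwatch")

# Alarm on the ASG-wide average published by our collector Lambda.
cw.put_metric_alarm(
    AlarmName="asg-avg-disk-usage-high",
    Namespace="CustomASGMetrics",
    MetricName="AverageDiskUsagePercent",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "EndowdAutoScalingGroup"}],
    Statistic="Average",
    Period=60,                      # 1-minute period, as configured above
    EvaluationPeriods=1,
    Threshold=80,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:lambda:us-east-1:<account-id>:function:ebs-scaling-trigger-asg"],
)
```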
Step 6: The Grand Finale - Step Functions 🎉
This is where everything comes together! The Step Function is like an automation orchestra conductor, ensuring each scaling operation happens in perfect sequence.
Your Step Function will:
Take the instance ID from our trigger
Execute a series of precise steps to grow the EBS volume
Ensure the filesystem is properly extended
Verify everything worked correctly
Remember: The Step Function from the previous article is crucial here - it's the actual engine that performs the resizing operations.
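For readers who haven't seen that article: stripped of the Step Function's retries and verification states, the core of the resize boils down to something like this sketch (the device and filesystem names are assumptions matching the dimensions we queried earlier, and a real implementation should wait for the volume modification to complete before extending):

```python
import boto3

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

def grow_root_volume(instance_id, extra_gb=10):
    # Find the volume attached to the instance.
    vol = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"][0]

    # Grow the EBS volume online.
    ec2.modify_volume(VolumeId=vol["VolumeId"], Size=vol["Size"] + extra_gb)

    # Extend the partition and ext4 filesystem in-place over SSM.
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "sudo growpart /dev/nvme0n1 1",
            "sudo resize2fs /dev/nvme0n1p1",
        ]},
    )
```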
So now we have a complete, intelligent system that:
Constantly monitors disk usage across your entire ASG
Automatically detects when any instance needs more space
Triggers a precise, automated scaling operation
Handles all the complex volume and filesystem operations without human intervention
🎯 The end result? Your EBS volumes now scale themselves automatically, exactly when needed, without any manual intervention. It's like having a self-healing infrastructure!
A few quick pro tips before we wrap up:
Always test this in a non-production environment first
Monitor the CloudWatch logs initially to ensure everything's working as expected
Consider setting up SNS notifications for successful scaling operations (see the sketch below)
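For that last tip, a minimal sketch: have the scaling Lambda (or a final Step Function state) publish to an SNS topic once a resize kicks off (the topic ARN is a placeholder):

```python
import boto3

sns = boto3.client("sns")

# Let the team know a resize happened; topic ARN is a placeholder.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:<account-id>:ebs-scaling-events",
    Subject="EBS auto-scaling triggered",
    Message="Started volume resize for an instance in EndowdAutoScalingGroup",
)
```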