Stop Babysitting Your Storage: My 15-Minute AWS Auto-Scaling Solution

Hey there! 👋 Let me walk you through something incredibly cool - automatic EBS volume resizing for your AWS Auto Scaling Groups. You know that panic when your disk space starts running low? We're going to make that a thing of the past by building a completely automated solution.

Think of this as giving your AWS infrastructure a smart brain that:

  1. Watches your disk space like a hawk 🦅

  2. Spots when things are getting tight

  3. Automatically grows your storage before it becomes a problem

The best part? Once we set this up, it just works. Forever. Let's dive in!

Step 1: Setting Up Our Disk Space Monitor 🔍

First, we need to give our EC2 instances the power to report their disk usage. It's like installing a smart meter in your house - it needs to be there before we can do anything clever with the readings.

SSH into one of your ASG instances:

ssh -i <your-key> user@address

Now, let's install our monitoring agent. Copy these commands exactly:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
sudo systemctl start amazon-cloudwatch-agent

Next, you can create a new image (AMI) from the EC2 instance and add it to your ASG Launch Template as a new version, or you can use AWS Systems Manager (SSM) to install and configure the CloudWatch agent across your entire ASG!
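
If you go the Systems Manager route, here's a rough boto3 sketch of what that can look like. It assumes the instances already run the SSM agent with an instance profile that allows Systems Manager, and the tag filter is just one way to target them - adjust it to your setup:

import boto3

ssm = boto3.client('ssm')

# Install the CloudWatch agent on every instance tagged with the ASG name.
# AWS-ConfigureAWSPackage is a managed SSM document; AmazonCloudWatchAgent is the package it installs.
response = ssm.send_command(
    Targets=[{'Key': 'tag:aws:autoscaling:groupName', 'Values': ['EndowdAutoScalingGroup']}],
    DocumentName='AWS-ConfigureAWSPackage',
    Parameters={'action': ['Install'], 'name': ['AmazonCloudWatchAgent']}
)
print(f"Install command sent: {response['Command']['CommandId']}")

You'd still push an agent config and start the agent afterwards (the AmazonCloudWatch-ManageAgent document can handle that), but this gets the package installed fleet-wide.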

🚨 Pro Tip: Don't forget to attach the CloudWatchAgentServerPolicy to your instance's IAM role! This is what gives the instance permission to send readings back to CloudWatch.

Here's a gotcha that could save you hours of debugging: if you're in a corporate environment and get permission errors while creating IAM roles, start fresh with a new role. Even if a role that threw errors during creation appears to work in the console, it might be silently broken.
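
Before moving on, it's worth confirming the agent is actually publishing. Here's a quick sketch (assuming your AWS credentials and region are configured wherever you run it) that lists the disk metrics in the CWAgent namespace - the dimensions you see should match the host/path/device/fstype values we'll query from Lambda in the next step:

import boto3

cw = boto3.client('cloudwatch')

# List the disk_used_percent metrics reported by the CloudWatch agent
resp = cw.list_metrics(Namespace='CWAgent', MetricName='disk_used_percent')

for metric in resp['Metrics']:
    # Each entry shows the dimension set the agent attached (host, path, device, fstype, ...)
    print({d['Name']: d['Value'] for d in metric['Dimensions']})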

Step 2: Creating Our Data Collector Function 🤖

Let's create a Lambda function that will be our central data collector. Think of it as the brain that:

  1. Keeps track of all your EC2 instances

  2. Gathers their disk usage data

  3. Calculates averages to make smart decisions

Here's where we create listFetchCalAvgPub - a Lambda function that's way smarter than its somewhat awkward name suggests. This is essentially your disk space surveillance system.

Head to AWS Lambda and create a new function with Python 3.9. Here's the code that makes the magic happen:

import boto3
from datetime import datetime, timedelta
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Initialize our AWS toolset
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    # Replace this with your ASG name
    asg_name = 'EndowdAutoScalingGroup'

This initializes our AWS toolset and ASG name.

    response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])

    if not response['AutoScalingGroups']:
        logger.warning(f"No ASG found with name: {asg_name}")
        return {
            'statusCode': 404,
            'body': f'ASG {asg_name} not found'
        }

This code is like a safety net - it makes sure your ASG actually exists before trying to do anything fancy.

Here's where it gets interesting:

    metrics = []
    for instance_id, private_ip in instance_ips.items():
        response = cw.get_metric_statistics(
            Namespace='CWAgent',
            MetricName='disk_used_percent',
            Dimensions=[
                {'Name': 'host', 'Value': f'ip-{private_ip.replace(".", "-")}'},
                {'Name': 'path', 'Value': '/'},
                {'Name': 'device', 'Value': 'nvme0n1p1'},
                {'Name': 'fstype', 'Value': 'ext4'}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=['Average']
        )

This part is crucial - it's checking each instance's disk usage with incredible precision. It's like having a microscope for your storage usage! 🔬

And finally, we calculate the average disk usage metric for the whole ASG! But wait, there's more: we also publish it as a custom metric to CloudWatch.

    if metrics:
        avg_disk_usage = round(sum(metrics) / len(metrics), 2)
        logger.info(f"Average disk usage calculated: {avg_disk_usage}%")

        cw.put_metric_data(
            Namespace='CustomASGMetrics',
            MetricData=[
                {
                    'MetricName': 'AverageDiskUsagePercent',
                    'Dimensions': [
                        {'Name': 'AutoScalingGroupName', 'Value': asg_name},
                    ],
                    'Value': avg_disk_usage,
                    'Unit': 'Percent'
                },
            ]
        )

This function is like having a super-smart assistant that:

  1. Knows about every instance in your ASG

  2. Checks their disk usage

  3. Does the math to figure out if you're heading for trouble

Full code:

import boto3
from datetime import datetime, timedelta
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    asg_name = 'EndowdAutoScalingGroup'
    response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])

    if not response['AutoScalingGroups']:
        logger.warning(f"No ASG found with name: {asg_name}")
        return {
            'statusCode': 404,
            'body': f'ASG {asg_name} not found'
        }

    instances = response['AutoScalingGroups'][0]['Instances']
    if not instances:
        logger.info(f"ASG {asg_name} exists but has no instances")
        return {
            'statusCode': 200,
            'body': f'ASG {asg_name} has no instances'
        }

    instance_ids = [i['InstanceId'] for i in instances]
    logger.info(f"Found {len(instance_ids)} instances in ASG {asg_name}")

    ec2_info = ec2.describe_instances(InstanceIds=instance_ids)
    instance_ips = {instance['InstanceId']: instance['PrivateIpAddress'] 
                for reservation in ec2_info['Reservations'] 
                for instance in reservation['Instances']}

    end_time = datetime.utcnow()
    start_time = end_time - timedelta(minutes=10)


    metrics = []
    for instance_id, private_ip in instance_ips.items():
        response = cw.get_metric_statistics(
            Namespace='CWAgent',
            MetricName='disk_used_percent',
            Dimensions=[
                {'Name': 'host', 'Value': f'ip-{private_ip.replace(".", "-")}'},
                {'Name': 'path', 'Value': '/'},
                {'Name': 'device', 'Value': 'nvme0n1p1'},
                {'Name': 'fstype', 'Value': 'ext4'}
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=300,
            Statistics=['Average']
        )
        logger.info(f"Metric response for instance {instance_id} (IP: {private_ip}): {response}")
        if response['Datapoints']:
            metrics.append(response['Datapoints'][0]['Average'])
        else:
            logger.warning(f"No datapoints found for instance {instance_id} (IP: {private_ip})")

    if metrics:
        avg_disk_usage = round(sum(metrics) / len(metrics), 2)
        logger.info(f"Average disk usage calculated: {avg_disk_usage}%")

        cw.put_metric_data(
            Namespace='CustomASGMetrics',
            MetricData=[
                {
                    'MetricName': 'AverageDiskUsagePercent',
                    'Dimensions': [
                        {'Name': 'AutoScalingGroupName', 'Value': asg_name},
                    ],
                    'Value': avg_disk_usage,
                    'Unit': 'Percent'
                },
            ]
        )
        return {
            'statusCode': 200,
            'body': f'Published average disk usage: {avg_disk_usage}%'
        }
    else:
        logger.warning("No metrics data found for any instances")
        return {
            'statusCode': 200,
            'body': 'No disk usage data available for any instances'
        }

After we've created the function, we need to give it the right permissions. Think of this as giving your assistant the right security badges (if you'd rather script this part, there's a boto3 sketch after the list):

  1. Go to IAM

  2. Find your listFetchCalAvgPub-xxxxxx role

  3. Add these crucial permissions:

    • AmazonEC2ReadOnlyAccess

    • AutoScalingReadOnlyAccess

    • CloudWatchFullAccessV2
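
If you prefer to attach those managed policies from code instead of clicking through IAM, here's a minimal sketch - the role name is a placeholder, so substitute whatever execution role Lambda generated for your function:

import boto3

iam = boto3.client('iam')

role_name = 'listFetchCalAvgPub-role-xxxxxx'  # placeholder: your function's execution role
policies = [
    'arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess',
    'arn:aws:iam::aws:policy/AutoScalingReadOnlyAccess',
    'arn:aws:iam::aws:policy/CloudWatchFullAccessV2',
]

for policy_arn in policies:
    # Attach each AWS managed policy to the Lambda execution role
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)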

Step 3: Setting Up Our Schedule ⏰

Now we need to make sure our function runs regularly. We'll use EventBridge (formerly CloudWatch Events) to run our function every 5 minutes. It's like setting up an extremely reliable alarm clock.

Ready to set up the schedule? I'll walk you through creating the EventBridge trigger next.

Pro tip: The 5-minute interval is a sweet spot between getting timely data and not overwhelming your CloudWatch metrics. In production, you might want to adjust this based on how quickly your disk space typically fills up.

Okay, let's set up that reliable scheduling system and then move on to the really exciting part - the automatic scaling trigger!

Here's how to set up your schedule (it's actually pretty straightforward; if you'd rather script it, there's a boto3 sketch right after this list):

  1. Go to EventBridge console

  2. Find Scheduler: Schedules in the navigation

  3. Click Create schedule and follow this exact setup:

    • Name: "trigger-disk-usage-lambda" (or something equally descriptive)

    • Schedule pattern: "Recurring schedule"

    • Schedule type: "Rate-based schedule"

    • Rate: 5 minutes

    • Flexible time window: OFF (we want precision!)

    • Start time: Set it 1 minute in the future

🎯 Target setup:

  • Choose "AWS Lambda Invoke"

  • Select your listFetchCalAvgPub function

  • Leave payload empty

  • Let it create a new execution role
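
And here's roughly what those settings look like with boto3 and the EventBridge Scheduler API, if you prefer code over console clicks. The function and role ARNs are placeholders - the role just needs permission to invoke your Lambda:

import boto3

scheduler = boto3.client('scheduler')

scheduler.create_schedule(
    Name='trigger-disk-usage-lambda',
    ScheduleExpression='rate(5 minutes)',
    FlexibleTimeWindow={'Mode': 'OFF'},  # no flexible window - we want precision
    Target={
        'Arn': 'arn:aws:lambda:us-east-1:<account-id>:function:listFetchCalAvgPub',
        'RoleArn': 'arn:aws:iam::<account-id>:role/<scheduler-invoke-role>',
    },
)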

Step 4: Building Our Scaling Trigger 🚀

Now for the exciting part - the ebs-scaling-trigger-asg Lambda function. This is like the commander that orders your EBS volumes to grow when needed.

This function is super smart - it does several critical things:

  1. Identifies instances that are running out of space

  2. Makes intelligent decisions about when to scale

  3. Triggers the actual scaling process through Step Functions

import json
import boto3
from datetime import datetime, timedelta

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, timedelta)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

def get_instances_exceeding_threshold(asg_name, threshold):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

This initializes our AWS toolset.

Here's where it gets really clever:

    high_usage_instances = []
    # ... loop over each ASG instance, look up its private IP, and build query_params (full code below)
            response = cw.get_metric_statistics(**query_params)
            print(f"Metric response for {instance_id}: {json.dumps(response, default=json_serial)}")

            if response['Datapoints']:
                latest_datapoint = max(response['Datapoints'], key=lambda x: x['Timestamp'])
                if latest_datapoint['Average'] >= threshold:
                    high_usage_instances.append(instance_id)
                    print(f"Instance {instance_id} exceeds threshold: {latest_datapoint['Average']}%")
                else:
                    print(f"Instance {instance_id} below threshold: {latest_datapoint['Average']}%")
            else:
                print(f"No datapoints found for instance {instance_id}")

    return high_usage_instances

def lambda_handler(event, context):
    print("Received event:", json.dumps(event))

    asg_name = 'EndowdAutoScalingGroup'
    threshold = 80

    high_usage_instances = get_instances_exceeding_threshold(asg_name, threshold)

    sf_client = boto3.client('stepfunctions')
    sf_arn = 'arn:aws:states:us-east-1:<account-id>:stateMachine:ebs-auto-scaling-asg'

    for instance_id in high_usage_instances:
        try:
            response = sf_client.start_execution(
                stateMachineArn=sf_arn,
                input=json.dumps({'instance_id': instance_id})
            )
            print(f"Started scaling process for instance {instance_id}: {json.dumps(response, default=json_serial)}")

This code is like having a smart assistant (lots of assistants, right? 😅) that:

  • Checks each instance individually

  • Gets detailed information about its current state

  • Decides if it needs attention

  • Runs step function for each instance exceeding threshold

Full code:

import json
import boto3
from datetime import datetime, timedelta

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, timedelta)):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

def get_instances_exceeding_threshold(asg_name, threshold):
    ec2 = boto3.client('ec2')
    asg = boto3.client('autoscaling')
    cw = boto3.client('cloudwatch')

    print(f"Checking ASG: {asg_name} for instances exceeding {threshold}%")

    try:
        response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        instances = response['AutoScalingGroups'][0]['Instances']
        print(f"Found {len(instances)} instances in the ASG")
    except Exception as e:
        print(f"Error getting ASG instances: {str(e)}")
        return []

    high_usage_instances = []
    for instance in instances:
        instance_id = instance['InstanceId']
        try:
            ec2_info = ec2.describe_instances(InstanceIds=[instance_id])
            private_ip = ec2_info['Reservations'][0]['Instances'][0]['PrivateIpAddress']
            print(f"Checking instance {instance_id} with IP {private_ip}")

            query_params = {
                'Namespace': 'CWAgent',
                'MetricName': 'disk_used_percent',
                'Dimensions': [
                    {'Name': 'host', 'Value': f"ip-{private_ip.replace('.', '-')}"},
                    {'Name': 'path', 'Value': '/'},
                    {'Name': 'device', 'Value': 'nvme0n1p1'},
                    {'Name': 'fstype', 'Value': 'ext4'}
                ],
                'StartTime': datetime.utcnow() - timedelta(minutes=30),
                'EndTime': datetime.utcnow(),
                'Period': 300,
                'Statistics': ['Average']
            }
            print(f"CloudWatch query params: {json.dumps(query_params, default=json_serial)}")

            response = cw.get_metric_statistics(**query_params)
            print(f"Metric response for {instance_id}: {json.dumps(response, default=json_serial)}")

            if response['Datapoints']:
                latest_datapoint = max(response['Datapoints'], key=lambda x: x['Timestamp'])
                if latest_datapoint['Average'] >= threshold:
                    high_usage_instances.append(instance_id)
                    print(f"Instance {instance_id} exceeds threshold: {latest_datapoint['Average']}%")
                else:
                    print(f"Instance {instance_id} below threshold: {latest_datapoint['Average']}%")
            else:
                print(f"No datapoints found for instance {instance_id}")
        except Exception as e:
            print(f"Error processing instance {instance_id}: {str(e)}")
            import traceback
            print(traceback.format_exc())

    return high_usage_instances

def lambda_handler(event, context):
    print("Received event:", json.dumps(event))

    asg_name = 'EndowdAutoScalingGroup'
    threshold = 80

    high_usage_instances = get_instances_exceeding_threshold(asg_name, threshold)

    sf_client = boto3.client('stepfunctions')
    sf_arn = 'arn:aws:states:us-east-1:<account-id>:stateMachine:ebs-auto-scaling-asg'

    for instance_id in high_usage_instances:
        try:
            response = sf_client.start_execution(
                stateMachineArn=sf_arn,
                input=json.dumps({'instance_id': instance_id})
            )
            print(f"Started scaling process for instance {instance_id}: {json.dumps(response, default=json_serial)}")
        except Exception as e:
            print(f"Error starting step function for instance {instance_id}: {str(e)}")
            import traceback
            print(traceback.format_exc())

    return {
        'statusCode': 200,
        'body': json.dumps(f'Started scaling process for {len(high_usage_instances)} instances')
    }

🔑 Don't forget the permissions! Add these to your function's IAM role:

  • AmazonEC2ReadOnlyAccess

  • AutoScalingReadOnlyAccess

  • AWSStepFunctionsFullAccess

Step 5: Setting Up the Alarm System 🚨

Now we need to create a CloudWatch Alarm that will be our early warning system. This is crucial - it's what triggers our scaling function before things get critical.

Let me guide you through the CloudWatch Alarm configuration and then move on to the final step - the Step Function that orchestrates the actual volume resizing!

This CloudWatch Alarm is like having a super-attentive assistant that never sleeps. Here's how to set it up:

  1. Head to CloudWatch -> All alarms

  2. Click "Create alarm" and follow these crucial steps:

    • Select CustomASGMetrics namespace

    • Drill down to AutoScalingGroupName

    • Pick your ASG name

  3. Now for the smart part - configure these settings:

    • Statistics: average (smooths out spikes)

    • Period: 1 minute (quick reaction time)

    • Threshold type: static

    • Condition: greater than/equal to

    • Threshold: Pick between 50-90% (I recommend 80% - gives you breathing room while avoiding false alarms)

🧠 Pro tip: The threshold choice is crucial. Too low (50%) triggers unnecessary scaling; too high (90%) might not give you enough reaction time. 80% is usually the sweet spot.
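
If you'd rather script the alarm, here's roughly what those console settings translate to in boto3. How the alarm triggers the scaling function is up to you (an SNS topic that invokes the Lambda is the classic route), so the alarm action below is just a placeholder ARN:

import boto3

cw = boto3.client('cloudwatch')

cw.put_metric_alarm(
    AlarmName='asg-average-disk-usage-high',
    Namespace='CustomASGMetrics',
    MetricName='AverageDiskUsagePercent',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'EndowdAutoScalingGroup'}],
    Statistic='Average',
    Period=60,  # 1 minute - quick reaction time
    EvaluationPeriods=1,
    Threshold=80,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:<account-id>:disk-usage-alerts'],  # placeholder action
)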

Step 6: The Grand Finale - Step Functions 🎭

This is where everything comes together! The Step Function is like an automation orchestra conductor, ensuring each scaling operation happens in perfect sequence.

Your Step Function will:

  1. Take the instance ID from our trigger

  2. Execute a series of precise steps to grow the EBS volume

  3. Ensure the filesystem is properly extended

  4. Verify everything worked correctly

Remember: The Step Function from the previous article is crucial here - it's the actual engine that performs the resizing operations.
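
To give you a feel for what that engine does without repeating the previous article, here's a very rough boto3 sketch of the core operations: grow the EBS volume, then tell the instance to grow the partition and filesystem. The extra size, device names and shell commands are assumptions (they match the Ubuntu/ext4/nvme setup we've been monitoring); the real Step Function also handles waiting and error handling properly:

import boto3

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def grow_root_volume(instance_id, extra_gib=10):
    # Find the EBS volume attached to the instance
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'attachment.instance-id', 'Values': [instance_id]}]
    )['Volumes']
    volume = volumes[0]
    new_size = volume['Size'] + extra_gib

    # Ask EBS to grow the volume (this completes asynchronously)
    ec2.modify_volume(VolumeId=volume['VolumeId'], Size=new_size)

    # Tell the OS to grow the partition and the ext4 filesystem
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName='AWS-RunShellScript',
        Parameters={'commands': [
            'sudo growpart /dev/nvme0n1 1',
            'sudo resize2fs /dev/nvme0n1p1',
        ]},
    )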

So now we have a complete, intelligent system that:

  • Constantly monitors disk usage across your entire ASG

  • Automatically detects when any instance needs more space

  • Triggers a precise, automated scaling operation

  • Handles all the complex volume and filesystem operations without human intervention

🎯 The end result? Your EBS volumes now scale themselves automatically, exactly when needed, without any manual intervention. It's like having a self-healing infrastructure!

A few quick pro tips before we wrap up:

  1. Always test this in a non-production environment first

  2. Monitor the CloudWatch logs initially to ensure everything's working as expected

  3. Consider setting up SNS notifications for successful scaling operations (a minimal sketch follows below)
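
For that last tip, the simplest approach is a small SNS publish after each successful start_execution in the trigger function. A minimal sketch - the topic ARN is a placeholder, so create the topic and subscribe your email first:

import boto3

sns = boto3.client('sns')

def notify_scaling_started(instance_id, execution_arn):
    # Placeholder topic ARN - create the topic and subscribe an endpoint (email, etc.) first
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:<account-id>:ebs-scaling-notifications',
        Subject='EBS auto-scaling started',
        Message=f'Started scaling for {instance_id}: {execution_arn}',
    )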
