From Simple Script to Production-Ready Automation: Building an AWS EC2 Scheduler

Md Sharjil Alam
6 min read

Let’s talk about a common AWS nightmare. You spin up a few EC2 instances for a development project on Monday. You get busy. You forget to turn them off. The next month, you get a surprise on your AWS bill: you’ve been paying for servers that sat idle 70% of the time.

What if we could automate this? What if we could ensure our instances are only running when we need them?

This is a classic use case for serverless automation on AWS. In this article, we’ll build a solution using AWS Lambda and EventBridge to automatically start and stop our EC2 instances on a schedule. More importantly, we’ll explore the difference between a quick, simple script and a scalable, professional, production-ready solution.

The Simple (But Flawed) Approach

When you first think of this problem, the easiest solution comes to mind: write a script that knows which instances to stop.

You might write a simple Python function for AWS Lambda like this:

A simple stop_instances.py:

import boto3

# Hardcoded values
region = 'us-west-1'
instances = ['i-0123456789abcdef0', 'i-08ce9b2d7eccf6d26']

ec2 = boto3.client('ec2', region_name=region)

def lambda_handler(event, context):
    ec2.stop_instances(InstanceIds=instances)
    print('Stopped instances: ' + str(instances))

This works. If you trigger this function, it will stop those two specific instances.

But this approach has serious problems:

  • It’s Brittle: What happens when you terminate one of those instances and create a new one? You have to find this code, change the instance ID, and redeploy your Lambda function.

  • It Doesn’t Scale: What if you have 20 instances to manage? Or 200? A hardcoded list becomes a maintenance nightmare.

  • It’s Not Configurable: The instance IDs and region are hardcoded directly in the script.

We can do much, much better.

The Professional, Tag-Based Solution

Instead of telling our function which specific instances to stop, we’ll use a more elegant approach: resource tagging. We’ll simply “mark” the instances we want to manage with a tag, and our Lambda function will be smart enough to find them.

This creates a decoupled and scalable system. You never have to touch the code again; you just manage the tags on your instances.

The Architecture

Our professional solution uses a few core AWS services working together:

  1. Amazon EventBridge: This is our scheduler. We’ll create two simple cron-based rules (e.g., “run at 7 PM every weekday” and “run at 8 AM every weekday”).

  2. AWS Lambda: This is our engine. We’ll have two functions — one for stopping instances and one for starting them. They will contain our smart, tag-aware Python code. (If you’re new to AWS Lambda, you can learn the basics by following my practical guide to your first serverless function here.)

  3. IAM (Identity and Access Management): This is our security. We’ll create a role with the exact permissions our Lambda function needs and nothing more (the Principle of Least Privilege).

  4. Amazon CloudWatch: This is our dashboard. Every print statement and log from our Lambda function goes here, making it easy to see what’s happening and debug any issues.

The Code: Built for the Real World

Here is the code for our stop_instances.py function. It looks more complex, but every line adds a layer of professionalism and reliability.

import boto3
import os
import logging
from botocore.exceptions import ClientError

# 1. Professional Logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client('ec2')

# 2. Configurable via Environment Variables
TAG_KEY = os.environ.get('TAG_KEY', 'Auto-Start-Stop')

def lambda_handler(event, context):
    logger.info("Starting function to stop EC2 instances...")

    # 3. Dynamic Filtering with Tags
    filters = [
        {
            'Name': f'tag:{TAG_KEY}',
            'Values': ['True', 'true']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]

    logger.info(f"Searching for instances with these filters: {filters}")

    try:
        reservations = ec2.describe_instances(Filters=filters)
        instances_to_stop = []
        for reservation in reservations['Reservations']:
            for instance in reservation['Instances']:
                instances_to_stop.append(instance['InstanceId'])

        # 4. Handle "No Instances Found" Gracefully
        if not instances_to_stop:
            logger.info("No running instances found matching the filters. Please check your instance tags and region.")
            return {'statusCode': 200, 'body': 'No instances to stop.'}

        logger.info(f"Found instances to stop: {', '.join(instances_to_stop)}")
        ec2.stop_instances(InstanceIds=instances_to_stop)

        success_message = f"Successfully sent stop command for instances: {', '.join(instances_to_stop)}"
        logger.info(success_message)

        return {'statusCode': 200, 'body': success_message}

    # 5. Robust Error Handling
    except ClientError as e:
        logger.error(f"An AWS API error occurred: {e}")
        return {'statusCode': 500, 'body': f"Error stopping instances: {e}"}
    except Exception as e:
        logger.error(f"An unexpected error occurred: {e}")
        return {'statusCode': 500, 'body': f"An unexpected error occurred: {e}"}

The start_instances.py function is very similar, but it looks for instances in the stopped state. You can find the full code for both on my GitHub repository here.
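For reference, here is a minimal sketch of what that start-side handler looks like; it mirrors the stop function, swapping only the state filter and the API call (the repository has the complete version):

import boto3
import os
import logging
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client('ec2')

TAG_KEY = os.environ.get('TAG_KEY', 'Auto-Start-Stop')

def lambda_handler(event, context):
    logger.info("Starting function to start EC2 instances...")

    # Same tag filter as the stop function, but targeting stopped instances
    filters = [
        {'Name': f'tag:{TAG_KEY}', 'Values': ['True', 'true']},
        {'Name': 'instance-state-name', 'Values': ['stopped']}
    ]

    try:
        reservations = ec2.describe_instances(Filters=filters)
        instances_to_start = [
            instance['InstanceId']
            for reservation in reservations['Reservations']
            for instance in reservation['Instances']
        ]

        if not instances_to_start:
            logger.info("No stopped instances found matching the filters.")
            return {'statusCode': 200, 'body': 'No instances to start.'}

        ec2.start_instances(InstanceIds=instances_to_start)
        success_message = f"Successfully sent start command for instances: {', '.join(instances_to_start)}"
        logger.info(success_message)
        return {'statusCode': 200, 'body': success_message}

    except ClientError as e:
        logger.error(f"An AWS API error occurred: {e}")
        return {'statusCode': 500, 'body': f"Error starting instances: {e}"}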

What makes this code professional?

  1. Logging: Instead of simple print() statements, we use Python’s logging module. This provides structured, searchable logs in CloudWatch, which is essential for debugging.

  2. Configuration: We pull the TAG_KEY from an environment variable. This means we can change the tag we’re looking for without touching the code.

  3. Dynamic Filtering: The describe_instances call uses a filter to find instances based on tags and state, which is the core of our scalable solution. (For very large fleets, see the pagination note after this list.)

  4. Graceful Handling: The code checks if it found any instances and logs a helpful message if it didn’t. This prevents errors and makes debugging easier.

  5. Error Handling: The try…except block catches potential API errors and logs them, so the function won’t just crash silently.
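One refinement worth knowing about if you manage a large fleet: describe_instances returns paginated results, so a single call may not see every instance. boto3’s built-in paginator walks all pages for you. A minimal sketch of how the instance-collection loop above could be adapted:

# Sketch: collect matching instance IDs across all result pages
paginator = ec2.get_paginator('describe_instances')
instances_to_stop = []
for page in paginator.paginate(Filters=filters):
    for reservation in page['Reservations']:
        for instance in reservation['Instances']:
            instances_to_stop.append(instance['InstanceId'])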

Step-by-Step Setup Guide

Let’s configure this in the AWS Console.

  1. Create the IAM Role: Navigate to IAM and create a role for Lambda. Attach a policy that allows ec2:DescribeInstances, ec2:StartInstances, ec2:StopInstances, and the CloudWatch Logs write actions (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents), and nothing more.
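A minimal policy along these lines should work (note that ec2:DescribeInstances does not support resource-level restrictions, so it needs a wildcard resource):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:StartInstances",
        "ec2:StopInstances"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}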

  2. Create the Lambda Functions: Go to the Lambda service and create two functions (start-instances and stop-instances). Use the Python 3.9+ runtime, paste in the respective code, and attach the IAM role you just created.

  3. Create the EventBridge Schedules: In EventBridge, create two schedules. (Note that EventBridge cron expressions are evaluated in UTC, so adjust the hours for your time zone.)

  • Stop Schedule: Configure a cron expression (e.g., cron(0 19 ? * MON-FRI *) for 7 PM on weekdays) and set the target to your stop-instances Lambda function.

  • Start Schedule: Configure a cron expression (e.g., cron(0 8 ? * MON-FRI *) for 8 AM on weekdays) and set the target to your start-instances Lambda function.
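If you prefer scripting this step over clicking through the console, here is a rough boto3 sketch for the stop-side rule (the rule name and function ARN below are placeholders; substitute your own):

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder names/ARN for illustration
rule_name = 'stop-instances-weeknights'
function_name = 'stop-instances'
function_arn = 'arn:aws:lambda:us-west-1:123456789012:function:stop-instances'

# Create (or update) the scheduled rule -- 7 PM UTC on weekdays
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression='cron(0 19 ? * MON-FRI *)'
)['RuleArn']

# Point the rule at the Lambda function
events.put_targets(
    Rule=rule_name,
    Targets=[{'Id': 'stop-instances-target', 'Arn': function_arn}]
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId=f'{rule_name}-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)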

  4. Tag Your EC2 Instances: This is the final, easy step. Go to any EC2 instance you want to include in this schedule and add a tag:

  • Key: Auto-Start-Stop

  • Value: True
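You can add the tag in the EC2 console, or script it; a quick boto3 example (the instance ID is a placeholder):

import boto3

ec2 = boto3.client('ec2', region_name='us-west-1')

# Placeholder instance ID -- use your own
ec2.create_tags(
    Resources=['i-0123456789abcdef0'],
    Tags=[{'Key': 'Auto-Start-Stop', 'Value': 'True'}]
)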

That’s it! The system is now fully automated.
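Before trusting the schedule, it’s worth a quick manual smoke test. One way, assuming the function name from step 2, is to invoke the function directly and read its reply:

import boto3
import json

lambda_client = boto3.client('lambda')

# Invoke the stop function synchronously with an empty test event
response = lambda_client.invoke(FunctionName='stop-instances', Payload=b'{}')
print(json.loads(response['Payload'].read()))

You should also see the corresponding log lines appear in the function’s CloudWatch log group.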

Conclusion

We went from a simple, brittle script to a robust, scalable, and maintainable automation solution. By leveraging tags, environment variables, and proper logging, we built a tool that’s truly production-ready. This approach not only saves significant money on your AWS bill but also teaches fundamental principles of good cloud architecture.

If you want to dive deeper, you can check out the full project, including the code for both functions, on my GitHub repository.

Thanks for reading! If you found this helpful, please clap, follow, and join me for more practical cloud and DevOps guides.
