Automating AWS EC2 CloudWatch Agent Monitoring & Email Alerting with Lambda and CDK


Ensuring that the CloudWatch Agent is installed and running on all EC2 instances is crucial for complete observability. In this guide, we’ll explore why monitoring the CloudWatch Agent matters, why you should alert on missing/stopped agents, and how to implement an automated checker using an AWS Lambda function. We’ll then walk through deploying the solution with AWS CDK (TypeScript) for a repeatable setup. The language will be friendly and the approach hands-on, so you can follow along easily.
Why CloudWatch Agent Monitoring is Important for EC2
AWS EC2 instances by default only send a limited set of metrics to CloudWatch (CPU, network, etc.). The CloudWatch Agent extends this by collecting additional system metrics and logs from inside the instance. For example, the CloudWatch Agent can report memory usage, disk utilization, detailed OS metrics, and even custom application metrics, none of which are available through default EC2 monitoring. It also streams system logs (and custom log files) to CloudWatch Logs for centralized logging.
By running the CloudWatch Agent on your servers, you get a much more comprehensive view of instance health. Memory consumption, disk space, swap usage, and application logs are critical for diagnosing issues, and the CloudWatch Agent gathers these with minimal effort. In short, CloudWatch Agent monitoring is vital because it ensures that you’re not flying blind on important metrics and logs that go beyond the basic EC2 data.
Why You Need Alerts for Missing or Stopped Agents
If the CloudWatch Agent is missing or stopped on an EC2 instance, you effectively lose visibility into that instance’s detailed metrics and logs. This is a serious blind spot: imagine a scenario where an application is consuming all memory on a server, but you have no CloudWatch metrics or logs to alert you because the agent that collects them isn’t running. By the time you realize there’s a problem, it might be too late to prevent an outage.
Setting up alerts for CloudWatch Agent status ensures that you are notified as soon as an agent isn’t running when it should be. This proactive alerting allows your team to quickly remediate the issue (install or restart the agent) before it impacts monitoring or operations. It’s essentially monitoring your monitoring – a safety net that catches misconfigurations or failures in the telemetry pipeline. In real-world use, this kind of alert can save hours of troubleshooting during incidents, because you’ll immediately know if lack of metrics is due to an agent issue. It also helps maintain compliance with any internal policies that all instances must have monitoring active. The bottom line: if the CloudWatch Agent stops, you want to know right away so you can fix it and restore full visibility.
Building an Automated CloudWatch Agent Status Checker (Lambda + SSM)
To automatically detect and alert on CloudWatch Agent issues, we’ll build a Python AWS Lambda function that runs on a schedule. This function will use AWS Systems Manager (SSM) to remotely check each EC2 instance and verify the CloudWatch Agent’s status. If the agent is not installed or not running on any instance, the Lambda will send an email alert (via Amazon SES) in Markdown format summarizing the problem. Here’s how it works:
Discovering EC2 Instances via SSM
First, the Lambda function needs to know which EC2 instances to check. We leverage AWS Systems Manager for this, since SSM can enumerate instances that have the SSM Agent running. Using the SSM API describe_instance_information (via its boto3 equivalent), the Lambda can list all managed instances. We typically filter this to instances that are currently online with SSM (PingStatus = “Online”). This ensures we target only instances that are up and have the SSM agent available to run commands. You could also filter by tags (for example, only check instances with a specific tag like Monitoring=true if you don’t want to cover every instance), but the key is that SSM gives us a reliable inventory of instances to probe.
In Python (boto3), this might look like:
```python
import boto3

ssm = boto3.client('ssm')

# Get all online managed instances
response = ssm.describe_instance_information(
    Filters=[{'Key': 'PingStatus', 'Values': ['Online']}]
)
instances = [info['InstanceId'] for info in response['InstanceInformationList']]
```
This collects the list of EC2 instance IDs that we will check. (We assume the SSM agent is installed on your instances – which is true for most modern AWS Linux/Windows AMIs – otherwise SSM can’t run commands on them.)
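Note that describe_instance_information pages its results, so a single call may miss instances in larger fleets. A paginator-based sketch (the helper name list_online_instances is our own; it takes a boto3 SSM client as an argument) handles that:

```python
def list_online_instances(ssm):
    """Return the IDs of all SSM-managed instances that are online.

    `ssm` is a boto3 SSM client; using a paginator ensures we see every
    page of results, not just the first.
    """
    instance_ids = []
    paginator = ssm.get_paginator('describe_instance_information')
    for page in paginator.paginate(
        Filters=[{'Key': 'PingStatus', 'Values': ['Online']}]
    ):
        for info in page['InstanceInformationList']:
            instance_ids.append(info['InstanceId'])
    return instance_ids
```

Passing the client in also makes the function easy to unit-test with a stub.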
Checking CloudWatch Agent Status with SSM Run Command
For each instance, the Lambda uses SSM Run Command to execute a pre-built document called “AmazonCloudWatch-ManageAgent”. AWS provides this Systems Manager document to manage the CloudWatch Agent (install, configure, or query its status). We’ll invoke it with the “status” action, which tells the agent to report its current status. In effect, this SSM command asks the CloudWatch Agent (via the SSM agent on the instance) whether it’s running, and if so, returns details like the running status and version.
The Lambda uses ssm.send_command for this. For example:
```python
cmd_response = ssm.send_command(
    InstanceIds=[instance_id],
    DocumentName='AmazonCloudWatch-ManageAgent',
    Parameters={
        'action': ['status'],
        'mode': ['ec2']
    }
)
command_id = cmd_response['Command']['CommandId']
```
A few notes on this command: We specify the action as “status” and mode “ec2” (since these are EC2 instances, not on-premises). We target one instance at a time here by ID (you could target multiple in one command, but handling results is simpler per instance). The response gives us a CommandId which we’ll use to retrieve the execution output.
Under the hood, the AmazonCloudWatch-ManageAgent document will run the amazon-cloudwatch-agent-ctl command on the instance to get the agent status. If the CloudWatch Agent is running, the output will be a small JSON snippet indicating "status": "running" along with the start time and version. If the agent is stopped (not running), the JSON will say "status": "stopped". In cases where the agent is not installed at all, the SSM command might report an error or simply that the service isn’t running (which is effectively the same outcome: it’s not running). We’ll handle those cases as “not running” as well, since either way the instance isn’t being monitored by the agent.
Retrieving and Parsing the Command Results
SSM Run Command is asynchronous, so after sending the command we need to retrieve the results. We use ssm.get_command_invocation with the Command ID and instance ID to get the output. One important detail here: the AmazonCloudWatch-ManageAgent document may consist of multiple steps/plugins internally, so we should specify the Plugin Name corresponding to the status action when fetching the results. Otherwise, the API might throw an “InvalidPluginName” error if it doesn’t know which step’s output to return. In our case, the plugin (step) name is “status” (since we invoked the status action).
So, the Lambda will do something like:
```python
# Wait until the command has finished before reading its output.
# (The waiter raises a WaiterError if the command itself fails,
#  which we can treat as "agent not running".)
waiter = ssm.get_waiter('command_executed')
waiter.wait(CommandId=command_id, InstanceId=instance_id)

result = ssm.get_command_invocation(
    CommandId=command_id,
    InstanceId=instance_id,
    PluginName='status'  # specify the 'status' step output
)
output_text = result.get('StandardOutputContent', '')
```
The StandardOutputContent will contain the JSON string output from the agent status command. For example, it might be:
{ "status": "running", "starttime": "2025-04-01T12:00:00", "version": "1.300257.0" }
We parse this JSON in the Lambda (e.g., using Python’s json.loads) to easily inspect the fields:
```python
import json

if output_text:
    data = json.loads(output_text)
    agent_status = data.get('status', 'unknown')
else:
    agent_status = 'unknown'
```
Now, for each instance we have agent_status which will be "running" if the CloudWatch Agent is OK. If the agent is stopped or not installed, we might get "stopped" or no output. We treat any status other than “running” as a problem that needs alerting. (If the SSM command itself failed to execute, we also consider that as the agent not running, since we couldn’t confirm it’s active.)
We can also grab the agent version from the output (the version field) if we want to include it in the report. This could be useful to see what version is running or if an outdated version might be an issue.
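Putting that parsing logic together, a small helper (the name parse_agent_status is our own) can turn the raw SSM output into a (status, version) pair, treating anything empty or unreadable as a problem:

```python
import json

def parse_agent_status(output_text):
    """Parse the JSON emitted by the 'status' action into (status, version).

    Empty or unparsable output is mapped to 'unknown', which we later
    report as a problem just like 'stopped'.
    """
    if not output_text:
        return 'unknown', None
    try:
        data = json.loads(output_text)
    except ValueError:
        return 'unknown', None
    return data.get('status', 'unknown'), data.get('version')
```

For example, `parse_agent_status('{"status": "running", "version": "1.300257.0"}')` returns `('running', '1.300257.0')`, while a failed or empty invocation yields `('unknown', None)`.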
Formatting the Markdown Alert and Sending Email via SES
Once the Lambda has checked all instances, it will compile a list of any instances that need attention (i.e., where the CloudWatch Agent is missing or stopped). The alert email will be composed in Markdown format for clarity. For example, the message body might look like:
- **i-0123456789abcdef (WebServer1)** – CloudWatch Agent is **STOPPED** (not running)
- **i-0fedcba9876543210 (DatabaseServer)** – CloudWatch Agent is **NOT INSTALLED**
Each bullet highlights the instance (by ID, plus its Name tag if we fetch it via the EC2 API for friendliness) and the issue. We use bold text and other Markdown features to make the email easy to read. In our Python code, we assemble this as a string of newline-separated `-` list items.
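As a sketch, assuming we have collected per-instance results into a dict (the structure and the helper name build_problem_lines below are our own), the bullet lines could be built like this:

```python
def build_problem_lines(results):
    """Build Markdown bullet lines for instances whose agent is not running.

    `results` maps instance ID to a dict with 'name' (the Name tag, or
    None) and 'status' (the parsed agent status string).
    """
    lines = []
    for instance_id, info in sorted(results.items()):
        if info['status'] == 'running':
            continue  # healthy instance, nothing to report
        label = f"{instance_id} ({info['name']})" if info['name'] else instance_id
        issue = 'STOPPED' if info['status'] == 'stopped' else 'NOT RUNNING'
        lines.append(f"- **{label}** – CloudWatch Agent is **{issue}**")
    return lines
```

An empty return list means every agent reported “running”, in which case the Lambda can simply skip sending the email.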
Finally, the Lambda uses Amazon SES to send out the email. We can use the ses.send_email API, specifying the From address (which must be a verified SES identity) and the To address(es) for the recipients. We put our markdown-formatted message in the email body. Typically, we send it as a simple text email (many email clients won’t render Markdown, but the formatting ensures it’s still human-readable). Optionally, we could convert the Markdown to HTML and send an HTML email for nicer formatting, but that adds complexity – sending it as plain text Markdown is straightforward and effective.
For example:
```python
ses = boto3.client('ses')

email_body = (
    "## CloudWatch Agent Alert\n"
    "The following instances have issues:\n"
    + "\n".join(problem_lines)
)
ses.send_email(
    Source=ALERT_FROM_ADDRESS,
    Destination={'ToAddresses': [ALERT_TO_ADDRESS]},
    Message={
        'Subject': {'Data': '⚠️ AWS CloudWatch Agent Alert'},
        'Body': {'Text': {'Data': email_body}}
    }
)
```
In the above snippet, problem_lines is a list of strings like the bullet points shown earlier. We included a warning emoji in the subject for visibility, and used a Markdown header “## CloudWatch Agent Alert” in the body as a title. You can customize the content as you see fit (include timestamps, agent versions, suggestions to reinstall, etc.).
Note: Before the Lambda can actually send emails, you’ll need to verify the sender email (or domain) in SES and possibly the recipient as well (if your SES is in sandbox mode). We’ll touch on that in the deployment steps, but it’s an important prerequisite to avoid email delivery issues.
With the Lambda function logic explained, let’s move on to deploying this setup using AWS CDK for a clean, infrastructure-as-code deployment.
Step-by-Step Deployment with AWS CDK (TypeScript)
We will use the AWS Cloud Development Kit (CDK) in TypeScript to deploy the Lambda function, its scheduling, and the necessary permissions. This allows us to define the entire stack in code and easily repeat it in different environments. Below are the main steps:
Define the Lambda Function and Code: In your CDK application, create a Lambda function resource. For example, use new lambda.Function(...) in your Stack, specifying the runtime (Python 3.x), handler (the entry point in your Python code), and code (pointing to the directory or file with your Lambda code). Include any necessary environment variables for the function. Common env vars might be the ALERT_EMAIL_TO (recipient address) and ALERT_EMAIL_FROM (sender address), and perhaps a filter tag for instances if you want that configurable. For instance:
```typescript
const monitorFn = new lambda.Function(this, 'AgentMonitorFunction', {
  runtime: lambda.Runtime.PYTHON_3_9,
  handler: 'index.handler',
  code: lambda.Code.fromAsset(path.join(__dirname, '../lambda')), // your code directory
  environment: {
    ALERT_EMAIL_TO: 'ops-team@example.com',
    ALERT_EMAIL_FROM: 'no-reply@mycompany.com'
    // ... any other configuration
  }
});
```
Make sure the ALERT_EMAIL_FROM is an address or domain verified in SES. (You can verify emails via the SES console or CLI; CDK won’t auto-verify it for you.)
Assign IAM Permissions to the Lambda: The function needs permissions to use SSM, EC2 (optional), and SES. You can attach these permissions by adding IAM policy statements or using managed policies:
SSM Permissions: Allow actions like ssm:DescribeInstanceInformation, ssm:SendCommand, and ssm:GetCommandInvocation. You can scope the SendCommand permission to the specific SSM document ARN for AmazonCloudWatch-ManageAgent if you like, or use a broader permission (for simplicity, many will just allow ssm:* on resources *, but least privilege is recommended). These let the Lambda list instances and execute the status commands.
EC2 Permissions: If your code looks up EC2 instance tags (e.g., to get the Name tag for friendlier alerts), allow ec2:DescribeInstances (or ec2:DescribeTags). This is optional but useful for enriching alert info.
SES Permissions: Allow ses:SendEmail (or ses:SendRawEmail) on your SES identity. You can scope it to the Resource of your SES identity ARN. This permission enables the Lambda to actually send the email.
In CDK, you can attach a policy like:
```typescript
monitorFn.addToRolePolicy(new iam.PolicyStatement({
  actions: [
    "ssm:DescribeInstanceInformation",
    "ssm:SendCommand",
    "ssm:GetCommandInvocation",
    "ec2:DescribeInstances",
    "ses:SendEmail"
  ],
  resources: ["*"]
}));
```
Here we grant access to the necessary actions across all resources for brevity. In a production environment, tighten the resources scope if possible (for example, restrict SES to your specific identity ARN, and SSM to the target instance ARNs or the document name). The Lambda’s execution role now has the needed powers to do its job.
Schedule the Lambda with EventBridge (CloudWatch Events): We want the Lambda to run periodically (for example, once a day or every hour, depending on how quickly you want to catch issues). In CDK, create an EventBridge Rule to trigger the Lambda on a schedule. For example:
```typescript
const rule = new events.Rule(this, 'ScheduleRule', {
  schedule: events.Schedule.cron({ minute: '0', hour: '*/6' }) // every 6 hours, for instance
});
rule.addTarget(new targets.LambdaFunction(monitorFn));
```
This will invoke our monitorFn Lambda on the defined schedule (here it’s every 6 hours; you can adjust cron or use Schedule.rate(Duration.days(1)) for daily, etc.). CDK will handle the permissions so EventBridge can invoke the Lambda. By scheduling it, we ensure the CloudWatch Agent check runs regularly without human intervention.
Deploy and Verify: Synthesize and deploy the CDK stack (cdk deploy). Once deployed, check the AWS Console:
Verify that the Lambda function is created, and the environment variables are set correctly.
Verify that the EventBridge rule is in place and targeting the Lambda.
In the SES Console, make sure the ALERT_EMAIL_FROM address (or its domain) is verified (you should have done this before deployment or you can do it now). If you are in the SES sandbox, also verify the ALERT_EMAIL_TO recipient or move out of sandbox to send to arbitrary emails.
You can run a quick test by manually invoking the Lambda (e.g., via the Lambda console or CLI) to see if it sends an email. Check your inbox (including spam) for the alert message. It might say that all instances are fine (if none were stopped) or list any issues it found.
Operational Considerations: After deployment, your automated monitoring is in place. Going forward, whenever the CloudWatch Agent is not running on an instance, the Lambda will fire an email alert to your team. Ensure that your team knows how to respond (e.g., reinstall or start the agent on the affected instance). You might also consider integrating the alert with a ticketing system or an SNS topic (instead of direct SES emails) if that suits your operations better. The solution is highly customizable – for example, you could extend the Lambda to automatically attempt to bring the agent back up by running the same SSM document with the start action when it detects an issue, in addition to sending an alert.
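A minimal sketch of that self-healing extension (the helper name try_start_agent is our own; it reuses the AmazonCloudWatch-ManageAgent document, this time with the start action):

```python
def try_start_agent(ssm, instance_id):
    """Ask SSM to start the CloudWatch Agent on an instance.

    Reuses the same AmazonCloudWatch-ManageAgent document as the status
    check, but with the 'start' action. Returns the command ID so the
    caller can poll the outcome with get_command_invocation.
    """
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName='AmazonCloudWatch-ManageAgent',
        Parameters={
            'action': ['start'],
            'mode': ['ec2']
        }
    )
    return response['Command']['CommandId']
```

You would call this only for instances whose status check came back as anything other than “running”, and still send the alert so the team knows an automatic start was attempted.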
Conclusion
By implementing this automated monitoring, you gain peace of mind that your CloudWatch Agents are continuously monitored just like the rest of your infrastructure. The Lambda + SSM approach effectively asks each instance “Hey, is your CloudWatch Agent OK?” on a schedule, and immediately notifies you if the answer is no. This proactive alerting brings real-world benefits: you’ll catch missing or crashed agents early, before they lead to missing metrics or logs during a crucial moment. In practice, this means more reliable monitoring, faster troubleshooting, and a more robust AWS environment.
In summary, we covered why the CloudWatch Agent is important for EC2 monitoring and why you should alert on any gaps. We then built a Lambda function that checks agent status using SSM (leveraging the same AWS-recommended commands you could run manually) and sends out markdown-styled email reports. Finally, we deployed the whole stack using AWS CDK, making it easy for DevOps and platform engineers to set up in their own accounts.
Real-world motivation: Think of this as a watchdog for your watchdog. It’s a simple investment of time that can pay off big by ensuring your monitoring infrastructure remains healthy. No one wants to discover during an outage that the reason you have no metrics is because the monitoring agent was down – with this solution, such surprises are a thing of the past. By implementing CloudWatch Agent alerting, you’re moving your ops culture toward one of preventative monitoring and greater reliability.
Written by Pawan Sawalani