Proactive Monitoring & Security Auto-Remediation for EC2


Over the past few days, I’ve been following Lucy Wang’s AWS Cloud Project to Become a Cloud Engineer to strengthen my cloud expertise and develop Cloud Support Engineer skills. This is the third project in the series, where I’m sharing my learnings and implementation.
Overview of Project
CloudGuard, a financial services firm, recently suffered a security breach after their operations team failed to identify abnormal system activity in time. This led to extended downtime and possible data exposure. To avoid similar incidents, management has made proactive monitoring and automated remediation a top priority. The solution involves building a comprehensive monitoring and auto-remediation system using Amazon CloudWatch, AWS Lambda, and Amazon GuardDuty. This setup will automatically detect and address both performance issues and security threats across development and production environments. By implementing this project, I have gained hands-on experience with AWS monitoring tools and security incident response, which are key skills for any cloud support professional.
What we are actually doing
Simulate a system issue in a controlled environment and set up CloudWatch alarms and a Lambda auto-response.
Simulate a security threat against CloudGuard’s security posture using an Nmap scan and implement a remediation action.
Simulate a System Issue in a Controlled Environment and Set Up CloudWatch Alarms and Lambda Auto-Response
On the Dev Instance: The 'stress' Tool
The stress tool is a workload generator used to put artificial load on CPU, memory, I/O, and disk to test system performance and monitoring.
sudo yum install stress -y
sudo stress --cpu 8 --timeout 30
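How to use:
Running the above commands installs stress and starts 8 CPU-intensive worker processes. As written, --timeout 30 stops them after 30 seconds; to sustain the load for 5 minutes, use --timeout 300 instead. Either way, 8 workers easily push CPU utilization above 85% on a small instance like t2.micro.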
Why it’s valuable:
Safely tests monitoring systems without real workloads
Automatically stops after the timeout to avoid wasted resources
Simulates realistic CPU spikes (e.g., runaway processes, DDoS attacks)
On the Prod Instance: 'util-linux' Tools
util-linux is a core Linux package that provides essential system utilities, including the fallocate command, which we’ll use to quickly create large files without actually writing data to disk.
sudo yum install util-linux -y
fallocate -l 6G /home/ec2-user/fakefile
How we’ll use it:
Running the above command will:
Instantly create a 6GB file in the home directory
Reserve disk space without physically writing data
Complete much faster than methods like dd
On a small t2.micro instance with an 8GB volume, push disk usage past the 80% threshold
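As a rough sanity check on the last point, the arithmetic can be sketched in Python. The baseline of about 1.5 GiB already used by the operating system is an assumption for illustration, not a figure from this project, and projected_disk_percent is a hypothetical helper name.

```python
import shutil


def projected_disk_percent(total_gib: float, used_gib: float, extra_gib: float) -> float:
    """Return the disk usage percentage after adding extra_gib of data."""
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    # Usage cannot exceed the size of the volume, so cap the result at 100%.
    return min(100.0, (used_gib + extra_gib) / total_gib * 100.0)


def current_disk_percent(path: str = "/") -> float:
    """Measure the real usage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


if __name__ == "__main__":
    # Assumed baseline: ~1.5 GiB used on an 8 GiB root volume (hypothetical).
    print(f"projected after fallocate: {projected_disk_percent(8, 1.5, 6):.1f}%")
    print(f"this machine right now:    {current_disk_percent('/'):.1f}%")
```

Under that assumed baseline the 6GB file lands well above the 80% alarm threshold, which is why a single fallocate call is enough for the test.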
📌 Why it’s valuable:
Efficiently simulates low-disk scenarios without heavy I/O load
Avoids long-running file writes since allocation happens instantly
Mimics real-world issues such as log growth or malicious disk-filling activity
Safety Considerations:
CPU stress ends automatically once the timeout is reached
Disk space is freed instantly by deleting the test file
Neither tool causes a lasting impact on the system
Tests are designed to trigger alerts without disrupting services
Together, these tools provide a controlled and safe method to validate monitoring and automated remediation against real-world scenarios
Installation of CloudWatch Agent on EC2
The CloudWatch Agent is installed on both EC2 instances to collect custom metrics. It reports memory usage, disk space, and detailed system performance data, which EC2 does not publish by default and which are essential for comprehensive monitoring.
sudo yum install amazon-cloudwatch-agent -y
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
Go through the interactive configuration process in the wizard.
Now, load the generated configuration and start the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
sudo systemctl status amazon-cloudwatch-agent
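For orientation, here is a minimal sketch of the kind of configuration the wizard writes to config.json, built as a Python dictionary. It is not the exact file the wizard generated in this project; the choice of measurements and the 60-second interval are assumptions, and build_agent_config is a hypothetical helper name.

```python
import json


def build_agent_config(interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch Agent config that reports disk and memory usage."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "metrics_collected": {
                # "used_percent" is published to CloudWatch as disk_used_percent.
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                # EC2 does not report memory itself, so the agent collects it.
                "mem": {"measurement": ["mem_used_percent"]},
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON; on the instance it would be saved as the config.json
    # file that the fetch-config command points to.
    print(json.dumps(build_agent_config(), indent=2))
```

Metrics collected this way appear in the CWAgent namespace by default, which is where the disk alarm later looks for disk_used_percent.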
Set Up CloudWatch Alarms and Lambda Auto-Response
Step 1: Set Up CloudWatch Alarms
🔹Setting Up a High CPU Usage Alarm on Dev Instance
Create an alarm and select the CPUUtilization metric for the Dev instance.
Set the statistic to Average with a 1-minute period (1-minute data points require detailed monitoring; with basic monitoring, EC2 publishes CPU metrics every 5 minutes).
Define condition: trigger when CPU is ≥ 85% (since this level may impact performance).
Configure actions: create an SNS topic (EC2-Alarms), add notification emails, and confirm via AWS verification email.
Name the alarm DevInstance-HighCPU and add a description if needed.
👉 The alarm will notify you whenever your Dev instance’s average CPU usage is at or above 85% over a 1-minute period.
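The same alarm could also be created through the API rather than the console. The sketch below builds the arguments for boto3's put_metric_alarm call from the console settings listed above; the instance ID and topic ARN are placeholders, and build_cpu_alarm and create_alarm are hypothetical helper names.

```python
def build_cpu_alarm(instance_id: str, topic_arn: str) -> dict:
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": "DevInstance-HighCPU",
        "AlarmDescription": "Dev instance CPU at or above 85%",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # seconds, matching the 1-minute period
        "EvaluationPeriods": 1,
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified when the alarm fires
    }


def create_alarm(params: dict) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # Placeholder identifiers; replace them before calling create_alarm().
    params = build_cpu_alarm(
        "i-0123456789abcdef0",
        "arn:aws:sns:us-east-1:123456789012:EC2-Alarms",
    )
    print(params)
```

Keeping the definition in code makes it easy to recreate the alarm for another instance by changing only the instance ID.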
🔹Setting Up a Low Disk Alarm on Prod Instance
✅ To check whether the agent is publishing disk metrics, tail its log:
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
⚠️ Watch the log output for error messages indicating that disk metrics are not being collected.
✅ Tip: Ensure the CloudWatch configuration file for disk metrics is applied, then restart the CloudWatch Agent to start collecting disk usage data.
Create an alarm and, in the metrics browser, search for CWAgent, then select the device, fstype, host, path dimension set.
Filter by disk_used_percent and choose your Prod instance.
Configure metric: Average, 1-minute period.
Set condition: Static threshold ≥ 80% for disk_used_percent.
- This level is chosen because performance often degrades above 80%, giving you time to act before reaching critical capacity.
Actions: when the alarm enters the In alarm state, send a notification via the EC2-Alarms SNS topic.
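One detail worth spelling out is that an alarm on a CWAgent metric has to name every dimension of the metric exactly as the agent publishes it. The sketch below builds such a definition; the alarm name ProdInstance-LowDisk and the example host, device, and filesystem values are assumptions, since the article does not list them.

```python
def build_disk_alarm(host: str, device: str, fstype: str, topic_arn: str,
                     path: str = "/") -> dict:
    """Return put_metric_alarm arguments for the agent's disk_used_percent metric."""
    dimensions = {"device": device, "fstype": fstype, "host": host, "path": path}
    return {
        "AlarmName": "ProdInstance-LowDisk",  # hypothetical name
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        # All four dimensions must match the published metric exactly.
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


if __name__ == "__main__":
    # Example values only; copy the real ones from the CloudWatch metrics browser.
    # The result can be passed to boto3.client("cloudwatch").put_metric_alarm(**params).
    params = build_disk_alarm(
        host="ip-172-31-0-10.ec2.internal",
        device="xvda1",
        fstype="xfs",
        topic_arn="arn:aws:sns:us-east-1:123456789012:EC2-Alarms",
    )
    print(params)
```

If any dimension value differs from what the agent sends, the alarm stays in the Insufficient data state instead of evaluating the threshold.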
Step 2: Create an Automated Response with Lambda
- Create a Lambda function with a new execution role, attach the AmazonEC2ReadOnlyAccess policy, and add an inline policy to allow EC2 instance tagging.
👉 Why these permissions? The Lambda function only needs to read EC2 details and apply tags for tracking. Granting minimal access follows the principle of least privilege.
- Develop a Lambda function that processes SNS alarm notifications, determines the impacted EC2 instance and the type of issue, and applies an “Issue” tag with the value “HighCPU” or “LowDisk” accordingly.
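A minimal sketch of such a function is shown below. It is not the exact code used in this project: it assumes the alarm carries an InstanceId dimension and decides the issue type from the metric name, and parse_alarm and ISSUE_BY_METRIC are hypothetical names. An alarm whose dimensions only include host would need an extra lookup that this sketch does not cover.

```python
import json

# Map the alarm's metric to the value written into the "Issue" tag.
ISSUE_BY_METRIC = {"CPUUtilization": "HighCPU", "disk_used_percent": "LowDisk"}


def parse_alarm(event: dict):
    """Return (instance_id, issue) from an SNS-delivered CloudWatch alarm event.

    Returns (None, None) unless the alarm has just entered the ALARM state.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None, None
    trigger = message.get("Trigger", {})
    issue = ISSUE_BY_METRIC.get(trigger.get("MetricName"))
    instance_id = None
    for dim in trigger.get("Dimensions", []):
        # Alarm notifications use lowercase "name"/"value" keys here.
        if dim.get("name") == "InstanceId":
            instance_id = dim.get("value")
    return instance_id, issue


def lambda_handler(event, context):
    """Tag the affected instance with Issue=HighCPU or Issue=LowDisk."""
    instance_id, issue = parse_alarm(event)
    if not instance_id or not issue:
        return {"tagged": False, "reason": "instance or issue type not found"}

    import boto3  # available in the Lambda Python runtime

    boto3.client("ec2").create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "Issue", "Value": issue}],
    )
    return {"tagged": True, "instance": instance_id, "issue": issue}
```

Separating the parsing from the tagging call keeps the part that is most likely to break (reading the SNS message) easy to test without AWS access.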
Step 3: Connecting Lambda to SNS
Subscribing your Lambda function to the SNS topic allows CloudWatch alarm notifications to be sent as messages to the function, which then extracts the alarm details and name to determine the nature of the issue.
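The console performs two steps when you add this subscription: it subscribes the function to the topic and grants SNS permission to invoke it. A hedged sketch of the equivalent boto3 calls follows; the ARNs are placeholders and the statement ID AllowSnsInvoke is an arbitrary, hypothetical label.

```python
def build_subscription(topic_arn: str, function_arn: str) -> dict:
    """Arguments for sns.subscribe: deliver topic messages to the function."""
    return {"TopicArn": topic_arn, "Protocol": "lambda", "Endpoint": function_arn}


def build_invoke_permission(topic_arn: str, function_arn: str) -> dict:
    """Arguments for lambda.add_permission: let this topic invoke the function."""
    return {
        "FunctionName": function_arn,
        "StatementId": "AllowSnsInvoke",  # hypothetical label for the policy statement
        "Action": "lambda:InvokeFunction",
        "Principal": "sns.amazonaws.com",
        "SourceArn": topic_arn,  # restricts the permission to this one topic
    }


def connect(topic_arn: str, function_arn: str) -> None:
    """Apply both steps (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builders above work without the SDK

    boto3.client("lambda").add_permission(**build_invoke_permission(topic_arn, function_arn))
    boto3.client("sns").subscribe(**build_subscription(topic_arn, function_arn))


if __name__ == "__main__":
    topic = "arn:aws:sns:us-east-1:123456789012:EC2-Alarms"  # placeholder
    function = "arn:aws:lambda:us-east-1:123456789012:function:tag-on-alarm"  # placeholder
    print(build_invoke_permission(topic, function))
    print(build_subscription(topic, function))
```

Without the invoke permission the subscription exists but deliveries to the function fail, which is a common reason for a Lambda that never runs.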
Monitoring System Issues
Triggering a High CPU Utilization Event on Dev Instance
Let’s generate load with the stress tool to artificially create high CPU utilization.
Response in CloudWatch Console
Alarm notification on Email
Verify Tagging
Triggering a Low Disk Space Event
Create a large file to quickly fill up disk space using the fallocate command.
Response seen in CloudWatch Console
Email notification for alarm
Key Takeaways
By simulating system events, we verified that:
CloudWatch alarms successfully detect critical conditions.
SNS topics deliver notifications to email recipients as expected.
Lambda functions are triggered by alarm state changes.
Auto-remediation code runs correctly and applies issue tags to resources.
EC2 tagging provides an audit trail for tracking incidents.
👉 With this setup, CloudGuard has achieved proactive monitoring with automated responses. Instead of discovering problems after damage occurs (as in the past breach), the team now gets immediate alerts and automatic tagging for faster tracking and investigation.
Setting Up Amazon GuardDuty and Simulating Security Threats for CloudGuard
We simulate a security threat against CloudGuard by running Nmap scans. When used aggressively, this activity can resemble malicious behavior, providing a realistic scenario for proactive monitoring. To strengthen CloudGuard’s security posture, we enable Amazon GuardDuty, which offers automated threat detection and helps identify suspicious activity early.
sudo nmap -Pn -p 1-1000 -T4 -A [TARGET-EC2-IP]
1. Analyzing the GuardDuty Finding
Under the port scanning detection finding:
Critical Details
Finding Type: Recon:EC2/PortProbeUnprotectedPort (indicates reconnaissance activity)
Severity Level: MEDIUM (requires investigation but not immediate emergency response)
Source IP: 18.234.62.77 (identifies the origin of the scan)
Target Instance IP: 54.91.149.5 (identifies the affected resource)
Ports Scanned: 1-1000 (indicates breadth of reconnaissance)
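The same details can be pulled out of a finding programmatically. The sketch below assumes the field layout returned by the GuardDuty GetFindings API; the sample finding is a trimmed, hand-written stand-in that reuses the finding type and source IP listed above, while its numeric severity, port, and instance ID are placeholders. summarize_finding and severity_label are hypothetical helper names.

```python
def severity_label(score: float) -> str:
    """Map GuardDuty's numeric severity to the label shown in the console."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    return "LOW"


def summarize_finding(finding: dict) -> dict:
    """Extract type, severity, target instance, source IPs, and probed ports."""
    action = finding.get("Service", {}).get("Action", {})
    probes = action.get("PortProbeAction", {}).get("PortProbeDetails", [])
    source_ips = {p.get("RemoteIpDetails", {}).get("IpAddressV4") for p in probes}
    ports = {p.get("LocalPortDetails", {}).get("Port") for p in probes}
    return {
        "type": finding.get("Type"),
        "severity": severity_label(float(finding.get("Severity", 0))),
        "instance_id": finding.get("Resource", {}).get("InstanceDetails", {}).get("InstanceId"),
        "source_ips": sorted(ip for ip in source_ips if ip),
        "ports": sorted(port for port in ports if port is not None),
    }


if __name__ == "__main__":
    # Hand-written sample, not real GuardDuty output.
    sample = {
        "Type": "Recon:EC2/PortProbeUnprotectedPort",
        "Severity": 5.0,
        "Resource": {"InstanceDetails": {"InstanceId": "i-0123456789abcdef0"}},
        "Service": {"Action": {"PortProbeAction": {"PortProbeDetails": [
            {"LocalPortDetails": {"Port": 22},
             "RemoteIpDetails": {"IpAddressV4": "18.234.62.77"}},
        ]}}},
    }
    print(summarize_finding(sample))
```

A summary like this is a convenient input for later automation, for example deciding which source IP to block.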
2. Investigating the Root Cause
Examine system logs on the affected Prod instance
Check SSH connection logs
sudo journalctl -u sshd | grep -i "connect"
Result:
The instance is verifying SSH keys via EC2 Instance Connect when someone tries to log in.
The lines show external IPs attempting to connect to the EC2 instance, for example Connection from 3.131.215.38 port 41482. These entries were triggered by my nmap scan, which completes a TCP handshake to check whether the port is open; the connection is then closed by the remote host.
Check auth-related logs
sudo journalctl -t sshd
Result:
It shows successful logins using the EC2 key pair (public key authentication).
There are also failed or malicious connection attempts:
kex_exchange_identification errors: the client closed the connection or sent invalid data before completing the SSH identification exchange. Often seen during port scans (e.g., nmap) or botnet probing.
Invalid user → attacker tried logging in with a non-existent username.
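When the log is long, counting connections per source IP makes the scanner stand out. The sketch below is one way to do that, assuming the log contains lines with the Connection from ... port ... text quoted above; the script name and count_connections are hypothetical.

```python
import re
import sys
from collections import Counter

# Matches the "Connection from <IPv4> port <number>" text seen in the sshd log.
CONNECTION_RE = re.compile(r"Connection from (\d{1,3}(?:\.\d{1,3}){3}) port (\d+)")


def count_connections(lines) -> Counter:
    """Count how many connection entries each source IP produced."""
    counts = Counter()
    for line in lines:
        match = CONNECTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage (after saving the log with: sudo journalctl -u sshd > sshd.log):
    #   python3 count_connections.py sshd.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for ip, total in count_connections(handle).most_common():
                print(f"{ip}\t{total}")
```

An IP with many connections in a short window, and no successful login, is consistent with a port scan rather than normal administration.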
Next, review the Security Groups to identify which ports are intentionally accessible.
Conclusion
The analysis of SSH and authentication logs confirms that the observed connection attempts were caused by Nmap scanning the instance. The TCP handshake attempts, kex_exchange_identification errors, and connections from external IPs correspond to typical port-scan behavior rather than unauthorized login attempts. All successful logins were via the expected EC2 key pair, verifying that this activity was part of a controlled testing scenario. Investigation has also confirmed the source.
3. Implementing Remediation Actions
While Security Groups provide stateful filtering at the instance level, Network Access Control Lists (NACLs) offer an additional layer of stateless network security at the subnet level.
Implement a NACL to block suspicious traffic
Validate the remediation again by attempting another scan
🌟 How does implementing a NACL help here?
Block unwanted IPs: We can deny traffic from suspicious IP addresses that triggered the port scans.
Restrict access to critical ports: Even if a Security Group allows SSH (port 22) or other services, the NACL can add another layer of filtering.
Mitigate reconnaissance: A NACL cannot recognize scan patterns by itself, but denying the scanner’s source range and unused port ranges limits what tools like Nmap can reach.
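As an illustration of the first point, a deny rule for a single source IP can be expressed as arguments to the EC2 create_network_acl_entry call. This is a sketch under assumptions: the NACL ID is a placeholder, rule number 90 is chosen only because it is evaluated before the default allow rule 100, and build_deny_entry is a hypothetical helper name.

```python
import ipaddress


def build_deny_entry(nacl_id: str, source_ip: str, rule_number: int = 90) -> dict:
    """Return create_network_acl_entry arguments that deny all inbound traffic from one IP."""
    # IPv4Address raises ValueError if source_ip is not a valid IPv4 address.
    cidr = f"{ipaddress.IPv4Address(source_ip)}/32"
    return {
        "NetworkAclId": nacl_id,
        "RuleNumber": rule_number,  # lower numbers are evaluated first
        "Protocol": "-1",  # -1 means all protocols
        "RuleAction": "deny",
        "Egress": False,  # inbound rule
        "CidrBlock": cidr,
    }


def apply_entry(params: dict) -> None:
    """Create the rule (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    boto3.client("ec2").create_network_acl_entry(**params)


if __name__ == "__main__":
    # Source IP taken from the GuardDuty finding; the NACL ID is a placeholder.
    print(build_deny_entry("acl-0123456789abcdef0", "18.234.62.77"))
```

Because NACL rules are processed in ascending order, the deny rule only takes effect if its number is lower than any rule that allows the same traffic.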
Key Takeaways
Layered Security: Employing multiple security measures, such as GuardDuty, Security Groups, and NACLs, offers stronger protection than relying on a single control.
Swift Action: Minimizing the time between detection and remediation is crucial, as every minute of delay increases potential risk.
Verified Remediation: Reviewed system logs for evidence of the scan and validated remediation effectiveness with a follow-up test.
Future Recommendations
Extend Auto-Remediation: Enhance the Lambda function to perform automated remediation actions such as restarting services or scaling resources.
Improve Segregation: Use separate VPCs with a controlled transit gateway connection and implement stricter network segmentation between Dev and Prod environments.
Advanced Security Integration: Leverage AWS Security Hub, develop incident runbooks, and provide security training for all CloudGuard engineers.
Enhance Detection: Use Amazon EventBridge (formerly CloudWatch Events) rules to trigger Lambda for automatic remediation on GuardDuty alerts and leverage AWS Config to continuously audit security group configurations.
Document Incidents: Writing an incident report for each event is crucial to maintaining and improving CloudGuard’s security posture.
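As a starting point for the detection recommendation, the sketch below shows what an event rule that forwards GuardDuty reconnaissance findings to a Lambda function could look like. It was not built in this project; the rule name, target ID, and function ARN are placeholders, and build_guardduty_pattern and create_rule are hypothetical helper names.

```python
import json


def build_guardduty_pattern(type_prefix: str = "Recon:EC2/") -> dict:
    """Event pattern matching GuardDuty findings whose type starts with type_prefix."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"type": [{"prefix": type_prefix}]},
    }


def create_rule(rule_name: str, lambda_arn: str) -> None:
    """Create the rule and point it at the function (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the pattern builder works without the SDK

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_guardduty_pattern()),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "remediation-lambda", "Arn": lambda_arn}],
    )


if __name__ == "__main__":
    print(json.dumps(build_guardduty_pattern(), indent=2))
```

The function would also need a resource policy that allows events.amazonaws.com to invoke it, similar to the permission granted to SNS for the alarm workflow.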
Written by Sirsha Thapa
