Over the past few days, I’ve been following Lucy Wang’s AWS Cloud Project to Become a Cloud Engineer to strengthen my cloud expertise and develop Cloud Support Engineer skills. This is the third project in the series, where I’m sharing my learnings and implementation.

Overview of Project

CloudGuard, a financial services firm, recently suffered a security breach after their operations team failed to identify abnormal system activity in time. This led to extended downtime and possible data exposure. To avoid similar incidents, management has made proactive monitoring and automated remediation a top priority. The solution involves building a comprehensive monitoring and auto-remediation system using AWS CloudWatch, Lambda, and GuardDuty. This setup will automatically detect and address both performance issues and security threats across development and production environments. By implementing this project, I gained hands-on experience with AWS monitoring tools and security incident response, both key skills for any cloud support professional.

What we are actually doing

Simulate a system issue in a controlled environment and set up CloudWatch alarms and a Lambda auto-response.
Simulate a security threat against CloudGuard’s security posture using an Nmap scan and implement a remediation action.

Simulate a system issue in a Controlled environment and set CloudWatch alarms and Lambda Auto Response

**On the Dev Instance: The 'stress' Tool**

The stress tool is a workload generator used to put artificial load on CPU, memory, I/O, and disk to test system performance and monitoring.

sudo yum install stress -y
sudo stress --cpu 8 --timeout 30

How to use:

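Running the above command creates 8 CPU-intensive worker processes, easily pushing CPU utilization above 85% on a small instance like t2.micro. Note that --timeout 30 stops the workers after 30 seconds; to hold the load for 5 minutes, long enough for a 1-minute alarm period to register reliably, use --timeout 300.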

Why it’s valuable:

Safely tests monitoring systems without real workloads
Automatically stops after the timeout to avoid wasted resources
Simulates realistic CPU spikes (e.g., runaway processes, DDoS attacks)

**On the Prod Instance: 'util-linux' Tools**

util-linux is a core Linux package that provides essential system utilities, including the fallocate command, which we’ll use to quickly create large files without actually writing data to disk.

sudo yum install util-linux -y
fallocate -l 6G /home/ec2-user/fakefile

How we’ll use it:
Running the above command will:

Instantly create a 6GB file in the home directory
Reserve disk space without physically writing data
Complete much faster than methods like dd
Push disk usage past the 80% threshold on a small t2.micro instance with an 8GB volume
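
To see why a 6GB file is enough, here is the arithmetic as a small sketch. The 1.5 GB baseline is a hypothetical figure for what the OS already occupies, not a measurement from my instance, and the function name is my own.

```python
def projected_disk_percent(volume_gb: float, used_gb: float, file_gb: float) -> float:
    """Return disk usage (%) after allocating a file of file_gb on the volume."""
    if volume_gb <= 0:
        raise ValueError("volume_gb must be positive")
    return round((used_gb + file_gb) / volume_gb * 100, 2)


# Hypothetical baseline: 1.5 GB already used on an 8 GB volume.
baseline = projected_disk_percent(8, 1.5, 0)
after_fallocate = projected_disk_percent(8, 1.5, 6)
print(f"before: {baseline}%  after: {after_fallocate}%")
```

If the baseline on your instance is much higher, a smaller file may be needed so the request still fits in the free space.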

📌 Why it’s valuable:

Efficiently simulates low-disk scenarios without heavy I/O load
Avoids long-running file writes since allocation happens instantly
Mimics real-world issues such as log growth or malicious disk-filling activity

Safety Considerations:

CPU stress ends automatically once the timeout is reached
Disk space is freed instantly by deleting the test file
Neither tool causes a lasting impact on the system
Tests are designed to trigger alerts without disrupting services
Provide a controlled and safe method to validate monitoring and automated remediation against real-world scenarios

Installation of CloudWatch Agent on EC2

The CloudWatch Agent is installed on both EC2 instances to collect custom metrics. It reports memory usage, disk space, and detailed system performance data that the default EC2 metrics don’t cover and that comprehensive monitoring depends on.

sudo yum install amazon-cloudwatch-agent -y
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

Go through the wizard’s interactive configuration process.

Now,

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

sudo systemctl status amazon-cloudwatch-agent

Set Up CloudWatch Alarms and Lambda Auto-Response

Step 1: Set Up CloudWatch Alarms

🔹Setting Up a High CPU Usage Alarm on Dev Instance

Create an alarm and select the CPUUtilization metric for the Dev instance.
Set the statistic to Average with a 1-minute period.
Define the condition: trigger when CPU is ≥ 85% (since this level may impact performance).
Configure actions: create an SNS topic (EC2-Alarms), add notification emails, and confirm the subscription via the AWS verification email.
Name the alarm DevInstance-HighCPU and add a description if needed.

👉 The alarm will notify you whenever your Dev instance’s CPU usage stays at or above 85% for 1 minute.
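
I built this alarm in the console, but the same settings can be expressed in code. Below is a minimal sketch using boto3’s put_metric_alarm; the instance ID and topic ARN are placeholders, and the helper names are my own.

```python
def build_cpu_alarm(instance_id: str, topic_arn: str) -> dict:
    """Keyword arguments for CloudWatch put_metric_alarm, mirroring the console steps."""
    return {
        "AlarmName": "DevInstance-HighCPU",
        "AlarmDescription": "CPU at or above 85% on the Dev instance",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,              # 1-minute period
        "EvaluationPeriods": 1,    # one breaching period is enough
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Send the alarm definition to CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers, not values from the project.
cpu_alarm = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:EC2-Alarms",
)
```

Calling create_alarm(cpu_alarm) would create or update the alarm; keeping the builder separate makes the settings easy to review before anything is sent.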

🔹Setting Up a Low Disk Alarm on Prod Instance

💡

CPU utilization is a default EC2 metric in CloudWatch, but low disk space isn’t. Therefore, we need to verify that CloudWatch Agent is properly installed and configured on the EC2 instance.

✅ To Check

sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

⚠️The key error message is:

💡

This issue occurs when the CloudWatch Agent lacks permissions to publish metrics to CloudWatch. To fix it, create a new IAM role with the CloudWatchAgentServerPolicy attached and assign it to the Prod EC2 instance.

✅Tip: Ensure the CloudWatch configuration file for disk metrics is applied, then restart the CloudWatch Agent to start collecting disk usage data.
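
For reference, a minimal agent configuration that collects disk usage might look like the sketch below. It is generated from Python only so it can be checked as valid JSON; the wizard on my instance produced a larger file, and the choice of "/" as the only monitored path is an assumption.

```python
import json

# Minimal CloudWatch Agent config: report used_percent for the root filesystem.
# The agent publishes this as disk_used_percent in the CWAgent namespace.
agent_config = {
    "agent": {"metrics_collection_interval": 60},
    "metrics": {
        "metrics_collected": {
            "disk": {
                "measurement": ["used_percent"],
                "resources": ["/"],  # assumption: only the root volume matters here
            }
        }
    },
}

config_text = json.dumps(agent_config, indent=2)
print(config_text)
```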

Create an alarm and, in the metrics browser, search for CWAgent, then select the device, fstype, host, path dimension set.

Filter by disk_used_percent and choose your Prod instance.
Configure the metric: Average, 1-minute period.
Set the condition: static threshold ≥ 80% for disk_used_percent.
- This level is chosen because performance often degrades above 80%, giving you time to act before reaching critical capacity.
Actions: when the alarm enters the In alarm state, send a notification via the EC2-Alarms SNS topic.
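
The part that tripped me up with agent metrics is that the alarm’s dimensions must match exactly what the agent published. The sketch below builds the disk alarm from a dimension set and includes a helper that asks CloudWatch which sets exist. The alarm name, device, fstype, and host values are hypothetical placeholders.

```python
def build_disk_alarm(dimensions: dict, topic_arn: str) -> dict:
    """Keyword arguments for put_metric_alarm on the agent's disk metric."""
    return {
        "AlarmName": "ProdInstance-LowDisk",  # hypothetical name
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def list_disk_dimension_sets() -> list:
    """Ask CloudWatch which dimension sets the agent actually published."""
    import boto3  # lazy import: only needed when talking to AWS
    response = boto3.client("cloudwatch").list_metrics(
        Namespace="CWAgent", MetricName="disk_used_percent"
    )
    return [metric["Dimensions"] for metric in response["Metrics"]]


# Placeholder dimension values; copy the real ones from list_disk_dimension_sets().
prod_dimensions = {
    "device": "xvda1",
    "fstype": "xfs",
    "host": "ip-172-31-0-10.ec2.internal",
    "path": "/",
}
disk_alarm = build_disk_alarm(
    prod_dimensions, "arn:aws:sns:us-east-1:123456789012:EC2-Alarms"
)
```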

Step 2: Create an Automated Response with Lambda

Create a Lambda function with a new execution role, attach the AmazonEC2ReadOnlyAccess policy, and add an inline policy to allow EC2 instance tagging.

👉 Why these permissions? The Lambda function only needs to read EC2 details and apply tags for tracking. Granting minimal access follows the principle of least privilege.
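
As an illustration, the inline tagging policy could look like the sketch below. The Sid and the wildcard resource ARN are my assumptions; scoping the resource down to specific instances would be tighter.

```python
import json

# Inline policy allowing the Lambda role to add tags to EC2 instances only.
tagging_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInstanceTagging",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": "ec2:CreateTags",
            "Resource": "arn:aws:ec2:*:*:instance/*",
        }
    ],
}

policy_json = json.dumps(tagging_policy, indent=2)
print(policy_json)
```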

Develop a Lambda function that processes SNS alarm notifications, determines the impacted EC2 instance and the type of issue, and applies an “Issue” tag with the value “HighCPU” or “LowDisk” accordingly.
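
A minimal sketch of such a handler is shown below. It is not the exact code from the course. It assumes the alarm message carries an InstanceId dimension; an alarm built on the agent’s host dimension would need an extra lookup from hostname to instance, which is where the read-only EC2 access helps. The sample event is trimmed to the fields the code reads.

```python
import json


def parse_alarm(event: dict) -> dict:
    """Extract alarm name, state, metric and instance ID from an SNS-delivered alarm."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})
    dims = {d["name"]: d["value"] for d in trigger.get("Dimensions", [])}
    return {
        "alarm_name": message.get("AlarmName", ""),
        "state": message.get("NewStateValue", ""),
        "metric": trigger.get("MetricName", ""),
        "instance_id": dims.get("InstanceId"),  # None if the dimension is absent
    }


def classify_issue(alarm: dict):
    """Map the metric or alarm name to the tag value used in this project."""
    text = f"{alarm['metric']} {alarm['alarm_name']}".lower()
    if "cpu" in text:
        return "HighCPU"
    if "disk" in text:
        return "LowDisk"
    return None


def lambda_handler(event, context):
    alarm = parse_alarm(event)
    issue = classify_issue(alarm)
    # Only tag when the alarm is firing and we know what and where to tag.
    if alarm["state"] != "ALARM" or not issue or not alarm["instance_id"]:
        return {"tagged": False, "alarm": alarm}
    import boto3  # available in the Lambda runtime; lazy so parsing is testable offline
    boto3.client("ec2").create_tags(
        Resources=[alarm["instance_id"]],
        Tags=[{"Key": "Issue", "Value": issue}],
    )
    return {"tagged": True, "instance_id": alarm["instance_id"], "issue": issue}


# Trimmed, illustrative event with a placeholder instance ID.
sample_event = {"Records": [{"Sns": {"Message": json.dumps({
    "AlarmName": "DevInstance-HighCPU",
    "NewStateValue": "OK",
    "Trigger": {
        "MetricName": "CPUUtilization",
        "Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}],
    },
})}}]}
```

Because the sample event’s state is OK, passing it to lambda_handler returns without calling AWS, which makes the parsing easy to check locally.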

Step 3: Connecting Lambda to SNS

Subscribing your Lambda function to the SNS topic allows CloudWatch alarm notifications to be sent as messages to the function, which then extracts the alarm details and name to determine the nature of the issue.
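
I wired this up through the console, which as far as I can tell also grants SNS permission to invoke the function. Done through the API, it takes two calls, sketched below with placeholder ARNs and a hypothetical function name and statement ID.

```python
def build_invoke_permission(topic_arn: str, function_name: str) -> dict:
    """Keyword arguments for Lambda add_permission: let this topic invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": "AllowSNSInvoke",  # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        "Principal": "sns.amazonaws.com",
        "SourceArn": topic_arn,
    }


def build_subscription(topic_arn: str, function_arn: str) -> dict:
    """Keyword arguments for SNS subscribe with the Lambda protocol."""
    return {"TopicArn": topic_arn, "Protocol": "lambda", "Endpoint": function_arn}


def connect(topic_arn: str, function_arn: str, function_name: str) -> None:
    """Grant permission first, then subscribe (needs AWS credentials)."""
    import boto3  # lazy import so the builders work offline
    boto3.client("lambda").add_permission(
        **build_invoke_permission(topic_arn, function_name)
    )
    boto3.client("sns").subscribe(**build_subscription(topic_arn, function_arn))


# Placeholder ARNs and function name.
topic = "arn:aws:sns:us-east-1:123456789012:EC2-Alarms"
function = "arn:aws:lambda:us-east-1:123456789012:function:tag-ec2-issue"
permission = build_invoke_permission(topic, "tag-ec2-issue")
subscription = build_subscription(topic, function)
```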

💡

This Lambda function demonstrates how we can automatically respond to system issues as soon as they’re detected, without requiring manual intervention.

Monitoring System Issues

Triggering a High CPU Utilization Event on Dev Instance

Let’s generate load with the stress tool to artificially create high CPU utilization.

Response in CloudWatch Console

Alarm notification on Email

Verify Tagging

Triggering a Low Disk Space Event

Create a large file to quickly fill up disk space using the fallocate command.

Response seen in CloudWatch Console

Email notification for alarm

Key Takeaways

By simulating system events, we verified that:

CloudWatch alarms successfully detect critical conditions.
SNS topics deliver notifications to email recipients as expected.
Lambda functions are triggered by alarm state changes.
Auto-remediation code runs correctly and applies issue tags to resources.
EC2 tagging provides an audit trail for tracking incidents.

👉 With this setup, CloudGuard has achieved proactive monitoring with automated responses. Instead of discovering problems after damage occurs (as in the past breach), the team now gets immediate alerts and automatic tagging for faster tracking and investigation.

Setting Up AWS GuardDuty and Simulating Security Threats for CloudGuard

We are simulating a security threat on CloudGuard by running nmap scans. When used aggressively, this activity can resemble malicious behavior, providing a realistic scenario for proactive monitoring. To strengthen CloudGuard’s security posture, we enable AWS GuardDuty, which offers automated threat detection and helps identify suspicious activity early.

sudo nmap -Pn -p 1-1000 -T4 -A [TARGET-EC2-IP]

💡

Here, the Dev instance performs an aggressive port scan against the Prod instance.

1. Analyzing the GuardDuty Finding

Under the Port Scanning Detection

Critical Details

Finding Type: Recon:EC2/PortProbeUnprotectedPort (indicates reconnaissance activity)
Severity Level: MEDIUM (requires investigation but not immediate emergency response)

Source IP: 18.234.62.77 (identifies the origin of the scan)
Target Instance IP: 54.91.149.5 (identifies the affected resource)
Ports Scanned: 1-1000 (indicates breadth of reconnaissance)
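
The same details can be pulled out of a finding programmatically. The sketch below assumes the field layout returned by GuardDuty’s GetFindings API for port probe findings; treat the key names as my best understanding rather than a verified schema. The sample finding is hand-written for illustration, with a placeholder instance ID and a numeric severity of 5 standing in for the MEDIUM label. Newer consoles may show a separate Critical label at the top of the range, which this sketch folds into HIGH.

```python
def severity_label(score: float) -> str:
    """Map GuardDuty's numeric severity to the label shown in the console."""
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    return "LOW"


def summarize_finding(finding: dict) -> dict:
    """Collect the fields reviewed during triage from one finding."""
    action = finding.get("Service", {}).get("Action", {})
    details = action.get("PortProbeAction", {}).get("PortProbeDetails", [])
    instance = finding.get("Resource", {}).get("InstanceDetails", {})
    return {
        "type": finding.get("Type"),
        "severity": severity_label(float(finding.get("Severity", 0))),
        "instance_id": instance.get("InstanceId"),
        "source_ips": sorted(
            {d["RemoteIpDetails"]["IpAddressV4"] for d in details if "RemoteIpDetails" in d}
        ),
        "ports": sorted(
            {d["LocalPortDetails"]["Port"] for d in details if "LocalPortDetails" in d}
        ),
    }


# Hand-written sample for illustration, not real GuardDuty output.
sample_finding = {
    "Type": "Recon:EC2/PortProbeUnprotectedPort",
    "Severity": 5,
    "Resource": {"InstanceDetails": {"InstanceId": "i-0123456789abcdef0"}},
    "Service": {"Action": {"PortProbeAction": {"PortProbeDetails": [
        {"LocalPortDetails": {"Port": 22},
         "RemoteIpDetails": {"IpAddressV4": "18.234.62.77"}},
    ]}}},
}
summary = summarize_finding(sample_finding)
```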

2. Investigating the Root Cause

Examine system logs on the affected Prod instance

  Check SSH connection logs
  sudo journalctl -u sshd | grep -i "connect"

Result:

The instance verifies SSH keys via EC2 Instance Connect when someone tries to log in.
The lines show external IPs attempting to connect to the EC2 instance, for example Connection from 3.131.215.38 port 41482. These entries are consistent with my nmap scan, which opens a TCP connection to check whether the port is open and then drops it, so the log records the connection as closed by the remote host.

    Check auth-related logs
    sudo journalctl -t sshd

Result:

It shows successful logins using the EC2 key pair (public key authentication).
There are also failed or malicious connection attempts:
kex_exchange_identification errors → the client sent malformed SSH handshake data. These are often seen during port scans (e.g., nmap) or botnet probing.

Invalid user → an attacker tried logging in with a non-existent username.
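
When the log is long, counting connections per source IP makes the scanner stand out. Here is a small sketch that does this for the "Connection from" lines. The sample input reuses the fragment quoted above rather than full journal output, and the function name is my own.

```python
import re
from collections import Counter

# Matches the "Connection from <ip> port <port>" fragment in sshd log lines.
CONNECTION_RE = re.compile(r"Connection from (\d{1,3}(?:\.\d{1,3}){3}) port (\d+)")


def count_sources(lines) -> Counter:
    """Count how many connection entries each source IP produced."""
    counts = Counter()
    for line in lines:
        match = CONNECTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Abbreviated fragments for illustration, not complete journalctl lines.
sample_lines = [
    "Connection from 3.131.215.38 port 41482",
    "a line without a connection entry",
]
sources = count_sources(sample_lines)
print(sources.most_common(5))
```

In practice the lines could come from piping journalctl output into the script and passing sys.stdin to count_sources.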

Let’s review the Security Groups to identify which ports are intentionally accessible.

Conclusion

The analysis of SSH and authentication logs confirms that the observed connection attempts were caused by Nmap scanning the instance. The TCP handshake attempts, kex_exchange_identification errors, and connections from external IPs correspond to typical port-scan behavior rather than unauthorized login attempts. All successful logins were via the expected EC2 key pair, verifying that this activity was part of a controlled testing scenario. Investigation has also confirmed the source.

3. Implementing Remediation Actions

While Security Groups provide stateful filtering at the instance level, Network Access Control Lists (NACLs) offer an additional layer of stateless network security at the subnet level.

Implement a NACL to block suspicious traffic

Validate the remediation again by attempting another scan

🌟 How does implementing a NACL help here?

Block unwanted IPs: We can deny traffic from suspicious IP addresses that triggered the port scans.
Restrict access to critical ports: Even if a Security Group allows SSH (port 22) or other services, the NACL can add another layer of filtering.
Mitigate reconnaissance: NACLs cannot inspect traffic patterns, but denying the scanner’s source IP or unused port ranges at the subnet boundary stops probes from tools like Nmap before they reach the instances.
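
As a sketch, the deny rule for the scanning IP could be created with boto3’s create_network_acl_entry. The NACL ID is a placeholder, and rule number 90 is my assumption: it only needs to be lower than the allow rule it should override, since NACL rules are evaluated in ascending order.

```python
import ipaddress


def build_deny_entry(nacl_id: str, source_ip: str, rule_number: int = 90) -> dict:
    """Keyword arguments for an inbound deny-all rule against a single IPv4 address."""
    address = ipaddress.IPv4Address(source_ip)  # raises ValueError on a malformed IP
    return {
        "NetworkAclId": nacl_id,
        "RuleNumber": rule_number,   # must sort before the allow rule it overrides
        "Protocol": "-1",            # all protocols
        "RuleAction": "deny",
        "Egress": False,             # inbound rule
        "CidrBlock": f"{address}/32",
    }


def apply_entry(entry: dict) -> None:
    """Create the rule in the NACL (needs AWS credentials)."""
    import boto3  # lazy import so the builder works offline
    boto3.client("ec2").create_network_acl_entry(**entry)


# Source IP from the GuardDuty finding; the NACL ID is a placeholder.
deny_entry = build_deny_entry("acl-0123456789abcdef0", "18.234.62.77")
```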

Key Takeaways

Layered Security: Employing multiple security measures, such as GuardDuty, Security Groups, and NACLs, offers stronger protection than relying on a single control.
Swift Action: Minimizing the time between detection and remediation is crucial, as every minute of delay increases potential risk.
Verification: I reviewed system logs for evidence of the scan and validated the remediation’s effectiveness with a follow-up test.

Future Recommendation

Automated Remediation: In the future, we can enhance the Lambda function to perform remediation actions such as restarting services or scaling resources.
Improve Segregation: Use separate VPCs with a controlled Transit Gateway connection and implement stricter network segmentation between the Dev and Prod environments.
Advanced Security Integration: Leverage AWS Security Hub, develop incident runbooks, and provide security training for all CloudGuard engineers.
Enhance Detection: Use EventBridge (formerly CloudWatch Events) rules to trigger Lambda for automatic remediation on GuardDuty alerts and leverage AWS Config to continuously audit security group configurations.
Documentation: Writing up each incident in a report is crucial for maintaining and improving CloudGuard’s security posture.
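
For the GuardDuty-to-Lambda recommendation, a starting point could be an EventBridge (the current name for CloudWatch Events) rule like the sketch below. I have not deployed this as part of the project; the rule name and the severity cut-off of 4 are assumptions.

```python
import json


def guardduty_pattern(min_severity: float = 4.0) -> dict:
    """Event pattern matching GuardDuty findings at or above a severity score."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def build_rule(name: str, min_severity: float = 4.0) -> dict:
    """Keyword arguments for EventBridge put_rule."""
    return {
        "Name": name,
        "EventPattern": json.dumps(guardduty_pattern(min_severity)),
        "State": "ENABLED",
    }


def create_rule(rule: dict, lambda_arn: str) -> None:
    """Create the rule and point it at a Lambda function (needs AWS credentials)."""
    import boto3  # lazy import so the builders work offline
    events = boto3.client("events")
    events.put_rule(**rule)
    events.put_targets(
        Rule=rule["Name"], Targets=[{"Id": "remediation-lambda", "Arn": lambda_arn}]
    )


# Hypothetical rule name.
rule = build_rule("guardduty-medium-and-above")
```

The target Lambda would also need a resource policy that lets events.amazonaws.com invoke it, similar to the SNS permission used earlier in the project.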

Proactive Monitoring & Security Auto-Remediation for EC2

Sirsha Thapa
