Introduction to AWS CloudWatch

What is AWS CloudWatch?

AWS CloudWatch is a monitoring and observability service that helps you track your AWS resources' performance and operational health.

You can think of CloudWatch as a "gatekeeper" that watches activities in your AWS cloud. It collects and tracks metrics, logs, and alarms to ensure your infrastructure runs smoothly.

Problems CloudWatch Solves

Real-time Monitoring – Helps track AWS resource usage.
Automatic Alarming – Notifies you when thresholds are exceeded.
Complex Log Management – Stores and analyzes logs for troubleshooting.
Custom Metrics – Enables tracking of non-default AWS metrics.
Cost Optimization – Identifies underutilized resources so you can right-size them and reduce spending.
Scaling Integration – Works with AWS Auto Scaling to adjust resources dynamically.
Lack of Visibility – Shows how each AWS resource is performing.
No Real-time Alerting – Alerts you as soon as resources exceed thresholds.

Key Features of AWS CloudWatch

Feature 1: Monitoring

Tracks AWS services & applications.
Examples:
- CPU utilization of an EC2 instance.
- Number of API calls in AWS Lambda.
- Memory consumption of a running application.

Feature 2: Metrics (Default & Custom)

A metric is a time-ordered series of data points that represents resource performance, such as CPU usage, memory consumption, or API request count.
Default Metrics (Auto-generated by AWS):
- EC2 CPU utilization
- S3 bucket request count
- Lambda function invocation count
Custom Metrics (User-defined):
- Memory consumption (not auto-tracked)
- API response time
- Application error rates
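
Because memory consumption is not tracked automatically, you have to publish it yourself. The following is a minimal sketch of how that could look with boto3's `put_metric_data`. The namespace `Demo/App`, the metric name `MemoryUsedPercent`, and the helper names are assumptions made for illustration, and reading `/proc/meminfo` assumes a Linux instance.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100.0 * (total - available) / total, 2)


def build_metric_payload(instance_id, used_percent):
    """Keyword arguments for put_metric_data (names are illustrative)."""
    return {
        "Namespace": "Demo/App",  # hypothetical namespace
        "MetricData": [{
            "MetricName": "MemoryUsedPercent",  # hypothetical metric name
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": used_percent,
            "Unit": "Percent",
        }],
    }


def publish_memory_metric(instance_id):
    """Publish one datapoint. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    with open("/proc/meminfo") as f:
        used = memory_used_percent(parse_meminfo(f.read()))
    payload = build_metric_payload(instance_id, used)
    boto3.client("cloudwatch").put_metric_data(**payload)
```

Running `publish_memory_metric` on a schedule (for example from cron) would produce a custom metric that alarms can watch just like a default one.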

Feature 3: Alarms

Alarms send a notification or trigger an action when a metric crosses a predefined threshold.
Examples:
- If CPU utilization exceeds 80%, send an email notification.
- If free memory drops below 500MB, trigger autoscaling.

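Feature 4: Logs and Logs Insights

Collects, stores, and analyzes logs from AWS services for debugging and auditing.
Examples:
- Failed login attempts on an EC2 instance (once the instance ships its auth log to CloudWatch Logs).
- API calls made to S3 buckets (for example, CloudTrail events delivered to CloudWatch Logs).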
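
As an illustration of the failed-login example, here is a sketch of a Logs Insights query started through boto3. The log group name you pass in and the `Failed password` pattern (the message sshd usually writes) are assumptions; adjust both to whatever your instances actually ship.

```python
import time

# Logs Insights query: newest 20 log lines that contain "Failed password".
FAILED_LOGIN_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /Failed password/ "
    "| sort @timestamp desc "
    "| limit 20"
)


def build_query_request(log_group, minutes_back, now=None):
    """Keyword arguments for start_query, covering the last `minutes_back` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes_back * 60,  # epoch seconds
        "endTime": end,
        "queryString": FAILED_LOGIN_QUERY,
    }


def run_query(log_group, minutes_back=60):
    """Start the query and poll until it finishes. Needs boto3 and credentials."""
    import boto3

    logs = boto3.client("logs")
    request = build_query_request(log_group, minutes_back)
    query_id = logs.start_query(**request)["queryId"]
    finished = ("Complete", "Failed", "Cancelled", "Timeout", "Unknown")
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in finished:
            return response["results"]
        time.sleep(1)
```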

Feature 5: Cost Optimization

Helps reduce AWS costs by identifying underused resources.
Examples:
- Finds idle EC2 instances so they can be stopped (an alarm can also stop them automatically).
- Identifies over-provisioned RDS databases.
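
To make "idle" concrete, the sketch below flags instances whose average CPU never rises above a small threshold. The 5% threshold, the minimum number of datapoints, and the function name are assumptions chosen for illustration; the input is assumed to be average CPUUtilization values you have already fetched from CloudWatch.

```python
def find_idle_instances(cpu_by_instance, threshold_percent=5.0, min_datapoints=3):
    """Return the IDs of instances whose CPU stayed below the threshold.

    cpu_by_instance maps an instance ID to a list of average CPU values.
    Instances with too few datapoints are skipped rather than guessed at.
    """
    idle = []
    for instance_id, averages in cpu_by_instance.items():
        if len(averages) < min_datapoints:
            continue
        if max(averages) < threshold_percent:
            idle.append(instance_id)
    return sorted(idle)


# Hypothetical sample data: one busy, one idle, one with too little history.
sample = {
    "i-busy": [35.0, 60.2, 41.7],
    "i-idle": [0.8, 1.1, 0.5],
    "i-new": [0.2],
}
idle_ids = find_idle_instances(sample)
```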

Feature 6: Scaling & Automation

Integrates with AWS Auto Scaling to adjust resources automatically.
Examples:
- If CPU utilization exceeds 70%, add more EC2 instances.
- If API request rate drops below 100 per second, reduce resources.
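
The two example rules above could be expressed as a small decision function. This only illustrates the logic; in practice an AWS Auto Scaling policy makes this decision, and the function name and the size limits here are hypothetical.

```python
def desired_capacity(current, cpu_percent, requests_per_second,
                     min_size=1, max_size=10):
    """Apply the two example rules and clamp the result to the group limits."""
    if cpu_percent > 70:
        target = current + 1          # scale out under CPU pressure
    elif requests_per_second < 100:
        target = current - 1          # scale in when traffic is low
    else:
        target = current              # otherwise leave the group alone
    return max(min_size, min(max_size, target))
```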

CloudWatch also offers:
- Dashboards: customizable visual representation of monitored metrics.
- Events: automated responses to changes in your AWS environment.

Understanding Metrics and Alarms

What Are Metrics?

Metrics provide data points on AWS resource performance. Some common CloudWatch metrics include:

EC2 Instance CPU Utilization (Percentage of CPU usage)
Memory Utilization (Custom metric, as AWS does not track this by default)
API Requests (Number of API calls made to a service)

What Are Alarms?

Alarms help automate responses to metric changes. For example, if CPU usage exceeds 80%, CloudWatch can:

Send an email notification.
Trigger an Auto Scaling event to add instances.
Restart the EC2 instance.
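
In the API, each of these responses is an ARN attached to the alarm as an alarm action. The sketch below assembles such a list. The helper names are hypothetical, and the Auto Scaling policy ARN is taken as a parameter because it depends on your own policy.

```python
EC2_ACTIONS = {"stop", "terminate", "reboot", "recover"}


def ec2_action_arn(region, action):
    """ARN of a built-in EC2 alarm action."""
    if action not in EC2_ACTIONS:
        raise ValueError(f"unsupported EC2 action: {action}")
    return f"arn:aws:automate:{region}:ec2:{action}"


def sns_topic_arn(region, account_id, topic_name):
    """ARN of an SNS topic, used for e-mail notifications."""
    return f"arn:aws:sns:{region}:{account_id}:{topic_name}"


def alarm_actions(region, account_id, topic_name,
                  scaling_policy_arn=None, reboot=False):
    """Build the list an alarm would receive as its AlarmActions."""
    actions = [sns_topic_arn(region, account_id, topic_name)]
    if scaling_policy_arn:
        actions.append(scaling_policy_arn)
    if reboot:
        actions.append(ec2_action_arn(region, "reboot"))
    return actions
```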

Hands-on Demonstration #1: Configuring a CloudWatch Alarm for EC2 CPU Utilization

First, create an EC2 instance of type t2.micro. We will spike its CPU manually for demo purposes so that we can watch the spike in real time.

By default, an EC2 instance sends metrics to CloudWatch in 5-minute periods (basic monitoring). We will switch to 1-minute periods (detailed monitoring). Here's how:

Select the EC2 instance > Monitoring > Manage detailed monitoring

Tick Enable and confirm.
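
The same switch can be made from code. This is a minimal sketch using boto3's `monitor_instances`; the helper names are hypothetical, and the instance ID is whatever your own instance uses.

```python
def monitoring_states(response):
    """Map each instance ID in a monitor_instances response to its state."""
    return {
        item["InstanceId"]: item["Monitoring"]["State"]
        for item in response["InstanceMonitorings"]
    }


def enable_detailed_monitoring(instance_ids):
    """Turn on 1-minute metrics. Needs boto3 and AWS credentials."""
    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.monitor_instances(InstanceIds=list(instance_ids))
    return monitoring_states(response)
```

The state is typically reported as `pending` right after the call and becomes `enabled` shortly afterwards.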

Create a script on the EC2 instance to generate the CPU spike: vim cpu_spike.py

import time

def simulate_cpu_spike(duration=30, cpu_percent=80):
    """Keep one CPU core busy, then sleep out whatever is left of `duration`.

    The busy loop always runs flat out, so the core sits near 100% while it
    runs. `cpu_percent` only scales how much work is done.
    """
    print(f"Simulating CPU spike at {cpu_percent}%...")
    start_time = time.time()

    # Scale the amount of work by cpu_percent (adjust the constant as needed).
    # With the defaults this is 4,000,000 outer iterations, which takes much
    # longer than `duration` to finish.
    target_percent = cpu_percent / 100
    total_iterations = int(target_percent * 5_000_000)

    # Perform simple arithmetic operations to keep the CPU busy
    for _ in range(total_iterations):
        result = 0
        for i in range(1, 1001):
            result += i

    # Sleep for the rest of the time interval, if any time is left
    elapsed_time = time.time() - start_time
    remaining_time = max(0, duration - elapsed_time)
    time.sleep(remaining_time)

    print("CPU spike simulation completed.")

if __name__ == '__main__':
    # Run with duration=30 and cpu_percent=80
    simulate_cpu_spike(duration=30, cpu_percent=80)

Run the script:

 python3 cpu_spike.py

As you can see, CPU utilization has gone up. Below is the view from EC2 Instance > Monitoring.

The same spike is visible in CloudWatch.

Our Python script successfully pushed the EC2 instance's CPU up to 100%.
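
If you prefer to confirm the spike from code rather than from the console graph, the datapoints can be read with boto3's `get_metric_statistics`. This is a sketch under the assumption that detailed monitoring is on (hence the 60-second period); the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_request(instance_id, minutes_back=30, now=None):
    """Keyword arguments for get_metric_statistics on EC2 CPUUtilization."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes_back),
        "EndTime": end,
        "Period": 60,  # seconds, matches detailed monitoring
        "Statistics": ["Maximum"],
    }


def peak_cpu(datapoints):
    """Highest Maximum value among the datapoints, or None if there are none."""
    return max((point["Maximum"] for point in datapoints), default=None)


def fetch_peak_cpu(instance_id):
    """Query CloudWatch for the recent peak. Needs boto3 and credentials."""
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**build_cpu_request(instance_id))
    return peak_cpu(response["Datapoints"])
```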

Note: The graph can be switched to other view modes as well.

Let’s Note Some Points

In an organisation, if CPU utilization stays at 70-90% on average for 5-10 minutes, that is when we set an alarm, because sustained load like that is a bad sign for a running application. A 100% reading that lasts only 1 second is not a major issue, though it is still worth noting.
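
The difference between a sustained load and a momentary spike can be sketched as a check for consecutive breaching datapoints. CloudWatch alarms express the same idea through their evaluation-period settings; the function below is only an illustration, with a hypothetical name and thresholds.

```python
def is_sustained(datapoints, threshold=70.0, min_consecutive=5):
    """True if `min_consecutive` datapoints in a row are at or above the threshold."""
    streak = 0
    for value in datapoints:
        streak = streak + 1 if value >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# One-minute averages: a single spike versus five busy minutes in a row.
momentary = [100.0, 20.0, 18.0, 22.0, 19.0, 21.0]
sustained = [75.0, 80.0, 90.0, 85.0, 72.0, 30.0]
```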

Let's Define Alarm Conditions

This alarm will notify you when something goes wrong in your EC2 instances or other services.

If the issue is critical, you get the notification on your mobile phone and can fix the problem right away.

Go to CloudWatch > Alarms > In alarm > Create alarm

Select the metric for your alarm (here, the EC2 instance's CPU utilization).

Set the statistic to Average (ideal for production).

Note: For demo purposes we will set the statistic to Maximum, because we cannot wait for a longer period of time.

Use Maximum with a 1-minute period for the demo.

If CPU utilization is greater than or equal to 50%, the alarm will fire.

Enter a name and a message for the alarm.
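
The same alarm can be described in code. This is a sketch using boto3's `put_metric_alarm` with the demo settings above (Maximum, 1-minute period, threshold of 50%). The alarm name, the description, and the SNS topic ARN you pass in are assumptions for illustration.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=50.0):
    """Keyword arguments for put_metric_alarm, matching the demo settings."""
    return {
        "AlarmName": f"demo-cpu-high-{instance_id}",  # hypothetical name
        "AlarmDescription": "CPU utilization reached the demo threshold",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",   # use "Average" for production
        "Period": 60,             # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that sends the e-mail
    }


def create_cpu_alarm(instance_id, topic_arn):
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3

    alarm = build_cpu_alarm(instance_id, topic_arn)
    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

For a production alarm, switching `Statistic` to `Average` and raising `EvaluationPeriods` would make it react to sustained load rather than a single busy minute.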

Check that the alarm has been created.

The alarm's notifications are not active yet: go to your e-mail inbox and confirm the subscription. We used Amazon SNS to send e-mails through AWS.
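
A sketch of the SNS side, assuming boto3: create a topic, subscribe an e-mail address, and check which subscriptions are still waiting for confirmation. The helper names are hypothetical. The check relies on SNS reporting a placeholder string instead of a real ARN until the link in the e-mail has been clicked.

```python
def is_confirmed(subscription_arn):
    """A confirmed subscription has a real ARN, not a pending placeholder."""
    return subscription_arn.startswith("arn:")


def pending_endpoints(subscriptions):
    """E-mail addresses (endpoints) that have not confirmed yet."""
    return [
        sub["Endpoint"]
        for sub in subscriptions
        if not is_confirmed(sub["SubscriptionArn"])
    ]


def subscribe_email(topic_name, email):
    """Create the topic and subscribe an address. Needs boto3 and credentials."""
    import boto3

    sns = boto3.client("sns")
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn
```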

The alarm is now active and will trigger when CPU usage reaches or exceeds 50%.

As you can see, the state is now active and the alarm is ready to send notifications.

Now that the alarm is active, trigger the CPU spike with the Python program.

On the EC2 instance, run:
```
  python3 cpu_spike.py
```

The red line in the graph above is the threshold that triggers the notification.

So far we have explored just one metric, CPU utilization, but there are 1036 metrics available, which shows how powerful CloudWatch is.
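
To see how many metrics exist in your own account, you could page through boto3's `list_metrics` and count them per namespace. This is a sketch with hypothetical helper names; the totals depend on the account and region.

```python
from collections import Counter


def count_by_namespace(pages):
    """Count metrics per namespace across list_metrics result pages."""
    counts = Counter()
    for page in pages:
        for metric in page.get("Metrics", []):
            counts[metric["Namespace"]] += 1
    return counts


def count_available_metrics():
    """Page through every metric in the region. Needs boto3 and credentials."""
    import boto3

    paginator = boto3.client("cloudwatch").get_paginator("list_metrics")
    return count_by_namespace(paginator.paginate())
```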

Note: Dashboards in CloudWatch serve the same tracking purpose. They let you build a clear overview and calculate group metrics inside it.
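
As a sketch of a dashboard with a group metric, the code below builds a dashboard body with one graph widget: the CPU of each instance plus a metric math expression that averages them. The widget title, labels, IDs, and default region are assumptions; the body would be passed to boto3's `put_dashboard`.

```python
import json


def build_dashboard_body(instance_ids, region="us-east-1"):
    """JSON body with one widget: per-instance CPU and the group average."""
    # The first entry is a metric math expression over all metrics in the widget.
    metrics = [[{"expression": "AVG(METRICS())", "label": "Group average", "id": "e1"}]]
    for n, instance_id in enumerate(instance_ids, start=1):
        metrics.append(
            ["AWS/EC2", "CPUUtilization", "InstanceId", instance_id, {"id": f"m{n}"}]
        )
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "metrics": metrics,
            "period": 60,
            "stat": "Average",
            "region": region,
            "view": "timeSeries",
            "title": "EC2 CPU utilization",  # hypothetical title
        },
    }
    return json.dumps({"widgets": [widget]})


def create_dashboard(name, instance_ids):
    """Create or replace the dashboard. Needs boto3 and AWS credentials."""
    import boto3

    body = build_dashboard_body(instance_ids)
    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)
```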
