AWS Cloud Monitoring Made Easy

Pratiksha kadam

When your applications live in the cloud, visibility becomes your lifeline. Modern cloud architectures create complex interdependencies where a single failing component can cascade into system-wide outages.

You need to know what's happening across your AWS infrastructure before your users experience any impact. The challenge lies not just in collecting data, but in transforming that data into actionable insights that drive operational excellence.

Today, let's explore how to set up comprehensive monitoring using two industry leaders: Datadog and Splunk, and discover how their complementary strengths can create a robust observability strategy.

Why Monitor AWS Applications?

Cloud applications are distributed by nature, creating a complex web of interconnected services that span multiple availability zones, regions, and service boundaries.

Your web servers might be in one availability zone, your database in another, and your cache layer somewhere else entirely, each with its own performance characteristics and failure modes.

Without proper monitoring, troubleshooting becomes guesswork, leading to extended downtime and frustrated users. The ephemeral nature of cloud resources means instances can terminate unexpectedly, auto-scaling events can mask underlying issues, and microservices can fail silently while appearing healthy from the outside. You need metrics, logs, and traces working together to paint the complete picture of your application's health, performance bottlenecks, and user experience impact across the entire technology stack.

Getting Started with Datadog

Datadog excels at infrastructure monitoring and application performance tracking, offering a unified platform that correlates metrics, traces, and logs in real time. Its strength lies in providing immediate visibility into system performance with minimal configuration overhead, making it ideal for fast-moving development teams. The platform's machine learning capabilities can automatically detect anomalies, predict capacity issues, and suggest optimization opportunities based on historical patterns. Here's how to get your AWS resources talking to Datadog and unlock comprehensive observability:

Step 1: Install the Datadog Agent

For AWS EC2 instances, the agent installation is straightforward:

DD_API_KEY=your_api_key bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

Step 2: Configure AWS Integration

Create an IAM role with the necessary permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:List*",
        "cloudwatch:Get*",
        "ec2:Describe*",
        "support:*",
        "tag:GetResources"
      ],
      "Resource": "*"
    }
  ]
}

This integration automatically pulls metrics from CloudWatch, giving you visibility into EC2, RDS, Lambda, and other AWS services without additional configuration.
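
If you would rather script this step than click through the IAM console, the same role can be created with boto3. Below is a minimal sketch, assuming you already have the Datadog account ID and external ID from the integration tile; the role and policy names are arbitrary placeholders.

# create_datadog_role.py - minimal sketch; role and policy names are placeholders
import json

import boto3

iam = boto3.client("iam")

# Take these two values from the Datadog AWS integration tile when adding the account
DATADOG_AWS_ACCOUNT_ID = "<datadog-account-id>"
EXTERNAL_ID = "<external-id-from-datadog>"
ROLE_NAME = "DatadogIntegrationRole"

# Trust policy: let Datadog's account assume the role, gated by the external ID
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{DATADOG_AWS_ACCOUNT_ID}:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

# Permissions policy: the same read permissions shown above
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "cloudwatch:List*",
            "cloudwatch:Get*",
            "ec2:Describe*",
            "support:*",
            "tag:GetResources",
        ],
        "Resource": "*",
    }],
}

iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="DatadogIntegrationPolicy",
    PolicyDocument=json.dumps(permissions_policy),
)
print(f"Created role {ROLE_NAME} for the Datadog integration")

Once the role exists, finish the setup back in the Datadog integration tile by pointing it at this role.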

Step 3: Set Up Application Monitoring

For application-level insights, instrument your code with Datadog's APM:

from flask import Flask
from ddtrace import tracer
from ddtrace.contrib.flask import TraceMiddleware

# Note: newer ddtrace releases replace TraceMiddleware with automatic patching
# (running under ddtrace-run or calling patch_all()).
app = Flask(__name__)
TraceMiddleware(app, tracer, service="my-web-app")

Leveraging Splunk for Log Analysis

While Datadog handles metrics beautifully, Splunk shines when you need deep log analysis and custom searches across massive datasets.

Step 1: Configure Log Forwarding

Set up the Splunk Universal Forwarder on your EC2 instances:

# Run from the forwarder's bin directory (e.g. /opt/splunkforwarder/bin)
./splunk start --accept-license
./splunk add forward-server your-splunk-server:9997
./splunk add monitor /var/log/application/

Step 2: Create Custom Dashboards

Splunk's search language lets you create powerful queries:

index=application_logs level=ERROR
| bin _time span=5m
| stats count by _time, source
| where count > 50

This query flags five-minute windows in which any source logs more than 50 errors, surfacing error spikes across your applications in near real time.

Best Practices for AWS Monitoring

Tag Everything: Use consistent tagging across AWS resources. Both Datadog and Splunk can filter and group data based on tags, making troubleshooting much easier.
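
For example, here is a small boto3 sketch that applies one consistent tag set to an instance; the tag keys, values, and instance ID are placeholders.

# tag_resources.py - illustrative only; tag keys, values, and the instance ID are placeholders
import boto3

ec2 = boto3.client("ec2")

# Apply the same tag set everywhere so Datadog and Splunk can filter and group by it
ec2.create_tags(
    Resources=["i-1234567890abcdef0"],
    Tags=[
        {"Key": "env", "Value": "production"},
        {"Key": "service", "Value": "my-web-app"},
        {"Key": "team", "Value": "platform"},
    ],
)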

Set Meaningful Alerts: Don't alert on everything. Focus on metrics that directly impact user experience - response times, error rates, and business-critical processes.

Monitor the Full Stack: Track infrastructure metrics (CPU, memory), application metrics (response time, throughput), and business metrics (user signups, revenue) together.

Use Custom Metrics: Both platforms support custom metrics. Track what matters to your specific application - queue lengths, cache hit rates, or API call success rates.
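
As a sketch of what that looks like with Datadog's Python client (datadogpy), custom metrics can be pushed through the agent's DogStatsD listener; the metric names and tags below are hypothetical.

# custom_metrics.py - illustrative sketch; metric names and tags are hypothetical
from datadog import initialize, statsd

# DogStatsD ships with the Datadog Agent and listens on UDP 8125 by default
initialize(statsd_host="localhost", statsd_port=8125)

def record_checkout(queue_length: int, cache_hit: bool) -> None:
    # Gauge: a point-in-time value such as queue depth
    statsd.gauge("myapp.checkout.queue_length", queue_length, tags=["env:production"])
    # Counter: incremented per event, e.g. cache hits vs. misses
    statsd.increment("myapp.cache.requests", tags=[f"hit:{str(cache_hit).lower()}"])

record_checkout(queue_length=17, cache_hit=True)

DogStatsD submissions go over UDP to the local agent, so these calls add negligible overhead to the request path.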

Bringing It All Together

The real power comes from using both tools together. Datadog provides real-time infrastructure and application monitoring with beautiful visualizations. Splunk offers deep log analysis and complex event correlation.

When an alert fires in Datadog, you can immediately jump to Splunk to analyze the related log data. This combination gives you both the early warning system and the detailed forensic capabilities you need to maintain reliable cloud applications.

Remember: monitoring isn't about collecting every possible metric. It's about having the right information at the right time to make informed decisions about your applications' health and performance.

Advanced Configuration and Automation

To truly scale your monitoring setup, automation becomes essential. Let's explore advanced configurations and deployment scripts that will save you hours of manual work.

Automated Datadog Agent Deployment

Here's a comprehensive script to deploy Datadog agents across multiple EC2 instances:

#!/bin/bash
# datadog-deploy.sh - Mass deployment script for Datadog agents

DATADOG_API_KEY="your_api_key_here"
INSTANCE_IDS=("i-1234567890abcdef0" "i-0987654321fedcba0")

for instance_id in "${INSTANCE_IDS[@]}"; do
    echo "Deploying Datadog agent to instance: $instance_id"

    aws ssm send-command \
        --instance-ids "$instance_id" \
        --document-name "AWS-RunShellScript" \
        --parameters "commands=[
            'DD_API_KEY=$DATADOG_API_KEY bash -c \"\$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)\"',
            'sudo systemctl enable datadog-agent',
            'sudo systemctl start datadog-agent',
            'echo \"Datadog agent deployed successfully on $instance_id\"'
        ]" \
        --output text
done

Container Monitoring Setup

For containerized applications, monitoring becomes more complex. Here's a Docker Compose configuration that includes monitoring:

#!/bin/bash
# container-monitoring-setup.sh - Setup monitoring for containerized apps

cat > docker-compose.monitoring.yml << 'EOF'
version: '3.8'
services:
  datadog-agent:
    image: datadog/agent:latest
    environment:
      - DD_API_KEY=${DATADOG_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_CONTAINER_EXCLUDE=name:datadog-agent
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /opt/datadog-agent/run:/opt/datadog-agent/run:rw
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    networks:
      - monitoring

  app:
    build: .
    environment:
      - DD_TRACE_AGENT_HOSTNAME=datadog-agent
      - DD_ENV=production
      - DD_SERVICE=my-web-app
    depends_on:
      - datadog-agent
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
EOF

echo "Starting monitoring stack..."
docker-compose -f docker-compose.monitoring.yml up -d

CloudWatch Integration Script

Automate the setup of CloudWatch log groups and streams for Splunk ingestion:

#!/bin/bash
# cloudwatch-splunk-integration.sh - Setup CloudWatch to Splunk integration

SPLUNK_HEC_ENDPOINT="https://your-splunk-instance:8088/services/collector"
SPLUNK_HEC_TOKEN="your-hec-token"
LOG_GROUP_NAME="/aws/lambda/my-application"

# Create CloudWatch log group
aws logs create-log-group --log-group-name "$LOG_GROUP_NAME"

# Allow CloudWatch Logs to invoke the processing Lambda (the function must already exist)
aws lambda add-permission \
    --function-name splunk-cloudwatch-processor \
    --statement-id cloudwatch-logs-invoke \
    --principal logs.amazonaws.com \
    --action lambda:InvokeFunction \
    --source-arn "arn:aws:logs:us-east-1:123456789012:log-group:${LOG_GROUP_NAME}:*"

# Create subscription filter that streams the log group to the Lambda
aws logs put-subscription-filter \
    --log-group-name "$LOG_GROUP_NAME" \
    --filter-name "SplunkSubscriptionFilter" \
    --filter-pattern "" \
    --destination-arn "arn:aws:lambda:us-east-1:123456789012:function:splunk-cloudwatch-processor"

# Write the Lambda handler source (package and deploy it separately, e.g. with aws lambda create-function)
cat > lambda-function.py << 'EOF'
import os
import json
import gzip
import base64
import requests  # not in the default Lambda runtime; bundle it in the deployment package

# Configure these as Lambda environment variables when deploying the function
SPLUNK_HEC_ENDPOINT = os.environ['SPLUNK_HEC_ENDPOINT']
SPLUNK_HEC_TOKEN = os.environ['SPLUNK_HEC_TOKEN']

def lambda_handler(event, context):
    compressed_payload = base64.b64decode(event['awslogs']['data'])
    uncompressed_payload = gzip.decompress(compressed_payload)
    log_data = json.loads(uncompressed_payload)

    events = []
    for log_event in log_data['logEvents']:
        events.append({
            'time': log_event['timestamp'] / 1000,
            'event': log_event['message'],
            'source': log_data['logGroup'],
            'sourcetype': 'aws:cloudwatch'
        })

    headers = {
        'Authorization': f'Splunk {SPLUNK_HEC_TOKEN}',
        'Content-Type': 'application/json'
    }

    for event in events:
        response = requests.post(SPLUNK_HEC_ENDPOINT, 
                               headers=headers, 
                               data=json.dumps(event))

    return {'statusCode': 200}
EOF

echo "CloudWatch to Splunk integration configured successfully"

Health Check and Alerting Script

Create comprehensive health checks across your monitoring infrastructure:

#!/bin/bash
# monitoring-health-check.sh - Verify monitoring stack health

DATADOG_API_KEY="your_api_key"
DATADOG_APP_KEY="your_app_key"
SPLUNK_ENDPOINT="https://your-splunk-instance:8089"
SPLUNK_TOKEN="your_splunk_token"

echo "=== Monitoring Stack Health Check ==="

# Check Datadog API connectivity
datadog_status=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "DD-API-KEY: $DATADOG_API_KEY" \
    -H "DD-APPLICATION-KEY: $DATADOG_APP_KEY" \
    "https://api.datadoghq.com/api/v1/validate")

if [ "$datadog_status" -eq 200 ]; then
    echo "✓ Datadog API: Connected"
else
    echo "✗ Datadog API: Connection failed (HTTP $datadog_status)"
fi

# Check Splunk connectivity
splunk_status=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "Authorization: Bearer $SPLUNK_TOKEN" \
    "$SPLUNK_ENDPOINT/services/server/info")

if [ "$splunk_status" -eq 200 ]; then
    echo "✓ Splunk API: Connected"
else
    echo "✗ Splunk API: Connection failed (HTTP $splunk_status)"
fi

# Check agent status on local machine
if systemctl is-active --quiet datadog-agent; then
    echo "✓ Datadog Agent: Running"
else
    echo "✗ Datadog Agent: Not running"
fi

# Check disk space for logs
log_disk_usage=$(df /var/log | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$log_disk_usage" -lt 80 ]; then
    echo "✓ Log Disk Usage: ${log_disk_usage}% (healthy)"
else
    echo "⚠ Log Disk Usage: ${log_disk_usage}% (warning)"
fi

echo "=== Health Check Complete ==="

Monitoring Cost Optimization

AWS monitoring can become expensive quickly. Here are strategies to optimize costs while maintaining visibility:

Selective Metric Collection: Don't collect every available metric. Focus on business-critical indicators and use Datadog's metric filters to reduce ingestion costs.

Log Retention Policies: Implement intelligent log retention. Keep recent logs in real-time systems and archive older data to S3 for compliance.
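
To illustrate the archival half, an S3 lifecycle rule can shift aged log exports to colder storage and eventually expire them. The bucket name, prefix, and day counts in this boto3 sketch are assumptions.

# log_lifecycle.py - sketch only; bucket, prefix, and retention windows are assumptions
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-log-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "exported-logs/"},
            "Status": "Enabled",
            # Move logs to Glacier after 30 days, delete them after a year
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)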

Sampling Strategies: For high-traffic applications, implement trace sampling in Datadog APM. Sample 10% of traces instead of 100% to reduce costs while maintaining statistical significance.
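
In Datadog APM the sample rate is typically set through configuration such as the DD_TRACE_SAMPLE_RATE environment variable. Conceptually, head-based sampling makes a single keep-or-drop decision per trace and propagates it downstream, as in this vendor-neutral sketch (the decorator is illustrative, not Datadog's API).

# sampling_sketch.py - vendor-neutral illustration of head-based trace sampling
import functools
import random

SAMPLE_RATE = 0.10  # keep roughly 10% of traces

def sampled_trace(func):
    """Decide once, at the root of the request, whether to record this trace."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        keep = random.random() < SAMPLE_RATE
        # A real tracer attaches this decision to the trace context and propagates it
        # downstream so the entire distributed trace is either kept or dropped.
        if keep:
            print(f"recording trace for {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@sampled_trace
def handle_request():
    return "ok"

handle_request()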

Troubleshooting Common Issues

Agent Connection Problems: If agents can't reach Datadog, check security groups and NACLs. The agent needs outbound HTTPS access to Datadog's endpoints.

Missing Metrics: Verify IAM permissions for the CloudWatch integration. The integration role needs proper read permissions for all services you want to monitor.

Log Ingestion Delays: Splunk Universal Forwarders can fall behind during high log volume. Configure proper indexer clustering and monitor forwarder queues.

Scaling Your Monitoring Strategy

As your AWS infrastructure grows, your monitoring needs will evolve. Plan for:

Multi-Account Strategy: Use Datadog's AWS integration across multiple accounts. Set up cross-account roles for centralized monitoring.

Custom Metrics: Develop application-specific metrics that align with business objectives. Monitor user experience, not just system performance.

Automated Remediation: Integrate monitoring alerts with AWS Lambda or Systems Manager for automatic incident response.
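
To sketch that last point, here is a hypothetical Lambda handler that receives an alert payload and restarts the affected service through SSM Run Command; the event shape, service name, and function wiring are assumptions.

# auto_remediate.py - hypothetical remediation handler; the alert payload shape
# and the service name are assumptions
import boto3

ssm = boto3.client("ssm")

def lambda_handler(event, context):
    # Assume the monitoring alert forwards the affected instance ID in its payload
    instance_id = event.get("instance_id")
    if not instance_id:
        return {"statusCode": 400, "body": "no instance_id in alert payload"}

    # Restart the application service on the affected instance via SSM Run Command
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl restart my-web-app"]},
    )
    return {"statusCode": 200, "body": f"restart issued for {instance_id}"}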

The key to successful cloud monitoring is continuous iteration. Start with basic coverage, then expand based on your team's actual troubleshooting patterns and business requirements.

To address log ingestion delays in Splunk due to high log volume from Universal Forwarders, follow these practical steps:


✅ 1. Implement Indexer Clustering

Ensure your Splunk deployment can scale with incoming data volume.

  • Enable indexer clustering (search head + multiple indexers with replication).

  • This distributes load, improves fault tolerance, and speeds up indexing.

Steps:

  1. In server.conf on each indexer:

     [clustering]
     mode = slave
     master_uri = https://<cluster-master>:8089
     pass4SymmKey = <shared-cluster-secret>
     replication_port = 9887
    
  2. Restart Splunk on indexers and connect them to the cluster master node.


✅ 2. Monitor Forwarder Queues

Universal Forwarders use internal queues (parsingQueue, aggregationQueue, etc.)—monitor them to detect bottlenecks.

Queue metrics are written to metrics.log on forwarders and indexers by default, so no extra input configuration is needed.

Search the internal logs to visualize queue sizes:

     index=_internal source=*metrics.log group=queue
     | timechart avg(current_size) by name
    

✅ 3. Tune Forwarder Settings

Improve buffer and throughput settings on Universal Forwarders.

  • Modify outputs.conf:

      [tcpout]
      maxQueueSize = 512MB
      autoLBFrequency = 30
      sendCookedData = true
    
  • Consider increasing pipeline batch sizes or buffering thresholds.


✅ 4. Enable Load Balancing

If you have multiple indexers, ensure Universal Forwarders load balance across them.

[tcpout:my_indexers]
server = indexer1:9997,indexer2:9997,indexer3:9997
autoLBFrequency = 30

✅ 5. Review and Reduce Noise

High log volume often includes excessive noise.

  • Use props.conf and transforms.conf to route noisy events to the nullQueue before they are indexed (apply this on a heavy forwarder or at the indexing tier, since Universal Forwarders do not parse events).

  • Only forward necessary logs, especially during peak hours.


✅ 6. Scale Horizontally

If ingestion volume continues to rise:

  • Add more indexers to distribute load.

  • Use heavy forwarders for pre-processing if transformation is needed.


The following Python script connects to a Splunk instance using the Splunk REST API, queries the internal metrics logs, and monitors queue delays (e.g., parsingQueue, indexingQueue) from Universal Forwarders in near real time.


Python Script: Splunk Queue Monitoring

import requests
import urllib3
import json
import time

# Suppress SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# === Splunk API Configuration ===
SPLUNK_HOST = "https://your-splunk-host:8089"
USERNAME = "admin"
PASSWORD = "your_password"
QUERY = """search index=_internal source=*metrics.log group=queue | stats avg(current_size) by name"""

# === Authenticate to Splunk ===
def get_session_key():
    url = f"{SPLUNK_HOST}/services/auth/login"
    data = {"username": USERNAME, "password": PASSWORD}
    response = requests.post(url, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Login failed: {response.text}")
    session_key = response.text.split("<sessionKey>")[1].split("</sessionKey>")[0]
    return session_key

# === Run Search Job ===
def run_search(session_key, query):
    headers = {
        "Authorization": f"Splunk {session_key}"
    }
    search_url = f"{SPLUNK_HOST}/services/search/jobs"
    data = {
        "search": query,
        "exec_mode": "blocking",
        "output_mode": "json"
    }

    response = requests.post(search_url, headers=headers, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Search failed: {response.text}")

    # A blocking job returns its search id (sid); build the results endpoint from it
    sid = response.json()["sid"]
    return f"/services/search/jobs/{sid}/results"

# === Fetch and Display Results ===
def fetch_results(session_key, results_url):
    headers = {
        "Authorization": f"Splunk {session_key}"
    }
    full_url = f"{SPLUNK_HOST}{results_url}?output_mode=json"
    response = requests.get(full_url, headers=headers, verify=False)
    if response.status_code != 200:
        raise Exception(f"Failed to get results: {response.text}")
    results = response.json().get("results", [])
    return results

# === Main Logic ===
if __name__ == "__main__":
    print("🔍 Connecting to Splunk and monitoring forwarder queues...")
    try:
        session_key = get_session_key()
        results_url = run_search(session_key, QUERY)
        results = fetch_results(session_key, results_url)

        print("\n📊 Queue Sizes (avg):\n")
        for entry in results:
            queue = entry.get("name")
            avg_size = entry.get("avg(current_size)")
            print(f" - {queue}: {avg_size}")

    except Exception as e:
        print(f"❌ Error: {e}")

🔧 Prerequisites

  • Enable Splunk REST API on port 8089.

  • Ensure index=_internal is retained and not filtered.

  • The user must have the search capability and permission to read the _internal index.


📈 Output Example

📊 Queue Sizes (avg):

 - parsingQueue: 0.05
 - indexingQueue: 1.23
 - typingQueue: 0.00

Let's automate monitoring with a script that runs on a schedule (every 60 seconds by default) and sends alerts if any Splunk queue size exceeds a defined threshold.

Scheduled Splunk Queue Monitoring with Alerting (Python)

import requests
import urllib3
import json
import time
from datetime import datetime

# Suppress SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# === CONFIGURATION ===
SPLUNK_HOST = "https://your-splunk-host:8089"
USERNAME = "admin"
PASSWORD = "your_password"
QUEUE_THRESHOLD = 3.0  # alert if avg(current_size) > this value
INTERVAL_SECONDS = 60  # how often to run the check

# Splunk SPL query
QUERY = """search index=_internal source=*metrics.log group=queue earliest=-5m
           | stats avg(current_size) as avg_size by name
           | where avg_size > 0"""

# === Authenticate to Splunk ===
def get_session_key():
    url = f"{SPLUNK_HOST}/services/auth/login"
    data = {"username": USERNAME, "password": PASSWORD}
    response = requests.post(url, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Login failed: {response.text}")
    return response.text.split("<sessionKey>")[1].split("</sessionKey>")[0]

# === Run a Splunk search job ===
def run_search(session_key, query):
    headers = {"Authorization": f"Splunk {session_key}"}
    search_url = f"{SPLUNK_HOST}/services/search/jobs/export"
    data = {
        "search": query,
        "output_mode": "json"
    }
    response = requests.post(search_url, headers=headers, data=data, stream=True, verify=False)
    if response.status_code != 200:
        raise Exception(f"Search failed: {response.text}")

    # Parse streaming JSON output
    results = []
    for line in response.iter_lines():
        if line:
            json_line = json.loads(line.decode('utf-8'))
            if 'result' in json_line:
                results.append(json_line['result'])
    return results

# === Alert Logic ===
def check_for_alerts(results):
    alerts = []
    for entry in results:
        queue = entry.get("name")
        avg_size = float(entry.get("avg_size", 0))
        if avg_size > QUEUE_THRESHOLD:
            alerts.append((queue, avg_size))
    return alerts

# === Main Loop ===
if __name__ == "__main__":
    print(f"🔁 Starting Splunk queue monitoring every {INTERVAL_SECONDS} seconds...\n")
    try:
        # Splunk session keys expire (60 minutes by default); a long-running monitor
        # should re-authenticate periodically or on HTTP 401 responses.
        session_key = get_session_key()
        while True:
            print(f"⏱️  [{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] Checking queue sizes...")
            results = run_search(session_key, QUERY)
            alerts = check_for_alerts(results)

            if alerts:
                print("🚨 ALERT: Queue thresholds exceeded:")
                for queue, size in alerts:
                    print(f" - {queue}: avg size = {size:.2f}")
            else:
                print("✅ All queues healthy.\n")

            time.sleep(INTERVAL_SECONDS)
    except Exception as e:
        print(f"❌ Error: {e}")

📦 Output Example

⏱️  [2025-08-01 15:00:00] Checking queue sizes...
🚨 ALERT: Queue thresholds exceeded:
 - indexingQueue: avg size = 6.21

🔔 Alerting Options

  • Send an email with smtplib (see the sketch after this list)

  • Trigger a PagerDuty or Datadog event

  • Log to a file or external system
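
For the email option above, here is a minimal smtplib sketch; the SMTP host, credentials, and addresses are placeholders.

# email_alert.py - minimal sketch; SMTP host, credentials, and addresses are placeholders
import smtplib
from email.message import EmailMessage

def send_queue_alert(queue: str, avg_size: float) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Splunk queue alert: {queue} avg size {avg_size:.2f}"
    msg["From"] = "monitoring@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(
        f"The {queue} queue exceeded the threshold (average size {avg_size:.2f}). "
        "Please investigate the forwarder and indexer tier."
    )
    # STARTTLS on port 587 is typical; adjust for your mail relay
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("monitoring@example.com", "your_smtp_password")
        server.send_message(msg)

send_queue_alert("indexingQueue", 6.21)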


🔔 Triggering External Alerts: PagerDuty and Datadog Events

Once you detect a queue backlog or a service failure, triggering an alert in your incident management system helps notify the right teams quickly.


🛠️ Trigger a PagerDuty Event

✅ Bash Script – Send PagerDuty Incident

#!/bin/bash

PD_ROUTING_KEY="your_integration_key"  
# From Events V2 integration
SUMMARY="High log queue detected on Splunk Forwarder"
SOURCE="splunk-monitor"
SEVERITY="critical"

curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "routing_key": "$PD_ROUTING_KEY",
  "event_action": "trigger",
  "payload": {
    "summary": "$SUMMARY",
    "source": "$SOURCE",
    "severity": "$SEVERITY"
  }
}
EOF

📝 Replace your_integration_key with your Events API v2 routing key from PagerDuty.


✅ Python Script to Send PagerDuty Incident

import requests
import json

PD_ROUTING_KEY = "your_integration_key"
PD_ENDPOINT = "https://events.pagerduty.com/v2/enqueue"

payload = {
    "routing_key": PD_ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "Splunk queue exceeds threshold",
        "source": "splunk-forwarder-monitor",
        "severity": "critical"
    }
}

response = requests.post(PD_ENDPOINT, json=payload)
print("PagerDuty response:", response.status_code, response.text)

🐶 Trigger a Datadog Event

✅ Bash Script – Send Datadog Event

#!/bin/bash

API_KEY="your_datadog_api_key"
TITLE="Splunk Forwarder Alert"
TEXT="One or more queues are delayed"
PRIORITY="normal"
ALERT_TYPE="error"

curl -X POST "https://api.datadoghq.com/api/v1/events?api_key=${API_KEY}" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "title": "$TITLE",
  "text": "$TEXT",
  "priority": "$PRIORITY",
  "alert_type": "$ALERT_TYPE",
  "tags": ["source:splunk", "component:forwarder"]
}
EOF

✅ Python Script – Send Datadog Event

import requests

API_KEY = "your_datadog_api_key"
url = f"https://api.datadoghq.com/api/v1/events?api_key={API_KEY}"

data = {
    "title": "Splunk Forwarder Delay Detected",
    "text": "One or more forwarder queues exceeded the threshold.",
    "priority": "normal",
    "alert_type": "error",
    "tags": ["splunk", "monitoring", "forwarder"]
}

response = requests.post(url, json=data)
print("Datadog response:", response.status_code, response.text)

🧠 Best Practices for Event Triggering

  • Deduplicate Events: Use a consistent dedup_key (PagerDuty) or event title (Datadog) to avoid flooding; see the dedup_key sketch after this list.

  • Severity Mapping: Adjust severity or alert_type based on queue size or system impact.

  • Rate Limits: Both APIs have rate limits. Don’t trigger events in tight loops; use cool-down intervals.
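
To illustrate the deduplication point, the PagerDuty Events API v2 accepts a top-level dedup_key; reusing the same key per queue collapses repeat triggers into a single incident. The key format below is just an assumed convention.

# pagerduty_dedup.py - sketch of dedup_key reuse; the routing key and key format are assumptions
import requests

PD_ROUTING_KEY = "your_integration_key"
PD_ENDPOINT = "https://events.pagerduty.com/v2/enqueue"

def trigger_queue_alert(queue: str, avg_size: float) -> None:
    payload = {
        "routing_key": PD_ROUTING_KEY,
        "event_action": "trigger",
        # Same dedup_key per queue: repeat triggers update one incident instead of opening new ones
        "dedup_key": f"splunk-queue-{queue}",
        "payload": {
            "summary": f"Splunk {queue} average size {avg_size:.2f} exceeds threshold",
            "source": "splunk-forwarder-monitor",
            "severity": "critical",
        },
    }
    response = requests.post(PD_ENDPOINT, json=payload)
    response.raise_for_status()

trigger_queue_alert("indexingQueue", 6.21)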


✅ Bringing It All Together

Integrating your Splunk monitoring scripts with Datadog or PagerDuty enables you to automate incident notification at the moment a problem arises. When a queue exceeds a threshold or a log delay is detected:

  1. Your monitoring script checks the condition.

  2. It triggers an event in your incident response tool.

  3. Your team is notified in real time to investigate or auto-remediate.
