AWS Cloud Monitoring Made Easy

Table of contents
- Why Monitor AWS Applications?
- Getting Started with Datadog
- Leveraging Splunk for Log Analysis
- Best Practices for AWS Monitoring
- Bringing It All Together
- Advanced Configuration and Automation
- Monitoring Cost Optimization
- Troubleshooting Common Issues
- Scaling Your Monitoring Strategy
- ✅ 1. Implement Indexer Clustering
- ✅ 2. Monitor Forwarder Queues
- ✅ 3. Tune Forwarder Settings
- ✅ 4. Enable Load Balancing
- ✅ 5. Review and Reduce Noise
- ✅ 6. Scale Horizontally
- ✅ Python Script: Splunk Queue Monitoring
- 🔧 Prerequisites
- 📈 Output Example
- ✅ Scheduled Splunk Queue Monitoring with Alerting (Python)
- 📦 Output Example
- 🔔 Alerting Options
- 🔔 Triggering External Alerts: PagerDuty and Datadog Events
- 🧠 Best Practices for Event Triggering
- ✅ Bringing It All Together
When your applications live in the cloud, visibility becomes your lifeline. Modern cloud architectures create complex interdependencies where a single failing component can cascade into system-wide outages.
You need to know what's happening across your AWS infrastructure before your users experience any impact. The challenge lies not just in collecting data, but in transforming that data into actionable insights that drive operational excellence.
Today, let's explore how to set up comprehensive monitoring using two industry leaders: Datadog and Splunk, and discover how their complementary strengths can create a robust observability strategy.
Why Monitor AWS Applications?
Cloud applications are distributed by nature, creating a complex web of interconnected services that span multiple availability zones, regions, and service boundaries.
Your web servers might be in one availability zone, your database in another, and your cache layer somewhere else entirely, each with its own performance characteristics and failure modes.
Without proper monitoring, troubleshooting becomes guesswork, leading to extended downtime and frustrated users. The ephemeral nature of cloud resources means instances can terminate unexpectedly, auto-scaling events can mask underlying issues, and microservices can fail silently while appearing healthy from the outside. You need metrics, logs, and traces working together to paint the complete picture of your application's health, performance bottlenecks, and user experience impact across the entire technology stack.
Getting Started with Datadog
Datadog excels at infrastructure monitoring and application performance tracking, offering a unified platform that correlates metrics, traces, and logs in real-time. Its strength lies in providing immediate visibility into system performance with minimal configuration overhead, making it ideal for fast-moving development teams. The platform's machine learning capabilities can automatically detect anomalies, predict capacity issues, and suggest optimization opportunities based on historical patterns. Here's how to get your AWS resources talking to Datadog and unlock comprehensive observability:
Step 1: Install the Datadog Agent
For AWS EC2 instances, the agent installation is straightforward:
DD_API_KEY=your_api_key bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
Step 2: Configure AWS Integration
Create an IAM role with the necessary permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:List*",
        "cloudwatch:Get*",
        "ec2:Describe*",
        "support:*",
        "tag:GetResources"
      ],
      "Resource": "*"
    }
  ]
}
This integration automatically pulls metrics from CloudWatch, giving you visibility into EC2, RDS, Lambda, and other AWS services without additional configuration.
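Once the role is in place, a quick way to confirm that CloudWatch metrics are flowing is to query one back through Datadog's API. Here's a minimal sketch using the datadog Python client; the metric name, tag scope, and placeholder keys are assumptions for illustration:
# Sketch: verify the AWS integration by querying a CloudWatch-sourced metric
# via the Datadog API. Assumes the `datadog` Python package and placeholder keys;
# adjust the metric/tags to something your account actually collects.
import time
from datadog import initialize, api

initialize(api_key="your_api_key", app_key="your_app_key")

now = int(time.time())
results = api.Metric.query(
    start=now - 3600,                       # last hour
    end=now,
    query="avg:aws.ec2.cpuutilization{*}",  # example CloudWatch-sourced metric
)
for series in results.get("series", []):
    print(series.get("metric"), series.get("scope"), len(series.get("pointlist", [])), "points")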
Step 3: Set Up Application Monitoring
For application-level insights, instrument your code with Datadog's APM:
from flask import Flask
from ddtrace import tracer
from ddtrace.contrib.flask import TraceMiddleware  # older ddtrace releases; newer ones use patch_all() or ddtrace-run

app = Flask(__name__)
TraceMiddleware(app, tracer, service="my-web-app")
Leveraging Splunk for Log Analysis
While Datadog handles metrics beautifully, Splunk shines when you need deep log analysis and custom searches across massive datasets.
Step 1: Configure Log Forwarding
Set up the Splunk Universal Forwarder on your EC2 instances:
# Run these from $SPLUNK_HOME/bin (typically /opt/splunkforwarder/bin)
./splunk start --accept-license
./splunk add forward-server your-splunk-server:9997
./splunk add monitor /var/log/application/
Step 2: Create Custom Dashboards
Splunk's search language lets you create powerful queries:
index=application_logs level=ERROR
| bin _time span=5m
| stats count by _time, source
| where count > 50
This query surfaces five-minute windows in which any single source logs more than 50 errors, highlighting error spikes in near real time.
Best Practices for AWS Monitoring
Tag Everything: Use consistent tagging across AWS resources. Both Datadog and Splunk can filter and group data based on tags, making troubleshooting much easier.
Set Meaningful Alerts: Don't alert on everything. Focus on metrics that directly impact user experience - response times, error rates, and business-critical processes.
Monitor the Full Stack: Track infrastructure metrics (CPU, memory), application metrics (response time, throughput), and business metrics (user signups, revenue) together.
Use Custom Metrics: Both platforms support custom metrics. Track what matters to your specific application - queue lengths, cache hit rates, or API call success rates.
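As a quick illustration, here's a minimal sketch of emitting such custom metrics from Python through the local Datadog agent's DogStatsD listener; the metric names and tags are assumptions for illustration:
# Sketch: emit custom application metrics via DogStatsD
# (assumes the `datadog` Python package and an agent listening on localhost:8125).
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Hypothetical metric names - track whatever matters to your application
statsd.gauge("myapp.queue.length", 42, tags=["env:production", "service:my-web-app"])
statsd.increment("myapp.cache.hit", tags=["cache:redis"])
statsd.histogram("myapp.api.call_duration", 0.254, tags=["endpoint:/checkout"])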
Bringing It All Together
The real power comes from using both tools together. Datadog provides real-time infrastructure and application monitoring with beautiful visualizations. Splunk offers deep log analysis and complex event correlation.
When an alert fires in Datadog, you can immediately jump to Splunk to analyze the related log data. This combination gives you both the early warning system and the detailed forensic capabilities you need to maintain reliable cloud applications.
Remember: monitoring isn't about collecting every possible metric. It's about having the right information at the right time to make informed decisions about your applications' health and performance.
Advanced Configuration and Automation
To truly scale your monitoring setup, automation becomes essential. Let's explore advanced configurations and deployment scripts that will save you hours of manual work.
Automated Datadog Agent Deployment
Here's a comprehensive script to deploy Datadog agents across multiple EC2 instances:
#!/bin/bash
# datadog-deploy.sh - Mass deployment script for Datadog agents
DATADOG_API_KEY="your_api_key_here"
INSTANCE_IDS=("i-1234567890abcdef0" "i-0987654321fedcba0")
for instance_id in "${INSTANCE_IDS[@]}"; do
  echo "Deploying Datadog agent to instance: $instance_id"
  aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters "commands=[
      'DD_API_KEY=$DATADOG_API_KEY bash -c \"\$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)\"',
      'sudo systemctl enable datadog-agent',
      'sudo systemctl start datadog-agent',
      'echo \"Datadog agent deployed successfully on $instance_id\"'
    ]" \
    --output text
done
Container Monitoring Setup
For containerized applications, monitoring becomes more complex. Here's a Docker Compose configuration that includes monitoring:
#!/bin/bash
# container-monitoring-setup.sh - Setup monitoring for containerized apps
cat > docker-compose.monitoring.yml << 'EOF'
version: '3.8'

services:
  datadog-agent:
    image: datadog/agent:latest
    environment:
      - DD_API_KEY=${DATADOG_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_CONTAINER_EXCLUDE="name:datadog-agent"
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /opt/datadog-agent/run:/opt/datadog-agent/run:rw
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    networks:
      - monitoring

  app:
    build: .
    environment:
      - DD_TRACE_AGENT_HOSTNAME=datadog-agent
      - DD_ENV=production
      - DD_SERVICE=my-web-app
    depends_on:
      - datadog-agent
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
EOF
echo "Starting monitoring stack..."
docker-compose -f docker-compose.monitoring.yml up -d
CloudWatch Integration Script
Automate the setup of CloudWatch log groups and streams for Splunk ingestion:
#!/bin/bash
# cloudwatch-splunk-integration.sh - Setup CloudWatch to Splunk integration
SPLUNK_HEC_ENDPOINT="https://your-splunk-instance:8088/services/collector"
SPLUNK_HEC_TOKEN="your-hec-token"
LOG_GROUP_NAME="/aws/lambda/my-application"
# Create CloudWatch log group
aws logs create-log-group --log-group-name "$LOG_GROUP_NAME"
# Create subscription filter for Splunk
aws logs put-subscription-filter \
  --log-group-name "$LOG_GROUP_NAME" \
  --filter-name "SplunkSubscriptionFilter" \
  --filter-pattern "" \
  --destination-arn "arn:aws:lambda:us-east-1:123456789012:function:splunk-cloudwatch-processor"
# Create Lambda function for log processing
cat > lambda-function.py << 'EOF'
import json
import gzip
import base64
import os
import requests  # note: not in the Lambda base runtime; package it with the function

# The heredoc writing this file is quoted, so shell variables are not expanded here.
# Read the HEC settings from the Lambda function's environment variables instead.
SPLUNK_HEC_ENDPOINT = os.environ["SPLUNK_HEC_ENDPOINT"]
SPLUNK_HEC_TOKEN = os.environ["SPLUNK_HEC_TOKEN"]

def lambda_handler(event, context):
    compressed_payload = base64.b64decode(event['awslogs']['data'])
    uncompressed_payload = gzip.decompress(compressed_payload)
    log_data = json.loads(uncompressed_payload)

    events = []
    for log_event in log_data['logEvents']:
        events.append({
            'time': log_event['timestamp'] / 1000,
            'event': log_event['message'],
            'source': log_data['logGroup'],
            'sourcetype': 'aws:cloudwatch'
        })

    headers = {
        'Authorization': f'Splunk {SPLUNK_HEC_TOKEN}',
        'Content-Type': 'application/json'
    }
    for event in events:
        requests.post(SPLUNK_HEC_ENDPOINT, headers=headers, data=json.dumps(event))

    return {'statusCode': 200}
EOF
echo "CloudWatch to Splunk integration configured successfully"
Health Check and Alerting Script
Create comprehensive health checks across your monitoring infrastructure:
#!/bin/bash
# monitoring-health-check.sh - Verify monitoring stack health
DATADOG_API_KEY="your_api_key"
DATADOG_APP_KEY="your_app_key"
SPLUNK_ENDPOINT="https://your-splunk-instance:8089"
SPLUNK_TOKEN="your_splunk_token"
echo "=== Monitoring Stack Health Check ==="
# Check Datadog API connectivity
datadog_status=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "DD-API-KEY: $DATADOG_API_KEY" \
  -H "DD-APPLICATION-KEY: $DATADOG_APP_KEY" \
  "https://api.datadoghq.com/api/v1/validate")

if [ "$datadog_status" -eq 200 ]; then
  echo "✓ Datadog API: Connected"
else
  echo "✗ Datadog API: Connection failed (HTTP $datadog_status)"
fi

# Check Splunk connectivity
splunk_status=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $SPLUNK_TOKEN" \
  "$SPLUNK_ENDPOINT/services/server/info")

if [ "$splunk_status" -eq 200 ]; then
  echo "✓ Splunk API: Connected"
else
  echo "✗ Splunk API: Connection failed (HTTP $splunk_status)"
fi

# Check agent status on local machine
if systemctl is-active --quiet datadog-agent; then
  echo "✓ Datadog Agent: Running"
else
  echo "✗ Datadog Agent: Not running"
fi

# Check disk space for logs
log_disk_usage=$(df /var/log | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$log_disk_usage" -lt 80 ]; then
  echo "✓ Log Disk Usage: ${log_disk_usage}% (healthy)"
else
  echo "⚠ Log Disk Usage: ${log_disk_usage}% (warning)"
fi
echo "=== Health Check Complete ==="
Monitoring Cost Optimization
AWS monitoring can become expensive quickly. Here are strategies to optimize costs while maintaining visibility:
Selective Metric Collection: Don't collect every available metric. Focus on business-critical indicators and use Datadog's metric filters to reduce ingestion costs.
Log Retention Policies: Implement intelligent log retention. Keep recent logs in real-time systems and archive older data to S3 for compliance.
Sampling Strategies: For high-traffic applications, implement trace sampling in Datadog APM. Sample 10% of traces instead of 100% to reduce costs while maintaining statistical significance.
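As a rough sketch of how that can be wired up with ddtrace (the 10% rate, service name, and decorated handler are illustrative assumptions), the sample rate can be set through the tracer's environment configuration before the app starts:
# Sketch: keep ~10% of traces to control APM ingestion costs.
# Assumes ddtrace reads DD_TRACE_SAMPLE_RATE from the environment at startup;
# set it before the tracer is imported, or export it in the service's environment.
import os
os.environ.setdefault("DD_TRACE_SAMPLE_RATE", "0.1")  # sample ~10% of traces

from ddtrace import tracer  # imported after the env var is set

@tracer.wrap(service="my-web-app", resource="checkout")
def handle_checkout(order_id):
    # Hypothetical request handler - only about 1 in 10 calls produces a full trace
    return f"processed {order_id}"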
Troubleshooting Common Issues
Agent Connection Problems: If agents can't reach Datadog, check security groups and NACLs. The agent needs outbound HTTPS access to Datadog's endpoints.
Missing Metrics: Verify IAM permissions for the CloudWatch integration. The integration role needs proper read permissions for every service you want to monitor (see the sketch after this list for one way to check).
Log Ingestion Delays: Splunk Universal Forwarders can fall behind during high log volume. Configure proper indexer clustering and monitor forwarder queues.
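For the missing-metrics case, one way to sanity-check the integration role is to simulate its permissions with IAM. A rough sketch with boto3 follows; the role ARN and action list are illustrative assumptions:
# Sketch: check whether the Datadog integration role is allowed to call the
# read APIs the integration needs. The role ARN and action names are examples.
import boto3

iam = boto3.client("iam")
response = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/DatadogAWSIntegrationRole",
    ActionNames=[
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "ec2:DescribeInstances",
        "tag:GetResources",
    ],
)
for result in response["EvaluationResults"]:
    print(f"{result['EvalActionName']}: {result['EvalDecision']}")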
Scaling Your Monitoring Strategy
As your AWS infrastructure grows, your monitoring needs will evolve. Plan for:
Multi-Account Strategy: Use Datadog's AWS integration across multiple accounts. Set up cross-account roles for centralized monitoring.
Custom Metrics: Develop application-specific metrics that align with business objectives. Monitor user experience, not just system performance.
Automated Remediation: Integrate monitoring alerts with AWS Lambda or Systems Manager for automatic incident response.
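As a sketch of the automated-remediation idea (the webhook payload shape, field names, and restart command are assumptions, not a specific Datadog format), a small Lambda can receive an alert and restart the affected service through Systems Manager:
# Sketch: auto-remediation Lambda triggered by a monitoring webhook
# (e.g., via API Gateway). The payload fields and remediation command are
# illustrative - adapt them to your alert format and runbook.
import json
import boto3

ssm = boto3.client("ssm")

def lambda_handler(event, context):
    alert = json.loads(event.get("body", "{}"))
    instance_id = alert.get("instance_id")  # hypothetical field in the alert payload
    if not instance_id:
        return {"statusCode": 400, "body": "no instance_id in alert"}

    # Restart the application service on the affected instance
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl restart my-web-app"]},
    )
    return {"statusCode": 200, "body": f"remediation sent to {instance_id}"}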
The key to successful cloud monitoring is continuous iteration. Start with basic coverage, then expand based on your team's actual troubleshooting patterns and business requirements.
To address log ingestion delays in Splunk due to high log volume from Universal Forwarders, follow these practical steps:
✅ 1. Implement Indexer Clustering
Ensure your Splunk deployment can scale with incoming data volume.
Enable indexer clustering (search head + multiple indexers with replication).
This distributes load, improves fault tolerance, and speeds up indexing.
Steps:
In server.conf on each indexer:
[clustering]
mode = slave
master_uri = https://<cluster-master>:8089
replication_port = 9887
Restart Splunk on indexers and connect them to the cluster master node.
✅ 2. Monitor Forwarder Queues
Universal Forwarders use internal queues (parsingQueue, aggregationQueue, etc.); monitor them to detect bottlenecks.
Enable queue monitoring:
Edit inputs.conf on the indexer:
[splunktcp://9997]
queue = parsingQueue
Search internal logs to visualize queues:
index=_internal source=*metrics.log group=queue | timechart avg(current_size) by name
✅ 3. Tune Forwarder Settings
Improve buffer and throughput settings on Universal Forwarders.
Modify outputs.conf:
[tcpout]
maxQueueSize = 512MB
autoLBFrequency = 30
sendCookedData = true
Consider increasing pipeline batch sizes or buffering thresholds.
✅ 4. Enable Load Balancing
If you have multiple indexers, ensure Universal Forwarders load balance across them.
[tcpout:my_indexers]
server = indexer1:9997,indexer2:9997,indexer3:9997
autoLBFrequency = 30
✅ 5. Review and Reduce Noise
High log volume often includes excessive noise.
Use props.conf and transforms.conf (on a heavy forwarder or at the indexing tier) to filter out noisy events before they are indexed.
Only forward necessary logs, especially during peak hours.
✅ 6. Scale Horizontally
If ingestion volume continues to rise:
Add more indexers to distribute load.
Use heavy forwarders for pre-processing if transformation is needed.
The following Python script connects to a Splunk instance using the Splunk REST API, queries the internal metrics logs, and monitors queue delays (e.g., parsingQueue, indexingQueue) from Universal Forwarders in near real time.
✅ Python Script: Splunk Queue Monitoring
import requests
import urllib3

# Suppress SSL warnings (these examples use verify=False for simplicity)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# === Splunk API Configuration ===
SPLUNK_HOST = "https://your-splunk-host:8089"
USERNAME = "admin"
PASSWORD = "your_password"
QUERY = """search index=_internal source=*metrics.log group=queue | stats avg(current_size) by name"""

# === Authenticate to Splunk ===
def get_session_key():
    url = f"{SPLUNK_HOST}/services/auth/login"
    data = {"username": USERNAME, "password": PASSWORD}
    response = requests.post(url, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Login failed: {response.text}")
    return response.text.split("<sessionKey>")[1].split("</sessionKey>")[0]

# === Run Search Job ===
def run_search(session_key, query):
    headers = {"Authorization": f"Splunk {session_key}"}
    search_url = f"{SPLUNK_HOST}/services/search/jobs"
    data = {
        "search": query,
        "exec_mode": "blocking",
        "output_mode": "json"
    }
    response = requests.post(search_url, headers=headers, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Search failed: {response.text}")
    # Blocking mode returns the finished job's sid; results live under that job
    sid = response.json()["sid"]
    return f"/services/search/jobs/{sid}/results"

# === Fetch and Display Results ===
def fetch_results(session_key, results_url):
    headers = {"Authorization": f"Splunk {session_key}"}
    full_url = f"{SPLUNK_HOST}{results_url}?output_mode=json"
    response = requests.get(full_url, headers=headers, verify=False)
    if response.status_code != 200:
        raise Exception(f"Failed to get results: {response.text}")
    return response.json().get("results", [])

# === Main Logic ===
if __name__ == "__main__":
    print("🔍 Connecting to Splunk and monitoring forwarder queues...")
    try:
        session_key = get_session_key()
        results_url = run_search(session_key, QUERY)
        results = fetch_results(session_key, results_url)
        print("\n📊 Queue Sizes (avg):\n")
        for entry in results:
            queue = entry.get("name")
            avg_size = entry.get("avg(current_size)")
            print(f" - {queue}: {avg_size}")
    except Exception as e:
        print(f"❌ Error: {e}")
🔧 Prerequisites
Enable the Splunk REST API on the management port (8089).
Ensure index=_internal is retained and not filtered.
The user must have the search and admin_all_objects capabilities.
📈 Output Example
📊 Queue Sizes (avg):
- parsingQueue: 0.05
- indexingQueue: 1.23
- typingQueue: 0.00
Let's automate monitoring by setting up a script that runs on a schedule (every 60 seconds by default) and sends alerts if any Splunk queue size exceeds a defined threshold.
✅ Scheduled Splunk Queue Monitoring with Alerting (Python)
import requests
import urllib3
import json
import time
from datetime import datetime

# Suppress SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# === CONFIGURATION ===
SPLUNK_HOST = "https://your-splunk-host:8089"
USERNAME = "admin"
PASSWORD = "your_password"
QUEUE_THRESHOLD = 3.0    # alert if avg(current_size) > this value
INTERVAL_SECONDS = 60    # how often to run the check

# Splunk SPL query
QUERY = """search index=_internal source=*metrics.log group=queue earliest=-5m
| stats avg(current_size) as avg_size by name
| where avg_size > 0"""

# === Authenticate to Splunk ===
def get_session_key():
    url = f"{SPLUNK_HOST}/services/auth/login"
    data = {"username": USERNAME, "password": PASSWORD}
    response = requests.post(url, data=data, verify=False)
    if response.status_code != 200:
        raise Exception(f"Login failed: {response.text}")
    return response.text.split("<sessionKey>")[1].split("</sessionKey>")[0]

# === Run a Splunk search job (export endpoint streams results back) ===
def run_search(session_key, query):
    headers = {"Authorization": f"Splunk {session_key}"}
    search_url = f"{SPLUNK_HOST}/services/search/jobs/export"
    data = {
        "search": query,
        "output_mode": "json"
    }
    response = requests.post(search_url, headers=headers, data=data, stream=True, verify=False)
    if response.status_code != 200:
        raise Exception(f"Search failed: {response.text}")
    # Parse streaming JSON output (one JSON object per line)
    results = []
    for line in response.iter_lines():
        if line:
            json_line = json.loads(line.decode('utf-8'))
            if 'result' in json_line:
                results.append(json_line['result'])
    return results

# === Alert Logic ===
def check_for_alerts(results):
    alerts = []
    for entry in results:
        queue = entry.get("name")
        avg_size = float(entry.get("avg_size", 0))
        if avg_size > QUEUE_THRESHOLD:
            alerts.append((queue, avg_size))
    return alerts

# === Main Loop ===
if __name__ == "__main__":
    print(f"🔁 Starting Splunk queue monitoring every {INTERVAL_SECONDS} seconds...\n")
    try:
        session_key = get_session_key()
        while True:
            print(f"⏱️ [{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] Checking queue sizes...")
            results = run_search(session_key, QUERY)
            alerts = check_for_alerts(results)
            if alerts:
                print("🚨 ALERT: Queue thresholds exceeded:")
                for queue, size in alerts:
                    print(f" - {queue}: avg size = {size:.2f}")
            else:
                print("✅ All queues healthy.\n")
            time.sleep(INTERVAL_SECONDS)
    except Exception as e:
        print(f"❌ Error: {e}")
📦 Output Example
⏱️ [2025-08-01 15:00:00] Checking queue sizes...
🚨 ALERT: Queue thresholds exceeded:
- indexingQueue: avg size = 6.21
🔔 Alerting Options
Send an email (smtplib)
Trigger a PagerDuty or Datadog event
Log to a file or external system
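For the email option, here's a minimal sketch using Python's smtplib; the SMTP host, credentials, and addresses are placeholder assumptions:
# Sketch: send an email alert when queues exceed the threshold.
# SMTP host, credentials, and addresses are placeholders.
import smtplib
from email.message import EmailMessage

def send_email_alert(alerts):
    msg = EmailMessage()
    msg["Subject"] = "Splunk queue threshold exceeded"
    msg["From"] = "splunk-monitor@example.com"
    msg["To"] = "oncall@example.com"
    body = "\n".join(f"{queue}: avg size = {size:.2f}" for queue, size in alerts)
    msg.set_content(body)

    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("splunk-monitor@example.com", "app_password")
        server.send_message(msg)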
🔔 Triggering External Alerts: PagerDuty and Datadog Events
Once you detect a queue backlog or a service failure, triggering an alert in your incident management system helps notify the right teams quickly.
🛠️ Trigger a PagerDuty Event
✅ Bash Script – Send PagerDuty Incident
#!/bin/bash
PD_ROUTING_KEY="your_integration_key"   # From Events API v2 integration
SUMMARY="High log queue detected on Splunk Forwarder"
SOURCE="splunk-monitor"
SEVERITY="critical"
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H "Content-Type: application/json" \
-d @- <<EOF
{
  "routing_key": "$PD_ROUTING_KEY",
  "event_action": "trigger",
  "payload": {
    "summary": "$SUMMARY",
    "source": "$SOURCE",
    "severity": "$SEVERITY"
  }
}
EOF
📝 Replace your_integration_key with your Events API v2 routing key from PagerDuty.
✅ Python Script to Send PagerDuty Incident
import requests
import json
PD_ROUTING_KEY = "your_integration_key"
PD_ENDPOINT = "https://events.pagerduty.com/v2/enqueue"
payload = {
    "routing_key": PD_ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "Splunk queue exceeds threshold",
        "source": "splunk-forwarder-monitor",
        "severity": "critical"
    }
}
response = requests.post(PD_ENDPOINT, json=payload)
print("PagerDuty response:", response.status_code, response.text)
🐶 Trigger a Datadog Event
✅ Bash Script – Send Datadog Event
#!/bin/bash
API_KEY="your_datadog_api_key"
TITLE="Splunk Forwarder Alert"
TEXT="One or more queues are delayed"
PRIORITY="normal"
ALERT_TYPE="error"
curl -X POST "https://api.datadoghq.com/api/v1/events?api_key=${API_KEY}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
  "title": "$TITLE",
  "text": "$TEXT",
  "priority": "$PRIORITY",
  "alert_type": "$ALERT_TYPE",
  "tags": ["source:splunk", "component:forwarder"]
}
EOF
✅ Python Script to Send Datadog Event
import requests
API_KEY = "your_datadog_api_key"
url = f"https://api.datadoghq.com/api/v1/events?api_key={API_KEY}"
data = {
    "title": "Splunk Forwarder Delay Detected",
    "text": "One or more forwarder queues exceeded the threshold.",
    "priority": "normal",
    "alert_type": "error",
    "tags": ["splunk", "monitoring", "forwarder"]
}
response = requests.post(url, json=data)
print("Datadog response:", response.status_code, response.text)
🧠 Best Practices for Event Triggering
Deduplicate Events: Use a consistent dedup_key (PagerDuty) or aggregation key/title (Datadog) to avoid flooding.
Severity Mapping: Adjust severity or alert_type based on queue size or system impact.
Rate Limits: Both APIs have rate limits. Don't trigger events in tight loops; use cool-down intervals.
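Here's a minimal sketch applying both ideas to the PagerDuty trigger above: a stable dedup_key per queue so repeats update the same incident, plus a simple in-memory cool-down. The key format, severity mapping, and interval are illustrative choices:
# Sketch: deduplicated PagerDuty triggers with a per-queue cool-down.
# The dedup_key format, severity mapping, and cool-down interval are examples.
import time
import requests

PD_ENDPOINT = "https://events.pagerduty.com/v2/enqueue"
PD_ROUTING_KEY = "your_integration_key"
COOLDOWN_SECONDS = 600
_last_sent = {}  # queue name -> timestamp of last trigger

def trigger_queue_alert(queue, avg_size):
    now = time.time()
    if now - _last_sent.get(queue, 0) < COOLDOWN_SECONDS:
        return  # still in cool-down; skip to respect rate limits
    payload = {
        "routing_key": PD_ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": f"splunk-queue-{queue}",  # repeats update the same incident
        "payload": {
            "summary": f"Splunk queue {queue} avg size {avg_size:.2f} exceeds threshold",
            "source": "splunk-forwarder-monitor",
            "severity": "critical" if avg_size > 10 else "warning",
        },
    }
    requests.post(PD_ENDPOINT, json=payload)
    _last_sent[queue] = now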
✅ Bringing It All Together
Integrating your Splunk monitoring scripts with Datadog or PagerDuty enables you to automate incident notification at the moment a problem arises. When a queue exceeds a threshold or a log delay is detected:
Your monitoring script checks the condition.
It triggers an event in your incident response tool.
Your team is notified in real-time to investigate or auto-remediate.