Building a Production-Grade Monitoring Stack with Ansible: A Complete Guide

Table of contents
- Introduction
- 1. Prometheus - The Time-Series Database
- 2. Node Exporter - System Health Monitor
- 3. Blackbox Exporter - Website Health Checker
- 4. Alertmanager - Smart Notification System
- 5. DORA Metrics - DevOps Performance Analyzer
- 6. Grafana - Beautiful Dashboard Creator
- Architecture Overview
- Repository
- Infrastructure Overview
- Setting Up Prometheus and Grafana
- DORA Metrics in Detail
- System Monitoring with Node Exporter
- Endpoint Monitoring with Blackbox Exporter
- Project Structure
- Step-by-Step Deployment Guide
- Troubleshooting Guide
- Conclusion
- Contributing
- References

Introduction
In today's cloud-native world, robust monitoring is critical for maintaining reliable systems. This guide walks you through implementing a comprehensive monitoring solution using industry-standard open-source tools. Our stack includes Prometheus for metrics collection, Grafana for visualization, Alertmanager for notifications, and various exporters for gathering system and application metrics.
Terraform provisions the infrastructure, Ansible handles configuration management, and a CI/CD pipeline deploys the Node.js application. Together they make the solution repeatable, maintainable, and easy to deploy across multiple environments.
1. Prometheus - The Time-Series Database
Think of Prometheus as a highly sophisticated data collector and storage system. It's like having a super-powered spreadsheet that:
- Records metrics (measurements) over time
- Stores them in a time-series database
- Allows you to query this data using a powerful language called PromQL
- Can trigger alerts based on conditions you define
Real-world analogy: Imagine a smart thermostat that not only measures temperature but also:
- Records temperature every minute
- Stores historical data
- Can tell you the average temperature over any time period
- Alerts you if the temperature goes too high or low
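To make the idea concrete, here is a minimal sketch of a Prometheus rule file. The group name, rule names, and durations are illustrative assumptions, not taken from the repository. The recording rule precomputes an average over time, and the alerting rule fires when a condition holds for a set duration.

```yaml
groups:
  - name: example_rules            # hypothetical group name
    rules:
      # Recording rule: store the 1-hour average of the 1-minute load average
      # (node_load1 is exposed by Node Exporter)
      - record: instance:node_load1:avg1h
        expr: avg_over_time(node_load1[1h])

      # Alerting rule: the built-in "up" metric is 0 when a scrape fails
      - alert: InstanceDown
        expr: up == 0
        for: 2m                    # illustrative duration
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has failed scrapes for 2 minutes"
```

The expressions are plain PromQL, so they can be tried in the Prometheus expression browser before being committed to a rule file.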
2. Node Exporter - System Health Monitor
Node Exporter is like having a health check-up device for your server. It collects metrics about:
- CPU usage (how hard your server is working)
- Memory usage (how much RAM is being used)
- Disk space (how full your storage is)
- Network traffic (how much data is moving in and out)
- System load (how many tasks are waiting to be processed)
Real-world analogy: Think of it as a car's dashboard that shows:
- Engine temperature
- Fuel level
- Speed
- Oil pressure
- Battery status
3. Blackbox Exporter - Website Health Checker
Blackbox Exporter is like having a website monitoring service that:
- Checks if your website is up and running
- Measures how fast it responds
- Verifies SSL certificates are valid
- Monitors DNS resolution
- Checks if specific endpoints are accessible
Real-world analogy: Imagine a security guard who:
- Checks if the store is open
- Verifies the entrance is accessible
- Makes sure the security system is working
- Monitors response times to customer requests
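As a rough illustration of how these checks are defined, Blackbox Exporter reads probe "modules" from its configuration file. The sketch below assumes two modules; the module names, the timeout, and the example.com query are placeholders rather than the repository's actual settings.

```yaml
# blackbox.yml (illustrative sketch)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []       # an empty list means "any 2xx"
      preferred_ip_protocol: ip4
      fail_if_not_ssl: true        # treat plain-HTTP responses as failures
  dns_lookup:
    prober: dns
    dns:
      query_name: example.com      # placeholder domain
      query_type: A
```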
4. Alertmanager - Smart Notification System
Alertmanager is your intelligent notification manager that:
- Receives alerts from Prometheus
- Groups similar alerts together
- Sends notifications to the right people
- Prevents alert fatigue by managing how often alerts are sent
- Routes different types of alerts to different channels
Real-world analogy: Think of it as a smart receptionist who:
- Receives emergency calls
- Decides which department should handle each issue
- Groups similar problems together
- Makes sure the right person gets notified
- Prevents the same issue from waking up multiple people
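The grouping and routing behaviour described above is expressed in Alertmanager's configuration file. The following is a minimal sketch, assuming Slack as the destination (as in the architecture diagram); the receiver names, channels, timings, and webhook URL are placeholders.

```yaml
# alertmanager.yml (illustrative sketch)
route:
  receiver: slack-default          # fallback receiver
  group_by: [alertname, instance]  # alerts sharing these labels become one notification
  group_wait: 30s                  # wait before sending the first notification for a group
  group_interval: 5m               # wait before notifying about new alerts in a group
  repeat_interval: 4h              # re-send interval while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-critical

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder
        channel: "#monitoring"     # hypothetical channel
        send_resolved: true
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder
        channel: "#incidents"      # hypothetical channel
        send_resolved: true
```

Here `group_by` and `repeat_interval` are the main levers against alert fatigue, while `routes` decides who hears about what.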
5. DORA Metrics - DevOps Performance Analyzer
The DORA (DevOps Research and Assessment) metrics work like a performance analytics system for your development team. They measure:
- How often you deploy code (Deployment Frequency)
- How long it takes to get changes into production (Lead Time for Changes)
- How quickly you can fix problems (Mean Time to Restore)
- How often deployments fail (Change Failure Rate)
Real-world analogy: Imagine a sports analytics system that tracks:
- How many games a team plays (Deployment Frequency)
- How long it takes to get new players ready (Lead Time)
- How quickly they recover from injuries (MTTR)
- How many games they lose due to mistakes (Failure Rate)
6. Grafana - Beautiful Dashboard Creator
Grafana is like having a customizable control center that:
- Creates beautiful visualizations of your metrics
- Combines data from multiple sources
- Allows you to create custom dashboards
- Provides real-time monitoring
- Enables historical data analysis
Real-world analogy: Think of it as a modern car's infotainment system that:
- Shows multiple gauges and graphs
- Displays navigation, music, and vehicle status
- Allows you to customize what information you see
- Updates in real-time
- Shows historical data about your trips
To see how the pieces fit together, imagine the complete monitoring system as a modern hospital:
- Prometheus is like the hospital's central monitoring system that collects all patient data
- Node Exporter is like the vital signs monitor on each patient's bed
- Blackbox Exporter is like the security system checking if all doors and emergency exits are working
- Alertmanager is like the nurse's station that receives alerts and routes them to the right doctors
- DORA Metrics is like the hospital's performance analytics department
- Grafana is like the hospital's command center with multiple screens showing different aspects of hospital operations
When something goes wrong:
- The vital signs monitor (Node Exporter) detects an issue
- The central system (Prometheus) records it
- The nurse's station (Alertmanager) receives the alert
- The right doctor gets notified
- The command center (Grafana) shows the current status
- The analytics department (DORA Metrics) tracks how well the hospital handled the situation
This monitoring stack helps you:
- Know when something is wrong before your users do
- Understand why things went wrong
- Track how well your system is performing
- Measure your team's effectiveness
- Make data-driven decisions about improvements
So with this understanding, let's dive into the architecture overview.
Architecture Overview
DevOps Toolchain
    Terraform (Infrastructure as Code) ──▶ Ansible (Configuration Management) ──▶ GitHub Actions (CI/CD Pipeline)
        │
        ▼
Azure Cloud
└── Virtual Machine
    ├── Node.js App (PM2 Process Manager) ◀── GitHub Webhook Trigger
    │       │
    │       │ exposes metrics
    │       ▼
    └── Monitoring Stack
        ├── Prometheus (Time-series Database) ──▶ Alertmanager (Alert Routing) ──▶ Slack
        └── Exporters (scraped by Prometheus)
            ├── Node Exporter
            ├── Blackbox Exporter
            └── DORA Metrics ◀── GitHub API
Repository
All code for this project is available in the devops-monitoring-stack repository. Feel free to clone, fork, or contribute to the project!
git clone https://github.com/tochinicky/devops-monitoring-stack.git
cd devops-monitoring-stack
CI/CD Pipeline for Node.js Application
graph LR
A[Developer] -->|Push| B[Git Repository]
B -->|Trigger| C[GitHub Actions]
subgraph "CI Pipeline"
C -->|Run| D[Lint Code]
C -->|Run| E[Unit Tests]
C -->|Run| F[Build App]
C -->|Run| G[Generate Artifacts]
end
subgraph "CD Pipeline"
C -->|Deploy to| H[Staging]
H -->|Automated Tests| I[Integration Tests]
I -->|Success| J[Deploy to Production]
end
J -->|Deploy| K[Application Server]
J -->|Update| L[DORA Metrics]
L -->|Tracked in| M[Prometheus]
M -->|Visualized in| N[Grafana]
Infrastructure Overview
graph TB
A[Terraform Code] -->|Plan| B[Infrastructure Plan]
B -->|Apply| C[Azure Resources]
subgraph "Azure Resources"
C -->|Create| D[Virtual Machine]
C -->|Configure| E[Virtual Network]
C -->|Set Up| F[Network Security Groups]
C -->|Allocate| G[Public IP]
C -->|Create| H[Storage Account]
end
C -->|Output| I[VM IP Address]
C -->|Output| J[DNS Configuration]
I -->|Input for| K[Ansible Inventory]
K -->|Used by| L[Ansible Playbook]
L -->|Configures| D
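The hand-off from Terraform to Ansible in the diagram comes down to a single inventory entry. As a sketch, the same information can be written in Ansible's YAML inventory format (the repository itself uses inventory.ini). The IP address below is a documentation placeholder, and `azureuser` matches the admin user in the Terraform example later in this guide.

```yaml
# inventory.yml (hypothetical YAML equivalent of inventory.ini)
all:
  hosts:
    monitor.yourdomain.com:
      ansible_host: 203.0.113.10   # placeholder; use the vm_public_ip Terraform output
      ansible_user: azureuser      # admin user created on the VM
```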
Setting Up Prometheus and Grafana
Prometheus Setup
Prometheus is deployed automatically via the Ansible playbook with these key components:
Binary Installation:
- Prometheus is downloaded and extracted to /usr/local/bin/
- The binary is configured with proper permissions for the prometheus user
Configuration:
- Main config file at /etc/prometheus/prometheus.yml
- Scrape configurations for various targets:
  - Prometheus itself
  - Node Exporter (system metrics)
  - Blackbox Exporter (endpoint availability)
  - DORA Metrics Exporter
Service Management:
- Systemd service file at /etc/systemd/system/prometheus.service
- Data storage at /var/lib/prometheus
- Automatic startup and monitoring
Alert Rules:
- Alert definitions in /etc/prometheus/rules/
- Separate files for different components (node_exporter_alerts.yml, blackbox_alerts.yml, dora_alerts.yml)
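For orientation, a configuration covering the targets listed above might look roughly like the sketch below. It is not the repository's actual file: the intervals and job names are assumptions, 9093 is Alertmanager's default port, and the DORA exporter port is purely hypothetical.

```yaml
# /etc/prometheus/prometheus.yml (illustrative sketch)
global:
  scrape_interval: 15s             # assumed; how often targets are scraped
  evaluation_interval: 15s         # assumed; how often rules are evaluated

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: blackbox_exporter    # the exporter's own metrics, not probe results
    static_configs:
      - targets: ["localhost:9115"]

  - job_name: dora_metrics
    static_configs:
      - targets: ["localhost:9106"]  # hypothetical port; check the exporter's settings
```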
Grafana Setup
Grafana is installed and configured through these steps:
Package Installation:
- APT repository added
- Latest Grafana package installed
Configuration:
- Default configuration at /etc/grafana/grafana.ini
- Service managed by systemd
- Data stored at /var/lib/grafana
Web Access:
- Available via Nginx reverse proxy
- SSL-secured connection
Setting Up Grafana Dashboards
After deployment, you'll need to configure your Grafana dashboards:
Access Grafana:
- Open your browser to https://grafana.yourdomain.com
- Default credentials are admin/admin
- You'll be prompted to set a new password
Add Prometheus as a Data Source:
- Click the gear icon (Configuration) → Data Sources
- Click "Add data source" and select "Prometheus"
- Set the URL to http://localhost:9090
- Set Access to "Server (default)"
- Click "Save & Test"
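If you would rather not click through the UI, Grafana can also load data sources from provisioning files at startup. This is a sketch of the equivalent of the steps above, not necessarily how the repository configures it; the file name is an assumption.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # corresponds to "Server (default)" in the UI
    url: http://localhost:9090
    isDefault: true
```

Provisioning files are read when Grafana starts, so restart the grafana-server service after adding one.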
Import Pre-built Dashboards:
- Click "+" → "Import"
- Enter the dashboard ID:
  - Node Exporter: 1860
  - Blackbox Exporter: 7587
  - DORA Metrics: import from dora_metrics_grafana_dashboard.json
- Select Prometheus as the data source
- Click "Import"
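Dashboards exported as JSON, such as the DORA one, can likewise be loaded from disk through a dashboard provider. The sketch below assumes the JSON files are copied to /var/lib/grafana/dashboards; that directory and the provider name are illustrative choices.

```yaml
# /etc/grafana/provisioning/dashboards/default.yml (hypothetical file name)
apiVersion: 1
providers:
  - name: default                  # hypothetical provider name
    orgId: 1
    folder: ""                     # empty string places dashboards in the General folder
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30      # how often Grafana rescans the directory
    options:
      path: /var/lib/grafana/dashboards   # assumed location of the JSON files
```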
DORA Metrics in Detail
Our monitoring setup tracks the DORA (DevOps Research and Assessment) metrics, which are industry-standard measures of software delivery performance. The DORA Metrics dashboard reports four of them:
Deployment Frequency (DF):
- Measures how often you successfully release to production
- In the example, we see 1.90 deployments/day
- Higher frequency indicates more agile delivery
Lead Time for Changes (LTC):
- Time from code commit to production deployment
- Measures development efficiency and pipeline speed
- The example shows 0.00635 hours (about 23 seconds)
Change Failure Rate (CFR):
- Percentage of deployments causing a failure in production
- Measures code quality and testing effectiveness
- The example shows 58.2%, which would be considered high
Mean Time to Restore (MTTR):
- How quickly service is restored after an incident
- Measures operational efficiency
- The example shows 4.71 hours
These metrics are collected using the DORA Metrics Collector integrated into our monitoring stack.
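Once these values are in Prometheus, they can drive alerts like any other metric. The sketch below shows what a dora_alerts.yml could contain. The metric names are hypothetical, because the names exposed by the DORA Metrics Collector are not shown here, and the thresholds are illustrative.

```yaml
# /etc/prometheus/rules/dora_alerts.yml (sketch; metric names are hypothetical)
groups:
  - name: dora_alerts
    rules:
      - alert: HighChangeFailureRate
        expr: dora_change_failure_rate_percent > 15    # hypothetical metric name
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Change failure rate is above 15%"

      - alert: SlowServiceRestore
        expr: dora_mean_time_to_restore_hours > 24      # hypothetical metric name
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Mean time to restore is above 24 hours"
```

Check the exporter's /metrics output for the real metric names and units before adapting these rules.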
System Monitoring with Node Exporter
The Node Exporter dashboard provides comprehensive system metrics. Key metrics displayed include:
- CPU utilization (2.3% in the example)
- Memory usage (27.4% RAM used)
- Disk usage (20.6% root filesystem)
- System load, uptime, and resource consumption trends
This dashboard helps identify system bottlenecks and resource constraints.
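The same metrics that feed the dashboard can back alert rules. Below is a minimal sketch of a node_exporter_alerts.yml built on standard Node Exporter metrics; the thresholds, durations, and alert names are assumptions and should be tuned to your workload.

```yaml
# /etc/prometheus/rules/node_exporter_alerts.yml (illustrative thresholds)
groups:
  - name: node_exporter_alerts
    rules:
      - alert: HighCpuUsage
        # Percentage of CPU time not spent idle, averaged per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      - alert: RootDiskFillingUp
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem above 85% on {{ $labels.instance }}"
```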
Endpoint Monitoring with Blackbox Exporter
The Blackbox Exporter dashboard monitors endpoint availability and performance. Key features include:
- Website status (UP/DOWN)
- Response time monitoring (17.2ms average probe duration)
- SSL certificate validity (2 months, 4 weeks, 1 day remaining)
- HTTP status codes (200 in the example)
- DNS lookup performance (3.32ms)
This dashboard helps detect availability issues and performance degradation of web endpoints.
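Probe results reach Prometheus through a scrape job that passes each target URL to the exporter. The sketch below follows the relabeling pattern from the Blackbox Exporter documentation. It assumes a module named http_2xx and the exporter on localhost:9115; the target URL is a placeholder.

```yaml
# Fragment for prometheus.yml (illustrative sketch)
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]           # assumed module name in blackbox.yml
    static_configs:
      - targets:
          - https://example.com    # placeholder endpoint to probe
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter itself
      - target_label: __address__
        replacement: localhost:9115
```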
Project Structure
project-root/
├── terraform/                    # Infrastructure as Code
│   ├── main.tf                   # Main Terraform configuration
│   ├── variables.tf              # Input variables
│   ├── outputs.tf                # Output values
│   └── modules/                  # Reusable modules
│
├── monitoring_stack/             # Ansible role for monitoring
│   ├── inventory.ini             # Server inventory file
│   ├── playbook.yml              # Main Ansible playbook
│   └── roles/
│       └── monitoring_stack/     # Monitoring stack role
│           ├── defaults/         # Default variables
│           ├── files/            # Static files
│           ├── handlers/         # Service handlers
│           ├── tasks/            # Task definitions
│           └── templates/        # Configuration templates
│
├── node-app/                     # Node.js application
│   ├── src/                      # Source code
│   ├── tests/                    # Unit and integration tests
│   ├── Dockerfile                # Container definition
│   └── .github/workflows/        # CI/CD pipeline definitions
│
└── README.md                     # Project documentation
Step-by-Step Deployment Guide
1. Infrastructure Provisioning with Terraform
Before deploying the monitoring stack, you need to provision the infrastructure using Terraform:
# Initialize Terraform
cd terraform
terraform init
# Preview the changes
terraform plan -var="vm_size=Standard_D2s_v3" -var="location=eastus"
# Apply the changes
terraform apply -var="vm_size=Standard_D2s_v3" -var="location=eastus"
# Take note of the outputs
terraform output vm_public_ip
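Example Terraform configuration (main.tf):
# var.location and var.vm_size are expected to be declared in variables.tf
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "monitoring" {
  name     = "monitoring-resources"
  location = var.location
}

resource "azurerm_virtual_network" "monitoring" {
  name                = "monitoring-network"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name
}

resource "azurerm_subnet" "monitoring" {
  name                 = "internal"
  resource_group_name  = azurerm_resource_group.monitoring.name
  virtual_network_name = azurerm_virtual_network.monitoring.name
  address_prefixes     = ["10.0.2.0/24"]
}

# Public IP and network interface. The VM and the output below reference
# these two resources; the definitions here are assumed, minimal versions.
resource "azurerm_public_ip" "monitoring" {
  name                = "monitoring-public-ip"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name
  allocation_method   = "Static"
}

resource "azurerm_network_interface" "monitoring" {
  name                = "monitoring-nic"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.monitoring.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.monitoring.id
  }
}

# Network security group
# Note: SSH is open to any source address here; consider restricting it.
resource "azurerm_network_security_group" "monitoring" {
  name                = "monitoring-nsg"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name

  security_rule {
    name                       = "SSH"
    priority                   = 1001
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HTTP"
    priority                   = 1002
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "80"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HTTPS"
    priority                   = 1003
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

# Attach the security group to the network interface (assumed; without an
# association the rules above are not applied to the VM)
resource "azurerm_network_interface_security_group_association" "monitoring" {
  network_interface_id      = azurerm_network_interface.monitoring.id
  network_security_group_id = azurerm_network_security_group.monitoring.id
}

# Virtual machine
resource "azurerm_linux_virtual_machine" "monitoring" {
  name                = "monitoring-vm"
  resource_group_name = azurerm_resource_group.monitoring.name
  location            = azurerm_resource_group.monitoring.location
  size                = var.vm_size
  admin_username      = "azureuser"

  network_interface_ids = [
    azurerm_network_interface.monitoring.id,
  ]

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

# Output the public IP
output "vm_public_ip" {
  value = azurerm_public_ip.monitoring.ip_address
}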
2. CI/CD for Node.js Application
The Node.js application is deployed using a CI/CD pipeline with GitHub Actions:
# .github/workflows/main.yml
name: Node.js CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Use Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18.x'
      - name: Install dependencies
        run: npm ci
      - name: Lint code
        run: npm run lint
      - name: Run tests
        run: npm test
      - name: Build application
        run: npm run build
      - name: Package application
        run: |
          tar -czf app.tar.gz dist/ package.json package-lock.json
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: app-package
          path: app.tar.gz

  deploy:
    needs: build
    runs-on: ubuntu-latest
    # Only deploy pushes to main; pull request builds stop after the build job
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: app-package
      - name: Deploy to production
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.HOST }}
          username: ${{ secrets.USERNAME }}
          key: ${{ secrets.SSH_KEY }}
          # This script runs on the remote host, so app.tar.gz must already
          # have been copied there; /path/to is a placeholder for that location
          script: |
            mkdir -p /opt/node-app
            mv /path/to/app.tar.gz /opt/node-app/
            cd /opt/node-app
            tar -xzf app.tar.gz
            npm install --production
            pm2 restart app || pm2 start dist/index.js --name app
      - name: Update DORA metrics
        run: |
          curl -X POST https://dora.yourdomain.com/api/deployment \
            -H "Content-Type: application/json" \
            -d '{"service":"node-app","version":"${{ github.sha }}","status":"success"}'
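One gap in the workflow above is that the artifact is downloaded onto the GitHub runner, while the deploy script runs on the server. A possible way to bridge this, sketched here as an assumption rather than the repository's approach, is an extra step using the appleboy/scp-action action between the download and the SSH step. The target directory is chosen to match the deploy script.

```yaml
# Hypothetical extra step for the deploy job, placed after the artifact download
- name: Copy artifact to the server
  uses: appleboy/scp-action@master
  with:
    host: ${{ secrets.HOST }}
    username: ${{ secrets.USERNAME }}
    key: ${{ secrets.SSH_KEY }}
    source: app.tar.gz             # file downloaded onto the runner
    target: /opt/node-app          # remote directory; the SSH user needs write access
```

With this step in place, the `mv /path/to/app.tar.gz` line in the deploy script would no longer be needed, since the archive already sits in /opt/node-app.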
3. Monitoring Stack Deployment with Ansible
Finally, deploy the monitoring stack using Ansible:
# Update inventory with VM IP from Terraform (run from the project root)
echo "monitor.yourdomain.com ansible_host=$(terraform -chdir=terraform output -raw vm_public_ip)" > monitoring_stack/inventory.ini
# Run the playbook
cd monitoring_stack
ansible-playbook -i inventory.ini playbook.yml
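For reference, a playbook that applies the role from the project structure can be very small. This is a sketch of what playbook.yml might contain, not a copy of the repository's file; the play name and the use of privilege escalation are assumptions.

```yaml
# monitoring_stack/playbook.yml (illustrative sketch)
- name: Deploy the monitoring stack
  hosts: all
  become: true                     # installing packages and systemd units needs root
  roles:
    - monitoring_stack
```

Adding `--check --diff` to the ansible-playbook command previews the changes without applying them.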
Troubleshooting Guide
Service Not Starting
Check the service status and logs:
systemctl status service_name
journalctl -u service_name -n 100
Common issues:
- Permission problems: Check ownership of directories
- Configuration errors: Validate config files
- Port conflicts: Ensure required ports are available
SSL Certificate Issues
If you encounter SSL certificate problems:
# Check certificate status
certbot certificates
# Test renewal without changing any certificates
certbot renew --dry-run
# Renew certificates that are close to expiry
certbot renew
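Certificate problems are easier to handle when you are warned ahead of time. As a sketch, the Blackbox Exporter metrics `probe_success` and `probe_ssl_earliest_cert_expiry` can back rules like the ones below; the 14-day window, durations, and alert names are illustrative assumptions.

```yaml
# /etc/prometheus/rules/blackbox_alerts.yml (illustrative sketch)
groups:
  - name: blackbox_alerts
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"

      - alert: SslCertificateExpiringSoon
        # Seconds until expiry, compared against 14 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 14 days"
```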
Prometheus Not Scraping Metrics
Verify target accessibility:
- curl http://localhost:9100/metrics
- curl http://localhost:9115/metrics
Check Prometheus configuration:
promtool check config /etc/prometheus/prometheus.yml
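The same check can be built into the deployment so that a broken configuration never reaches the server. The sketch below uses the `validate` option of Ansible's template module. The template name, file ownership, handler name, and promtool path are assumptions about the role, not its actual contents.

```yaml
# roles/monitoring_stack/tasks/main.yml (fragment, illustrative)
- name: Render Prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2         # hypothetical template name
    dest: /etc/prometheus/prometheus.yml
    owner: prometheus
    group: prometheus
    mode: "0644"
    # %s is replaced with a temporary copy; the file is only installed if the check passes
    validate: /usr/local/bin/promtool check config %s
  notify: Restart prometheus

---
# roles/monitoring_stack/handlers/main.yml (fragment, illustrative)
- name: Restart prometheus
  ansible.builtin.systemd:
    name: prometheus
    state: restarted
```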
Conclusion
This monitoring stack provides a robust, enterprise-grade solution that gives you visibility into your infrastructure and applications. By leveraging Terraform for infrastructure provisioning, Ansible for configuration management, and integrating the DORA Metrics Collector, you can easily deploy and maintain this stack across multiple environments with minimal effort.
Key benefits:
- Comprehensive Monitoring: System, network, and application metrics
- Proactive Alerting: Immediate notification of issues
- Secure Access: SSL encryption and user authentication
- Automated Deployment: Consistent and reproducible setup
- Scalable Architecture: Easy to extend for additional services
Contributing
Interested in contributing to this project? Check out the devops-monitoring-stack repository on GitHub. Your contributions, bug reports, and feature requests are welcome!
References
Written by Tochukwu Onyeamah