Introduction

In today's cloud-native world, robust monitoring is critical for maintaining reliable systems. This guide walks you through implementing a comprehensive monitoring solution using industry-standard open-source tools. Our stack includes Prometheus for metrics collection, Grafana for visualization, Alertmanager for notifications, and various exporters for gathering system and application metrics.

The infrastructure is provisioned with Terraform and configured with Ansible, and the Node.js application ships through a CI/CD pipeline. Together these make the solution repeatable, maintainable, and easy to deploy across multiple environments.

1. Prometheus - The Time-Series Database

Think of Prometheus as a highly sophisticated data collector and storage system. It's like having a super-powered spreadsheet that:

Records metrics (measurements) over time
Stores them in a time-series database
Allows you to query this data using a powerful language called PromQL
Can trigger alerts based on conditions you define

Real-world analogy: Imagine a smart thermostat that not only measures temperature but also:

Records temperature every minute
Stores historical data
Can tell you the average temperature over any time period
Alerts you if the temperature goes too high or low
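
To make the thermostat analogy concrete, here is a minimal sketch in plain Python. It is not how Prometheus is implemented: the `Sample` type, the `average_over` and `is_too_hot` helpers, and the readings are all made up for illustration. It stores timestamped readings, averages them over a time window (loosely what PromQL's `avg_over_time` does for a range of samples), and flags a threshold breach.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """One measurement: a timestamp in seconds and the value read at that time."""
    timestamp: float
    value: float


def average_over(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Average of all samples with start <= timestamp <= end, or None if there are none."""
    window = [s.value for s in samples if start <= s.timestamp <= end]
    if not window:
        return None
    return sum(window) / len(window)


def is_too_hot(samples: List[Sample], start: float, end: float, threshold: float) -> bool:
    """Alert condition: the windowed average is above the threshold."""
    average = average_over(samples, start, end)
    return average is not None and average > threshold


# One hypothetical reading per minute for five minutes (degrees Celsius).
readings = [Sample(60.0 * i, t) for i, t in enumerate([20.0, 21.0, 22.0, 27.0, 30.0])]

print(average_over(readings, 0.0, 240.0))                  # 24.0
print(is_too_hot(readings, 120.0, 240.0, threshold=25.0))  # True
```

In the real stack, Prometheus does the storing and querying, and the alert condition lives in a rule file rather than in application code.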

2. Node Exporter - System Health Monitor

Node Exporter is like having a health check-up device for your server. It collects metrics about:

CPU usage (how hard your server is working)
Memory usage (how much RAM is being used)
Disk space (how full your storage is)
Network traffic (how much data is moving in and out)
System load (how many tasks are waiting to be processed)

Real-world analogy: Think of it as a car's dashboard that shows:

Engine temperature
Fuel level
Speed
Oil pressure
Battery status
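
Node Exporter publishes its readings as plain text over HTTP, one metric per line. The sketch below parses a small hand-written sample in that text format and derives a memory-usage percentage. The metric names `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, and `node_load1` are ones Node Exporter exposes on Linux; the values are invented, and the parser is deliberately simplified (it skips any line that carries labels).

```python
from typing import Dict

# Hand-written sample in the Prometheus text exposition format (values are made up).
SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.0e+09
node_load1 0.42
"""


def parse_unlabelled(text: str) -> Dict[str, float]:
    """Return {metric_name: value} for lines without labels; comments are ignored."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            metrics[parts[0]] = float(parts[1])
    return metrics


def memory_used_percent(metrics: Dict[str, float]) -> float:
    """Share of RAM in use, computed from total and available bytes."""
    total = metrics["node_memory_MemTotal_bytes"]
    available = metrics["node_memory_MemAvailable_bytes"]
    return 100.0 * (total - available) / total


metrics = parse_unlabelled(SAMPLE)
print(memory_used_percent(metrics))  # 25.0
```

Prometheus performs this parsing itself on every scrape; the sketch only shows what the data on the wire looks like and how a dashboard value can be derived from it.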

3. Blackbox Exporter - Website Health Checker

Blackbox Exporter is like having a website monitoring service that:

Checks if your website is up and running
Measures how fast it responds
Verifies SSL certificates are valid
Monitors DNS resolution
Checks if specific endpoints are accessible

Real-world analogy: Imagine a security guard who:

Checks if the store is open
Verifies the entrance is accessible
Makes sure the security system is working
Monitors response times to customer requests
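
Unlike Node Exporter, Blackbox Exporter is asked to probe a target on demand: a request to its `/probe` endpoint names the target and the probe module, and the response reports the outcome as metrics. The sketch below builds such a probe URL and summarizes a hand-written response. The exporter address `http://localhost:9115` and the `http_2xx` module are the exporter's usual defaults and are assumptions about this setup; the sample values are illustrative.

```python
from typing import Dict
from urllib.parse import urlencode


def probe_url(target: str, module: str = "http_2xx",
              exporter: str = "http://localhost:9115") -> str:
    """URL that asks Blackbox Exporter to probe `target` with the given module."""
    return f"{exporter}/probe?{urlencode({'target': target, 'module': module})}"


def summarize_probe(metrics_text: str) -> Dict[str, float]:
    """Pick the headline probe metrics out of a /probe response body."""
    wanted = ("probe_success", "probe_duration_seconds", "probe_http_status_code")
    summary: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            summary[parts[0]] = float(parts[1])
    return summary


# Hand-written sample response (values are illustrative).
SAMPLE_RESPONSE = """\
# HELP probe_success Displays whether or not the probe was a success
probe_success 1
probe_duration_seconds 0.0172
probe_http_status_code 200
"""

print(probe_url("https://example.com"))
result = summarize_probe(SAMPLE_RESPONSE)
print("UP" if result["probe_success"] == 1 else "DOWN")
```

In practice Prometheus issues these probe requests itself on every scrape, so the URL-building step is normally expressed in its scrape configuration rather than in code.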

4. Alertmanager - Smart Notification System

Alertmanager is your intelligent notification manager that:

Receives alerts from Prometheus
Groups similar alerts together
Sends notifications to the right people
Prevents alert fatigue by managing how often alerts are sent
Routes different types of alerts to different channels

Real-world analogy: Think of it as a smart receptionist who:

Receives emergency calls
Decides which department should handle each issue
Groups similar problems together
Makes sure the right person gets notified
Prevents the same issue from waking up multiple people
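
The grouping and routing ideas can be sketched in a few lines of Python. This is an illustration of the concepts only, not Alertmanager's implementation: the `group_alerts` and `pick_receiver` helpers, the alert names, and the receiver names are hypothetical. Alerts that share the chosen label values land in one group (one notification instead of many), and the first matching route decides who is notified.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, Dict[str, str]]  # {"labels": {...}}


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the `group_by` labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def pick_receiver(labels: Dict[str, str],
                  routes: List[Tuple[Dict[str, str], str]],
                  default: str) -> str:
    """Return the receiver of the first route whose labels all match, else the default."""
    for match, receiver in routes:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return default


alerts = [
    {"labels": {"alertname": "HighCpu", "instance": "vm1", "severity": "warning"}},
    {"labels": {"alertname": "HighCpu", "instance": "vm2", "severity": "warning"}},
    {"labels": {"alertname": "EndpointDown", "instance": "https://example.com", "severity": "critical"}},
]
routes = [({"severity": "critical"}, "slack-critical")]  # hypothetical receiver names

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    receiver = pick_receiver(members[0]["labels"], routes, default="slack-default")
    print(key, len(members), receiver)
```

Here the two `HighCpu` alerts collapse into a single group sent to the default receiver, while the critical alert is routed separately.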

5. DORA Metrics - DevOps Performance Analyzer

DORA (DevOps Research and Assessment) Metrics is like having a performance analytics system for your development team that measures:

How often you deploy code (Deployment Frequency)
How long it takes to get changes into production (Lead Time for Changes)
How quickly you can restore service after a failure (Mean Time to Restore, MTTR)
How often deployments fail (Change Failure Rate)

Real-world analogy: Imagine a sports analytics system that tracks:

How many games a team plays (Deployment Frequency)
How long it takes to get new players ready (Lead Time)
How quickly they recover from injuries (MTTR)
How many games they lose due to mistakes (Failure Rate)
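
As a rough illustration of how the four numbers are derived, the sketch below computes them from a list of deployment records. The `Deployment` record and its field names are assumptions made for this example, not the schema used by the project's collector, and it uses simple means where a real implementation might prefer medians.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_summary(deployments: List[Deployment], period_days: float) -> Dict[str, Optional[float]]:
    """Compute the four DORA metrics over a period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    total = len(deployments)
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": total / period_days,
        "lead_time_hours": sum(lead_hours) / total,
        "change_failure_rate_percent": 100.0 * len(failures) / total,
        "time_to_restore_hours": sum(restore_hours) / len(restore_hours) if restore_hours else None,
    }


# Four hypothetical deployments over two days.
start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(hours=1)),
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=3), failed=True,
               restored_at=start + timedelta(hours=6)),
    Deployment(start, start + timedelta(hours=2)),
]
summary = dora_summary(history, period_days=2)
print(summary)
```

With this sample data the result is 2 deployments per day, a 2-hour lead time, a 25% change failure rate, and a 3-hour time to restore.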

6. Grafana - Beautiful Dashboard Creator

Grafana is like having a customizable control center that:

Creates beautiful visualizations of your metrics
Combines data from multiple sources
Allows you to create custom dashboards
Provides real-time monitoring
Enables historical data analysis

Real-world analogy: Think of it as a modern car's infotainment system that:

Shows multiple gauges and graphs
Displays navigation, music, and vehicle status
Allows you to customize what information you see
Updates in real-time
Shows historical data about your trips

To see how the pieces fit together, picture the complete monitoring system as a modern hospital:

Prometheus is like the hospital's central monitoring system that collects all patient data
Node Exporter is like the vital signs monitor on each patient's bed
Blackbox Exporter is like the security system checking if all doors and emergency exits are working
Alertmanager is like the nurse's station that receives alerts and routes them to the right doctors
DORA Metrics is like the hospital's performance analytics department
Grafana is like the hospital's command center with multiple screens showing different aspects of hospital operations

When something goes wrong:

The vital signs monitor (Node Exporter) detects an issue
The central system (Prometheus) records it
The nurse's station (Alertmanager) receives the alert
The right doctor gets notified
The command center (Grafana) shows the current status
The analytics department (DORA Metrics) tracks how well the hospital handled the situation

This monitoring stack helps you:

Know when something is wrong before your users do
Understand why things went wrong
Track how well your system is performing
Measure your team's effectiveness
Make data-driven decisions about improvements

So with this understanding, let's dive into the architecture overview.

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        DevOps Toolchain                          │
│                                                                  │
│  ┌────────────┐    ┌────────────┐    ┌────────────────────────┐  │
│  │  Terraform │    │   Ansible  │    │    GitHub Actions      │  │
│  │ Infrastructure│  │Configuration│   │     CI/CD Pipeline     │  │
│  │  as Code    │───▶│ Management │───▶│                        │  │
│  └────────────┘    └────────────┘    └────────────────────────┘  │
│         │                │                      │                 │
└─────────┼────────────────┼──────────────────────┼─────────────────┘
          │                │                      │
          ▼                ▼                      ▼
┌──────────────────────────────────────────────────────────────────┐
│                          Azure Cloud                             │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                   Virtual Machine                          │  │
│  │                                                            │  │
│  │  ┌────────────────┐        ┌───────────────────────────┐   │  │
│  │  │   Node.js App  │◀───────│   GitHub Webhook Trigger  │   │  │
│  │  │                │        │                           │   │  │
│  │  │ ┌────────────┐ │        └───────────────────────────┘   │  │
│  │  │ │PM2 Process │ │                                        │  │
│  │  │ │  Manager   │ │                                        │  │
│  │  │ └────────────┘ │                                        │  │
│  │  └────────┬───────┘                                        │  │
│  │           │         Exposes                                │  │
│  │           └─────────Metrics─────┐                          │  │
│  │                                 │                          │  │
│  │  ┌─────────────────────────────▼──────────────────────┐   │  │
│  │  │                  Monitoring Stack                   │   │  │
│  │  │                                                     │   │  │
│  │  │  ┌──────────────┐    ┌──────────────┐    ┌───────┐ │   │  │
│  │  │  │  Prometheus  │───▶│ Alertmanager │───▶│ Slack │ │   │  │
│  │  │  │ Time-series  │    │  Alert       │    │       │ │   │  │
│  │  │  │   Database   │◀───┤  Routing     │    └───────┘ │   │  │
│  │  │  └──────┬───────┘    └──────────────┘              │   │  │
│  │  │         │                                          │   │  │
│  │  │         │    ┌────────────────┐                    │   │  │
│  │  │         └───▶│  Exporters     │                    │   │  │
│  │  │              │                │                    │   │  │
│  │  │              │ ┌────────────┐ │                    │   │  │
│  │  │              │ │    Node    │ │                    │   │  │
│  │  │              │ │  Exporter  │ │                    │   │  │
│  │  │              │ └────────────┘ │                    │   │  │
│  │  │              │ ┌────────────┐ │                    │   │  │
│  │  │              │ │  Blackbox  │ │                    │   │  │
│  │  │              │ │  Exporter  │ │                    │   │  │
│  │  │              │ └────────────┘ │                    │   │  │
│  │  │              │ ┌────────────┐ │                    │   │  │
│  │  │              │ │    DORA    │ │                    │   │  │
│  │  │              │ │   Metrics  │◀┼────GitHub API      │   │  │
│  │  │              │ └────────────┘ │                    │   │  │
│  │  │              └────────────────┘                    │   │  │
│  │  └─────────────────────────────────────────────────────┘   │  │
│  │                                                            │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

Repository

All code for this project is available in the devops-monitoring-stack repository. Feel free to clone, fork, or contribute to the project!

git clone https://github.com/tochinicky/devops-monitoring-stack.git
cd devops-monitoring-stack

CI/CD Pipeline for Node.js Application

graph LR
    A[Developer] -->|Push| B[Git Repository]
    B -->|Trigger| C[GitHub Actions]

    subgraph "CI Pipeline"
        C -->|Run| D[Lint Code]
        C -->|Run| E[Unit Tests]
        C -->|Run| F[Build App]
        C -->|Run| G[Generate Artifacts]
    end

    subgraph "CD Pipeline"
        C -->|Deploy to| H[Staging]
        H -->|Automated Tests| I[Integration Tests]
        I -->|Success| J[Deploy to Production]
    end

    J -->|Deploy| K[Application Server]
    J -->|Update| L[DORA Metrics]
    L -->|Tracked in| M[Prometheus]
    M -->|Visualized in| N[Grafana]

Infrastructure Overview

graph TB
    A[Terraform Code] -->|Plan| B[Infrastructure Plan]
    B -->|Apply| C[Azure Resources]

    subgraph "Azure Resources"
        C -->|Create| D[Virtual Machine]
        C -->|Configure| E[Virtual Network]
        C -->|Set Up| F[Network Security Groups]
        C -->|Allocate| G[Public IP]
        C -->|Create| H[Storage Account]
    end

    C -->|Output| I[VM IP Address]
    C -->|Output| J[DNS Configuration]

    I -->|Input for| K[Ansible Inventory]
    K -->|Used by| L[Ansible Playbook]
    L -->|Configures| D

Setting Up Prometheus and Grafana

Prometheus Setup

Prometheus is deployed automatically via the Ansible playbook with these key components:

Binary Installation:
- Prometheus is downloaded and extracted to /usr/local/bin/
- The binary is configured with proper permissions for the prometheus user
Configuration:
- Main config file at /etc/prometheus/prometheus.yml
- Scrape configurations for various targets:
  - Prometheus itself
  - Node Exporter (system metrics)
  - Blackbox Exporter (endpoint availability)
  - DORA Metrics Exporter
Service Management:
- Systemd service file at /etc/systemd/system/prometheus.service
- Data storage at /var/lib/prometheus
- Automatic startup and monitoring
Alert Rules:
- Alert definitions in /etc/prometheus/rules/
- Separate files for different components (node_exporter_alerts.yml, blackbox_alerts.yml, dora_alerts.yml)

Grafana Setup

Grafana is installed and configured through these steps:

Package Installation:
- APT repository added
- Latest Grafana package installed
Configuration:
- Default configuration at /etc/grafana/grafana.ini
- Service managed by systemd
- Data stored at /var/lib/grafana
Web Access:
- Available via Nginx reverse proxy
- SSL-secured connection

Setting Up Grafana Dashboards

After deployment, you'll need to configure your Grafana dashboards:

Access Grafana:
- Open your browser to https://grafana.yourdomain.com
- Default credentials are admin/admin
- You'll be prompted to set a new password
Add Prometheus as a Data Source:
- Click the gear icon (Configuration) → Data Sources
- Click "Add data source" and select "Prometheus"
- Set the URL to http://localhost:9090
- Set Access to "Server (default)"
- Click "Save & Test"
Import Pre-built Dashboards:
- Click "+" → "Import"
- Enter dashboard ID:
  - Node Exporter: 1860
  - Blackbox Exporter: 7587
  - DORA Metrics: Import from dora_metrics_grafana_dashboard.json
- Select Prometheus as the data source
- Click "Import"
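
The "Add Prometheus as a Data Source" step can also be scripted against Grafana's HTTP API instead of clicked through. The sketch below only builds the request; it assumes basic authentication is enabled, uses the placeholder host and credentials from the steps above, and the helper names are made up for this example. The `"access": "proxy"` value is the API equivalent of the "Server (default)" option in the UI.

```python
import base64
import json
from typing import Dict
from urllib.request import Request


def datasource_payload(name: str = "Prometheus",
                       url: str = "http://localhost:9090") -> Dict[str, object]:
    """JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",   # "Server (default)" in the Grafana UI
        "isDefault": True,
    }


def build_request(grafana_url: str, user: str, password: str,
                  payload: Dict[str, object]) -> Request:
    """Prepare (but do not send) a POST to Grafana's /api/datasources endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder host and credentials; replace with your own before sending.
request = build_request("https://grafana.yourdomain.com", "admin", "admin",
                        datasource_payload())
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. Scripting this step keeps the data source identical across environments, which fits the automated approach used for the rest of the stack.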

DORA Metrics in Detail

Our monitoring setup tracks DORA metrics, which are industry-standard measures of development team performance.

DORA Metrics Dashboard

The DORA Metrics dashboard provides key DevOps Research and Assessment (DORA) metrics to measure software delivery performance:

Deployment Frequency (DF):
- Measures how often you successfully release to production
- In the example, we see 1.90 deployments/day
- Higher frequency indicates more agile delivery
Lead Time for Changes (LTC):
- Time from code commit to production deployment
- Measures development efficiency and pipeline speed
- The example shows 0.00635 hours (about 23 seconds)
Change Failure Rate (CFR):
- Percentage of deployments causing a failure in production
- Measures code quality and testing effectiveness
- The example shows 58.2%, which would be considered high
Mean Time to Restore (MTTR):
- How quickly service is restored after an incident
- Measures operational efficiency
- The example shows 4.71 hours

These metrics are collected using the DORA Metrics Collector integrated into our monitoring stack.

System Monitoring with Node Exporter

The Node Exporter dashboard provides comprehensive system metrics. Key metrics displayed include:

CPU utilization (2.3% in the example)
Memory usage (27.4% RAM used)
Disk usage (20.6% root filesystem)
System load, uptime, and resource consumption trends

This dashboard helps identify system bottlenecks and resource constraints.

Endpoint Monitoring with Blackbox Exporter

The Blackbox Exporter dashboard monitors endpoint availability and performance. Key features include:

Website status (UP/DOWN)
Response time monitoring (17.2ms average probe duration)
SSL certificate validity (2 months, 4 weeks, 1 day remaining)
HTTP status codes (200 in the example)
DNS lookup performance (3.32ms)

This dashboard helps detect availability issues and performance degradation of web endpoints.

Project Structure

project-root/
├── terraform/                # Infrastructure as Code
│   ├── main.tf              # Main Terraform configuration
│   ├── variables.tf         # Input variables
│   ├── outputs.tf           # Output values
│   └── modules/             # Reusable modules
│
├── monitoring_stack/         # Ansible role for monitoring
│   ├── inventory.ini        # Server inventory file
│   ├── playbook.yml         # Main Ansible playbook
│   └── roles/
│       └── monitoring_stack/ # Monitoring stack role
│           ├── defaults/    # Default variables
│           ├── files/       # Static files
│           ├── handlers/    # Service handlers
│           ├── tasks/       # Task definitions
│           └── templates/   # Configuration templates
│
├── node-app/                 # Node.js application
│   ├── src/                 # Source code
│   ├── tests/               # Unit and integration tests
│   ├── Dockerfile           # Container definition
│   └── .github/workflows/   # CI/CD pipeline definitions
│
└── README.md                 # Project documentation

Step-by-Step Deployment Guide

1. Infrastructure Provisioning with Terraform

Before deploying the monitoring stack, you need to provision the infrastructure using Terraform:

# Initialize Terraform
cd terraform
terraform init

# Preview the changes
terraform plan -var="vm_size=Standard_D2s_v3" -var="location=eastus"

# Apply the changes
terraform apply -var="vm_size=Standard_D2s_v3" -var="location=eastus"

# Take note of the outputs
terraform output vm_public_ip

Example Terraform configuration (main.tf):

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "monitoring" {
  name     = "monitoring-resources"
  location = var.location
}

resource "azurerm_virtual_network" "monitoring" {
  name                = "monitoring-network"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name
}

resource "azurerm_subnet" "monitoring" {
  name                 = "internal"
  resource_group_name  = azurerm_resource_group.monitoring.name
  virtual_network_name = azurerm_virtual_network.monitoring.name
  address_prefixes     = ["10.0.2.0/24"]
}

# Network security group
resource "azurerm_network_security_group" "monitoring" {
  name                = "monitoring-nsg"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name

  security_rule {
    name                       = "SSH"
    priority                   = 1001
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HTTP"
    priority                   = 1002
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "80"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "HTTPS"
    priority                   = 1003
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

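# Public IP (referenced by the vm_public_ip output below).
# Note: this resource and the two that follow are not shown in the original
# listing; they are a minimal assumption so the references resolve.
resource "azurerm_public_ip" "monitoring" {
  name                = "monitoring-public-ip"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name
  allocation_method   = "Static"
}

# Network interface (referenced by the virtual machine below)
resource "azurerm_network_interface" "monitoring" {
  name                = "monitoring-nic"
  location            = azurerm_resource_group.monitoring.location
  resource_group_name = azurerm_resource_group.monitoring.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.monitoring.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.monitoring.id
  }
}

# Attach the security group to the network interface
resource "azurerm_network_interface_security_group_association" "monitoring" {
  network_interface_id      = azurerm_network_interface.monitoring.id
  network_security_group_id = azurerm_network_security_group.monitoring.id
}

# Virtual machine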
resource "azurerm_linux_virtual_machine" "monitoring" {
  name                = "monitoring-vm"
  resource_group_name = azurerm_resource_group.monitoring.name
  location            = azurerm_resource_group.monitoring.location
  size                = var.vm_size
  admin_username      = "azureuser"
  network_interface_ids = [
    azurerm_network_interface.monitoring.id,
  ]

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

# Output the public IP
output "vm_public_ip" {
  value = azurerm_public_ip.monitoring.ip_address
}

2. CI/CD for Node.js Application

The Node.js application is deployed using a CI/CD pipeline with GitHub Actions:

# .github/workflows/main.yml
name: Node.js CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Use Node.js
      uses: actions/setup-node@v4
      with:
        node-version: '18.x'

    - name: Install dependencies
      run: npm ci

    - name: Lint code
      run: npm run lint

    - name: Run tests
      run: npm test

    - name: Build application
      run: npm run build

    - name: Package application
      run: |
        tar -czf app.tar.gz dist/ package.json package-lock.json

    - name: Upload artifact
      uses: actions/upload-artifact@v4
      with:
        name: app-package
        path: app.tar.gz

  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
    - uses: actions/download-artifact@v4
      with:
        name: app-package

    - name: Deploy to production
      uses: appleboy/ssh-action@master
      with:
        host: ${{ secrets.HOST }}
        username: ${{ secrets.USERNAME }}
        key: ${{ secrets.SSH_KEY }}
        script: |
          mkdir -p /opt/node-app
          # Placeholder path: the artifact is downloaded on the runner, so it must
          # be copied to the server (for example with scp) before this step runs.
          mv /path/to/app.tar.gz /opt/node-app/
          cd /opt/node-app
          tar -xzf app.tar.gz
          npm install --production
          pm2 restart app || pm2 start dist/index.js --name app

    - name: Update DORA metrics
      run: |
        curl -X POST https://dora.yourdomain.com/api/deployment \
          -H "Content-Type: application/json" \
          -d '{"service":"node-app","version":"${{ github.sha }}","status":"success"}'

3. Monitoring Stack Deployment with Ansible

Finally, deploy the monitoring stack using Ansible:

# Update inventory with VM IP from Terraform (run from the project root)
echo "monitor.yourdomain.com ansible_host=$(terraform -chdir=terraform output -raw vm_public_ip)" > monitoring_stack/inventory.ini

# Run the playbook
cd monitoring_stack
ansible-playbook -i inventory.ini playbook.yml

Troubleshooting Guide

Service Not Starting

Check the service status and logs:

systemctl status service_name
journalctl -u service_name -n 100

Common issues:

Permission problems: Check ownership of directories
Configuration errors: Validate config files
Port conflicts: Ensure required ports are available
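
For the port-conflict case, a quick way to see whether something is already listening is to try connecting to the port. The sketch below does this with Python's standard `socket` module. The port numbers are the defaults used by Prometheus (9090), Node Exporter (9100), and Blackbox Exporter (9115); the `port_in_use` helper is a made-up name for this example, so adjust the list if your setup differs.

```python
import socket

# Default ports of the services in this stack (adjust if you changed them).
STACK_PORTS = {
    "Prometheus": 9090,
    "Node Exporter": 9100,
    "Blackbox Exporter": 9115,
}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds (something is listening)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in STACK_PORTS.items():
        state = "in use" if port_in_use(port) else "free"
        print(f"{name:<18} port {port}: {state}")
```

A port reported as "in use" before a service has started points to a conflict; after a successful start, "in use" is the expected state.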

SSL Certificate Issues

If you encounter SSL certificate problems:

# Check certificate status
certbot certificates

# Test the renewal process without changing any certificates
certbot renew --dry-run

# Renew certificates that are close to expiry
certbot renew

Prometheus Not Scraping Metrics

Verify target accessibility:

curl http://localhost:9100/metrics
curl http://localhost:9115/metrics

Check Prometheus configuration:

promtool check config /etc/prometheus/prometheus.yml
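
Prometheus also records a built-in `up` metric for every scrape target (1 when the last scrape succeeded, 0 when it failed), which can be read through its HTTP query API. The sketch below is one way to list failing targets, assuming Prometheus listens on `http://localhost:9090`. The helper names and the sample payload are illustrative; the payload mirrors the shape of an instant-query response.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def query_up(base_url: str = "http://localhost:9090") -> Dict:
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': 'up'})}"
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def down_targets(payload: Dict) -> List[Tuple[str, str]]:
    """Return sorted (job, instance) pairs whose last scrape failed (up == 0)."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)


# Hand-written sample response; job and instance names are illustrative.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node_exporter", "instance": "localhost:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "blackbox", "instance": "https://example.com"},
             "value": [1700000000, "0"]},
        ],
    },
}

print(down_targets(SAMPLE_PAYLOAD))
```

Against a live server, `down_targets(query_up())` would return the same kind of list; the Targets page in the Prometheus web UI shows the same information along with the last scrape error.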

Conclusion

This monitoring stack provides a robust, enterprise-grade solution that gives you visibility into your infrastructure and applications. By leveraging Terraform for infrastructure provisioning, Ansible for configuration management, and integrating the DORA Metrics Collector, you can easily deploy and maintain this stack across multiple environments with minimal effort.

Key benefits:

Comprehensive Monitoring: System, network, and application metrics
Proactive Alerting: Immediate notification of issues
Secure Access: SSL encryption and user authentication
Automated Deployment: Consistent and reproducible setup
Scalable Architecture: Easy to extend for additional services

Contributing

Interested in contributing to this project? Check out the devops-monitoring-stack repository on GitHub. Your contributions, bug reports, and feature requests are welcome!

Building a Production-Grade Monitoring Stack with Ansible: A Complete Guide
