Why I Stopped Relying on Pingdom and Built My Own Monitoring Stack

Karan Sharma
4 min read

As developers, we've all been there. Your application is running smoothly in production, users are happy, and then, suddenly, everything breaks. The worst part? You find out hours later, when angry support tickets start flooding in. This exact scenario motivated me to build a URL health monitoring system that detects such failures before users report them.

The Problem

Most monitoring solutions are either too expensive for small teams or too simplistic for real-world needs. I wanted something that could:

  • Monitor multiple URLs continuously without manual intervention

  • Send intelligent alerts that don't spam my inbox

  • Provide historical data to identify patterns

  • Scale efficiently as we add more services

  • Integrate seamlessly with existing DevOps workflows

After evaluating existing solutions, I decided to build my own using Node.js, BullMQ, Redis, and an observability stack built on Prometheus and Grafana.

Architecture Decisions

Why BullMQ for Job Processing

The heart of any monitoring system is reliable job processing. I chose BullMQ over alternatives like Agenda or simple cron jobs for several reasons:

  • Persistence: Jobs survive server restarts

  • Observability: Queue state (waiting, active, and failed jobs) lives in Redis and can be inspected with companion dashboards such as Bull Board

  • Scalability: Easy horizontal scaling with multiple workers

  • Error Handling: Automatic retries with configurable backoff; jobs that exhaust their attempts are kept in a failed set for inspection
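To make the retry point concrete, here is a small sketch. The job options use BullMQ's standard `attempts` and `backoff` settings with illustrative values, and the helper prints the resulting retry schedule without needing Redis. It assumes the exponential strategy waits `delay * 2^(n-1)` milliseconds before the n-th retry, which is how BullMQ's documentation describes it; `retryDelays` itself is a hypothetical helper, not part of BullMQ.

```javascript
// Illustrative options for a hypothetical "check-url" job.
// `attempts`, `backoff` and `removeOnComplete` are standard BullMQ job options.
const checkJobOptions = {
  attempts: 3, // total tries, including the first one
  backoff: { type: "exponential", delay: 5_000 }, // first retry after 5 s
  removeOnComplete: 100, // keep only the last 100 completed jobs in Redis
};

// Dependency-free helper: the wait before the n-th retry (1-based)
// is delay * 2^(n - 1), so delays double after every failed attempt.
function retryDelays({ attempts, backoff }) {
  const delays = [];
  for (let n = 1; n < attempts; n++) {
    delays.push(backoff.delay * 2 ** (n - 1));
  }
  return delays;
}

console.log(retryDelays(checkJobOptions)); // [ 5000, 10000 ]
```

In a real setup these options would be passed as the third argument of `queue.add("check-url", { url }, checkJobOptions)`; the queue and job names here are placeholders.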

Redis as the Primary Datastore

While many might reach for PostgreSQL or MongoDB, Redis made perfect sense for this use case:

  • Speed: Sub-millisecond data retrieval

  • Simple Data Model: URL statuses fit perfectly in Redis data structures

  • Built-in Expiration: Automatic cleanup of old data

  • Queue Backend: BullMQ requires Redis anyway
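Here is a sketch of how those points could map onto Redis commands: the latest status in a hash for fast dashboard reads, a capped history list, and a TTL so Redis cleans up after URLs that stop being checked. The key names, the 500-entry cap and the 7-day TTL are hypothetical choices, and the function only builds command arrays, so it runs without a Redis server.

```javascript
// Hypothetical key layout and retention settings.
const HISTORY_LIMIT = 500;
const HISTORY_TTL_SECONDS = 7 * 24 * 60 * 60; // 7 days

// Build the commands that record one check result for one URL.
function buildStatusCommands(urlId, result) {
  const statusKey = `monitor:${urlId}:status`;
  const historyKey = `monitor:${urlId}:history`;
  return [
    // Latest status as a hash (multi-field HSET needs Redis 4.0+).
    ["hset", statusKey,
      "healthy", String(result.healthy),
      "statusCode", String(result.statusCode ?? ""),
      "durationMs", String(result.durationMs),
      "checkedAt", String(result.checkedAt)],
    // Newest-first history, trimmed so the list cannot grow without bound.
    ["lpush", historyKey, JSON.stringify(result)],
    ["ltrim", historyKey, "0", String(HISTORY_LIMIT - 1)],
    // Refreshed on every write, so only idle URLs ever expire.
    ["expire", historyKey, String(HISTORY_TTL_SECONDS)],
  ];
}
```

With ioredis, the array can be sent as a single transaction via `redis.multi(buildStatusCommands(id, result)).exec()`.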

Implementation Highlights

Intelligent Alert System

One of the biggest challenges in monitoring is alert fatigue. Nobody wants to receive 50 emails when a service goes down for 10 minutes. My solution implements smart escalation:

const shouldAlert = (
  monitoredUrl.consecutiveFailures === 1 || // First failure
  monitoredUrl.consecutiveFailures === 3 || // After 3 consecutive
  monitoredUrl.consecutiveFailures % 10 === 0 // Every 10 failures
);

This approach notifies you immediately when something breaks, then sends periodic reminders for ongoing issues without flooding your inbox.
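To see what these thresholds mean in practice, the sketch below wraps the same expression in a function and adds a hypothetical state transition that also emits a recovery notice. In the simulated outage (five failed checks, roughly 10 minutes at a 2-minute interval), it sends two failure alerts and one recovery message instead of five emails. The `> 0` guard is an addition: because `0 % 10 === 0`, the original expression would also fire for a URL with zero failures if it were ever evaluated after a successful check. `applyCheckResult` is an illustrative name, not taken from the project.

```javascript
// Same thresholds as the expression above: 1st, 3rd and every 10th
// consecutive failure. The `> 0` guard keeps a healthy URL from alerting.
function shouldAlert(consecutiveFailures) {
  return (
    consecutiveFailures > 0 &&
    (consecutiveFailures === 1 ||
      consecutiveFailures === 3 ||
      consecutiveFailures % 10 === 0)
  );
}

// Hypothetical state transition applied after every check.
function applyCheckResult(consecutiveFailures, healthy) {
  if (healthy) {
    // A recovery notice only makes sense if the URL was failing before.
    return { consecutiveFailures: 0, alert: consecutiveFailures > 0 ? "recovery" : null };
  }
  const next = consecutiveFailures + 1;
  return { consecutiveFailures: next, alert: shouldAlert(next) ? "failure" : null };
}

// Simulated outage: five failed checks, then a successful one.
let failures = 0;
const sent = [];
for (const healthy of [false, false, false, false, false, true]) {
  const result = applyCheckResult(failures, healthy);
  failures = result.consecutiveFailures;
  if (result.alert) sent.push(result.alert);
}
console.log(sent); // [ 'failure', 'failure', 'recovery' ]
```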

Asynchronous Architecture

The system uses a clear separation between API requests and actual health checks:

  1. API Layer: Handles user requests and configuration

  2. Queue System: Manages job distribution and retry logic

  3. Worker Processes: Execute actual HTTP checks

  4. Scheduler: Ensures continuous monitoring via cron jobs

This design allows the system to handle hundreds of URLs without blocking user interactions.
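As an illustration of what a worker process might do for a single job, here is a minimal sketch of the HTTP check. It assumes Node.js 18+ (global `fetch`, `AbortSignal.timeout`, `performance`), and the job-data shape `{ url, expectedStatusCodes, timeoutMs }` is hypothetical. The fetch implementation is injectable so the check can be exercised offline.

```javascript
// Sketch of one health check; returns a plain result object instead of throwing.
async function runHealthCheck(
  { url, expectedStatusCodes = [200], timeoutMs = 10_000 },
  fetchImpl = globalThis.fetch // injectable for testing
) {
  const started = performance.now();
  const elapsed = () => Math.round(performance.now() - started);
  try {
    const res = await fetchImpl(url, {
      redirect: "follow",
      // Abort slow responses instead of letting the job hang.
      signal: AbortSignal.timeout(timeoutMs),
    });
    // Only status and timing matter; discard the body to free the connection.
    await res.body?.cancel();
    return {
      healthy: expectedStatusCodes.includes(res.status),
      statusCode: res.status,
      durationMs: elapsed(),
      error: null,
    };
  } catch (err) {
    // DNS failures, refused connections, TLS errors and timeouts all land here.
    return { healthy: false, statusCode: null, durationMs: elapsed(), error: err.message };
  }
}
```

Wired into BullMQ, a worker could be as small as `new Worker("url-checks", (job) => runHealthCheck(job.data), { connection })`, with the queue name being a placeholder; the returned object becomes the job's return value.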

Express with Prometheus Integration

For the API layer, I went with Express.js enhanced with Prometheus metrics. This combination provides:

  • Familiar API: Standard REST endpoints for easy integration

  • Metrics Collection: Custom metrics for response times and failure rates

  • Grafana Integration: The same metrics can be charted in Grafana dashboards
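One way to collect API latency is a small middleware that times each request and hands the result to a metric. The sketch below is dependency-free: `observe(labels, seconds)` is a hypothetical callback that, with prom-client, could forward to a `Histogram`'s `observe(labels, value)`.

```javascript
// Express-style middleware factory that times every API request.
function requestTimer(observe) {
  return (req, res, next) => {
    const start = process.hrtime.bigint();
    // 'finish' is emitted once the response has been fully handed off.
    res.on("finish", () => {
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      observe(
        {
          method: req.method,
          // The matched route pattern (e.g. "/urls/:id") keeps label values
          // bounded; unmatched requests share a single label value.
          route: req.route?.path ?? "unmatched",
          status: String(res.statusCode),
        },
        seconds
      );
    });
    next();
  };
}
```

Usage would look like `app.use(requestTimer((labels, s) => httpDuration.observe(labels, s)))`, where `httpDuration` is a hypothetical prom-client histogram declared with `labelNames: ["method", "route", "status"]`. Labelling by route pattern rather than raw path avoids one time series per URL.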


Real-time Data Visualization

The frontend dashboard uses Chart.js to display response time trends. The implementation refreshes data automatically and provides immediate visual feedback:


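// `ctx` is the 2D context of the dashboard's <canvas>; `labels` and
// `durations` are built from the stored check history.
// Chart.js (v3+) refuses to draw a new chart on a canvas that is still in
// use, so tear down the previous instance before re-creating it.
if (window.myChart) {
  window.myChart.destroy();
}

window.myChart = new Chart(ctx, {
  type: "line",
  data: {
    labels: labels,
    datasets: [{
      label: "Response Time (ms)",
      data: durations,
      borderColor: "#3b82f6",
      tension: 0.3,
      // Blue points for positive durations, red otherwise (e.g. failed checks).
      pointBackgroundColor: durations.map((d) =>
        d > 0 ? "#3b82f6" : "#ef4444"
      ),
    }],
  },
});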
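A sketch of the refresh step, assuming each history record looks like `{ checkedAt, durationMs, healthy }` (a hypothetical shape) and that failed checks are plotted as 0 so the red point colour applies. Updating the existing chart's data and calling Chart.js's `chart.update()` avoids rebuilding the whole chart on every refresh. `toChartSeries` and `refreshChart` are illustrative names.

```javascript
// Turn stored check history into the arrays the chart needs.
function toChartSeries(history) {
  const labels = history.map((h) => new Date(h.checkedAt).toLocaleTimeString());
  // Failed checks are plotted as 0 so they get the red point colour.
  const durations = history.map((h) => (h.healthy ? h.durationMs : 0));
  return { labels, durations };
}

// Refresh an existing Chart.js chart in place.
function refreshChart(chart, history) {
  const { labels, durations } = toChartSeries(history);
  chart.data.labels = labels;
  chart.data.datasets[0].data = durations;
  chart.data.datasets[0].pointBackgroundColor = durations.map((d) =>
    d > 0 ? "#3b82f6" : "#ef4444"
  );
  chart.update(); // re-render with the new data
}
```

Calling `refreshChart(window.myChart, history)` from a `setInterval` callback that fetches fresh history would give the automatic refresh; the polling interval and the history endpoint are up to the dashboard.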

Key Features Breakdown

Flexible Monitoring Configuration

Each monitored URL can be configured independently:

  • Check Intervals: From 2 to 30 minutes

  • Expected Status Codes: Define what "healthy" means

  • Custom Alert Emails: Route alerts to the right team

  • Tagging System: Organize URLs by service or environment
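A per-URL configuration like this benefits from validation at the API layer. The sketch below is a hypothetical validator: the field names (`url`, `intervalMinutes`, `expectedStatusCodes`, `alertEmails`, `tags`) are assumptions, and only the 2 to 30 minute range comes from the list above. It returns a list of error messages, empty when the config is valid.

```javascript
const MIN_INTERVAL_MINUTES = 2;
const MAX_INTERVAL_MINUTES = 30;

// Hypothetical validator for one monitored-URL config.
function validateMonitorConfig(cfg) {
  const errors = [];
  try {
    const { protocol } = new URL(cfg.url);
    if (protocol !== "http:" && protocol !== "https:") errors.push("url must use http or https");
  } catch {
    errors.push("url is not a valid absolute URL");
  }
  const minutes = cfg.intervalMinutes;
  if (!Number.isInteger(minutes) || minutes < MIN_INTERVAL_MINUTES || minutes > MAX_INTERVAL_MINUTES) {
    errors.push(`intervalMinutes must be a whole number from ${MIN_INTERVAL_MINUTES} to ${MAX_INTERVAL_MINUTES}`);
  }
  // Defaults to "200 means healthy" when no codes are given.
  const codes = cfg.expectedStatusCodes ?? [200];
  if (!Array.isArray(codes) || codes.length === 0 || codes.some((c) => !Number.isInteger(c) || c < 100 || c > 599)) {
    errors.push("expectedStatusCodes must be a non-empty list of HTTP status codes");
  }
  // Deliberately loose e-mail check: something@something.
  const emails = cfg.alertEmails ?? [];
  if (!Array.isArray(emails) || emails.some((e) => !/^[^\s@]+@[^\s@]+$/.test(e))) {
    errors.push("alertEmails must be a list of e-mail addresses");
  }
  const tags = cfg.tags ?? [];
  if (!Array.isArray(tags) || tags.some((t) => typeof t !== "string" || t.trim() === "")) {
    errors.push("tags must be non-empty strings");
  }
  return errors;
}
```

A valid `intervalMinutes` could then be converted to milliseconds (`intervalMinutes * 60_000`) for a scheduler, for example BullMQ's `repeat: { every }` job option.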

Comprehensive Observability

The system exposes Prometheus metrics for integration with existing monitoring infrastructure:

  • Response time histograms

  • Success/failure counters

  • Queue processing statistics

  • Standard Node.js metrics
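To show what a response time histogram actually tracks, here is a dependency-free sketch that keeps cumulative bucket counts and renders them in Prometheus's text exposition format. The metric name and bucket bounds are hypothetical; in the real service a library such as prom-client does this bookkeeping.

```javascript
// Minimal histogram: cumulative buckets plus _sum and _count series.
class SimpleHistogram {
  constructor(name, bounds) {
    this.name = name;
    this.bounds = [...bounds].sort((a, b) => a - b); // upper bounds ("le"), in seconds
    this.bucketCounts = new Array(this.bounds.length).fill(0);
    this.sum = 0;
    this.count = 0;
  }

  observe(seconds) {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: a sample counts in every bucket whose bound it does not exceed.
    this.bounds.forEach((le, i) => {
      if (seconds <= le) this.bucketCounts[i] += 1;
    });
  }

  render() {
    const lines = [`# TYPE ${this.name} histogram`];
    this.bounds.forEach((le, i) => lines.push(`${this.name}_bucket{le="${le}"} ${this.bucketCounts[i]}`));
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`); // +Inf always equals the total count
    lines.push(`${this.name}_sum ${this.sum}`, `${this.name}_count ${this.count}`);
    return lines.join("\n");
  }
}

const checkDuration = new SimpleHistogram("url_check_duration_seconds", [0.1, 0.5, 1, 5]);
[0.0625, 0.25, 2].forEach((s) => checkDuration.observe(s));
console.log(checkDuration.render());
```

Because the buckets are cumulative, Grafana can estimate percentiles from them with PromQL such as `histogram_quantile(0.95, sum by (le) (rate(url_check_duration_seconds_bucket[5m])))`.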

Email Alert System

Built on Nodemailer with Gmail integration, the alert system sends rich HTML emails containing:

  • URL status and error details

  • Response times and HTTP status codes

  • Consecutive failure counts

  • Recovery notifications
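A sketch of how such an email could be assembled. The fields on `check` are hypothetical, and the returned object uses the `to`, `subject`, `html` and `text` message fields that Nodemailer's `transporter.sendMail()` accepts. Values are HTML-escaped because URLs and error messages come from outside the system.

```javascript
const HTML_ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
const escapeHtml = (value) => String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);

// Hypothetical builder for failure and recovery alert messages.
function buildAlertEmail(check, to) {
  const recovered = check.healthy;
  const subject = recovered
    ? `[RECOVERED] ${check.url}`
    : `[DOWN] ${check.url} (${check.consecutiveFailures} consecutive failures)`;
  const rows = [
    ["URL", check.url],
    ["Status", recovered ? "Healthy" : "Unhealthy"],
    ["HTTP status", check.statusCode ?? "n/a"],
    ["Response time", `${check.durationMs} ms`],
    ["Error", check.error ?? "none"],
    ["Consecutive failures", check.consecutiveFailures],
  ];
  const html = `<table>${rows
    .map(([k, v]) => `<tr><th align="left">${escapeHtml(k)}</th><td>${escapeHtml(v)}</td></tr>`)
    .join("")}</table>`;
  // Plain-text fallback for clients that do not render HTML.
  const text = rows.map(([k, v]) => `${k}: ${v}`).join("\n");
  return { to, subject, html, text };
}
```

Sending would then be a matter of `transporter.sendMail({ from, ...buildAlertEmail(check, to) })`, with the transporter created by `nodemailer.createTransport({ service: "gmail", auth: { user, pass } })`.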

Docker Compose for Easy Deployment

The entire stack starts with a single command (docker compose up):

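services:
  redis:
    image: redis:6
  prometheus:
    image: prom/prometheus
    volumes:
      # Assumes a prometheus.yml next to this file that scrapes the backend
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
  backend:
    build: .
    depends_on:
      - redis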

This setup brings up the whole stack: the application, Redis for data storage, Prometheus for metrics collection, and Grafana for visualization.

CI/CD Integration

The GitLab CI pipeline ensures code quality and reliability.

  • Linting: ESLint enforces consistent code style

  • Testing: Jest runs comprehensive test suites

  • Coverage: Tracks test coverage for quality assurance

  • Deployment: Automated deployment on successful builds

Lessons Learned

Error Handling is Critical

URL monitoring involves dealing with numerous failure modes: network timeouts, DNS resolution failures, server errors, and certificate issues. Robust error handling and logging made debugging production issues much easier.
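One pattern that helps here is mapping low-level errors onto a few alert-friendly categories. This sketch assumes checks use Node's built-in fetch (undici), which rejects with `TypeError("fetch failed")` and puts the underlying system error on `err.cause`; the category names and `classifyCheckError`/`classifyStatus` helpers are hypothetical.

```javascript
// Node/OpenSSL/undici error codes mapped to hypothetical categories.
const CATEGORY_BY_CODE = {
  ENOTFOUND: "dns",
  EAI_AGAIN: "dns",
  ECONNREFUSED: "connection",
  ECONNRESET: "connection",
  ETIMEDOUT: "timeout",
  UND_ERR_CONNECT_TIMEOUT: "timeout",
  CERT_HAS_EXPIRED: "tls",
  DEPTH_ZERO_SELF_SIGNED_CERT: "tls",
  UNABLE_TO_VERIFY_LEAF_SIGNATURE: "tls",
  ERR_TLS_CERT_ALTNAME_INVALID: "tls",
};

function classifyCheckError(err) {
  // AbortSignal.timeout() rejects with a TimeoutError; manual aborts use AbortError.
  if (err?.name === "TimeoutError" || err?.name === "AbortError") return "timeout";
  // undici wraps the system error, so look at err.cause as well.
  const code = err?.code ?? err?.cause?.code;
  return CATEGORY_BY_CODE[code] ?? "unknown";
}

// Responses that arrived but were not healthy.
function classifyStatus(statusCode, expected = [200]) {
  if (expected.includes(statusCode)) return "ok";
  return statusCode >= 500 ? "server_error" : "unexpected_status";
}
```

Logging the category next to the raw message makes it easy to tell a DNS outage from an expired certificate at a glance, and the category can double as a metric label.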

Alert Fatigue is Real

My first implementation sent an email for every failure. Within a day of monitoring a flaky staging environment, I had hundreds of emails. The progressive alerting system was a game-changer.

Observability from Day One

Adding Prometheus metrics early paid dividends. Being able to visualize queue depth, processing times, and failure rates in Grafana helped optimize the system before performance became an issue.
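Queue depth is a good example of such a metric. The sketch below periodically copies job counts into a gauge. It expects `queue` to be a BullMQ `Queue` (whose `getJobCounts(...types)` returns counts per state), while `setGauge` is a hypothetical callback, for instance wrapping a prom-client gauge's `set(labels, value)`.

```javascript
// Copy per-state job counts from the queue into a gauge-like callback.
async function sampleQueueDepth(queue, setGauge) {
  const counts = await queue.getJobCounts("waiting", "active", "delayed", "failed");
  for (const [state, value] of Object.entries(counts)) {
    setGauge({ state }, value);
  }
  return counts;
}
```

A `setInterval(() => sampleQueueDepth(queue, (labels, v) => queueJobs.set(labels, v)), 15_000)` would keep a hypothetical `queueJobs` gauge (declared with `labelNames: ["state"]`) current; a steadily growing `waiting` count is an early sign that more workers are needed.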

Have you built similar monitoring solutions or faced production outage challenges? I'd love to hear about your experiences in the comments. The complete source code is available in the repository - contributions and feedback are always welcome.
