Why I Stopped Relying on Pingdom and Built My Own Monitoring Stack


As developers, we've all been there. Your application is running smoothly in production, users are happy, and then suddenly everything breaks. The worst part? You find out hours later when angry support tickets start flooding in. This exact scenario motivated me to build a comprehensive URL health monitoring system that could prevent such disasters.
The Problem
Most monitoring solutions are either too expensive for small teams or too simplistic for real-world needs. I wanted something that could:
Monitor multiple URLs continuously without manual intervention
Send intelligent alerts that don't spam my inbox
Provide historical data to identify patterns
Scale efficiently as we add more services
Integrate seamlessly with existing DevOps workflows
After evaluating existing solutions, I decided to build my own using Node.js, Redis, and a modern observability stack.
Architecture Decisions
Why BullMQ for Job Processing
The heart of any monitoring system is reliable job processing. I chose BullMQ over alternatives like Agenda or simple cron jobs for several reasons:
Persistence: Jobs survive server restarts
Observability: Built-in UI for monitoring job queues
Scalability: Easy horizontal scaling with multiple workers
Error Handling: Automatic retries and dead letter queues
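To make this concrete, here is a minimal sketch of how a health-check job might be enqueued with BullMQ; the queue name, connection settings, and retry policy are illustrative placeholders rather than the project's exact configuration.
const { Queue } = require("bullmq");

// Queue backed by Redis; jobs persist across restarts.
const healthCheckQueue = new Queue("health-checks", {
  connection: { host: "127.0.0.1", port: 6379 },
});

async function enqueueCheck(url) {
  // Failed checks are retried automatically with exponential backoff.
  await healthCheckQueue.add(
    "check-url",
    { url },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 }, removeOnComplete: true }
  );
}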
Redis as the Primary Datastore
While many might reach for PostgreSQL or MongoDB, Redis made perfect sense for this use case:
Speed: Sub-millisecond data retrieval
Simple Data Model: URL statuses fit perfectly in Redis data structures
Built-in Expiration: Automatic cleanup of old data
Queue Backend: BullMQ requires Redis anyway
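As a rough sketch of the data model, the latest result for each URL can live in a Redis hash with a TTL for automatic cleanup; the key layout and field names below are my own assumptions, not the project's actual schema.
const Redis = require("ioredis");
const redis = new Redis(); // defaults to 127.0.0.1:6379

async function saveResult(urlId, result) {
  const key = `url:${urlId}:latest`;
  await redis.hset(
    key,
    "status", result.status,           // e.g. "up" or "down"
    "httpCode", result.httpCode,       // last HTTP status code
    "responseTime", result.durationMs, // in milliseconds
    "checkedAt", Date.now()
  );
  // Built-in expiration: old entries clean themselves up.
  await redis.expire(key, 60 * 60 * 24 * 7); // keep for 7 days
}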
Implementation Highlights
Intelligent Alert System
One of the biggest challenges in monitoring is alert fatigue. Nobody wants to receive 50 emails when a service goes down for 10 minutes. My solution implements smart escalation:
const shouldAlert = (
  monitoredUrl.consecutiveFailures === 1 ||   // First failure
  monitoredUrl.consecutiveFailures === 3 ||   // After 3 consecutive
  monitoredUrl.consecutiveFailures % 10 === 0 // Every 10 failures
);
This approach ensures you're notified immediately when something breaks and then receive periodic reminders for ongoing issues without flooding your inbox.
Asynchronous Architecture
The system uses a clear separation between API requests and actual health checks:
API Layer: Handles user requests and configuration
Queue System: Manages job distribution and retry logic
Worker Processes: Execute actual HTTP checks
Scheduler: Ensures continuous monitoring via cron jobs
This design allows the system to handle hundreds of URLs without blocking user interactions.
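A stripped-down sketch of that split might look like the following, with a BullMQ repeatable job standing in for the scheduler and a worker performing the HTTP check; the names, intervals, and use of Node's built-in fetch are assumptions on my part.
const { Queue, Worker } = require("bullmq");

const connection = { host: "127.0.0.1", port: 6379 };
const healthCheckQueue = new Queue("health-checks", { connection });

// Scheduler: a repeatable job re-enqueues the check on a fixed interval.
async function scheduleUrl(url, intervalMs) {
  await healthCheckQueue.add("check-url", { url }, { repeat: { every: intervalMs } });
}

// Worker: runs the actual HTTP check, completely outside the API request path.
new Worker(
  "health-checks",
  async (job) => {
    const started = Date.now();
    const res = await fetch(job.data.url, { signal: AbortSignal.timeout(10000) }); // Node 18+
    return { httpCode: res.status, durationMs: Date.now() - started };
  },
  { connection }
);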
Express with Prometheus Integration
For the API layer, I went with Express.js enhanced with Prometheus metrics. This combination provides:
Familiar API: Standard REST endpoints for easy integration
Metrics Collection: Custom metrics for response times and failure rates
Grafana Integration: Beautiful dashboards out of the box
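Wiring this up is mostly boilerplate; a minimal sketch with the prom-client package might look like this (the metric names and port are illustrative):
const express = require("express");
const client = require("prom-client");

client.collectDefaultMetrics(); // standard Node.js process metrics

// Custom histogram for check response times.
const checkDuration = new client.Histogram({
  name: "url_check_duration_seconds",
  help: "Duration of URL health checks",
  labelNames: ["url", "outcome"],
});

const app = express();
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);
Prometheus then scrapes the /metrics endpoint on a schedule, and Grafana builds its dashboards on top of that data.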
Real-time Data Visualization
The frontend dashboard uses Chart.js to display response time trends. The implementation refreshes data automatically and provides immediate visual feedback:
window.myChart = new Chart(ctx, {
  type: "line",
  data: {
    labels: labels,
    datasets: [{
      label: "Response Time (ms)",
      data: durations,
      borderColor: "#3b82f6",
      tension: 0.3,
      pointBackgroundColor: durations.map((d) =>
        d > 0 ? "#3b82f6" : "#ef4444"
      ),
    }],
  },
});
Key Features Breakdown
Flexible Monitoring Configuration
Each monitored URL can be configured independently:
Check Intervals: From 2 to 30 minutes
Expected Status Codes: Define what "healthy" means
Custom Alert Emails: Route alerts to the right team
Tagging System: Organize URLs by service or environment
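The post doesn't show the exact schema, but a configuration entry along these lines captures the idea (the field names here are hypothetical):
const monitoredUrl = {
  url: "https://api.example.com/health",
  intervalMinutes: 5,              // anywhere from 2 to 30 minutes
  expectedStatusCodes: [200, 204], // defines what "healthy" means for this endpoint
  alertEmails: ["oncall@example.com"],
  tags: ["payments", "production"],
};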
Comprehensive Observability
The system exposes Prometheus metrics for integration with existing monitoring infrastructure:
Response time histograms
Success/failure counters
Queue processing statistics
Standard Node.js metrics
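Beyond the histogram shown earlier, the counters and queue statistics could be exposed along these lines (again, metric names are illustrative, and the queue-depth helper assumes the BullMQ queue from the worker sketch):
const client = require("prom-client");

const checkResults = new client.Counter({
  name: "url_check_total",
  help: "Total URL checks, labelled by outcome",
  labelNames: ["outcome"], // "success" or "failure"
});

const queueDepth = new client.Gauge({
  name: "health_check_queue_depth",
  help: "Jobs currently waiting in the health-check queue",
});

// Called from the worker after each check, and periodically for queue stats.
function recordOutcome(ok) {
  checkResults.inc({ outcome: ok ? "success" : "failure" });
}

async function updateQueueMetrics(queue) {
  queueDepth.set(await queue.getWaitingCount()); // BullMQ Queue#getWaitingCount()
}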
Email Alert System
Built on Nodemailer with Gmail integration, the alert system sends rich HTML emails containing:
URL status and error details
Response times and HTTP status codes
Consecutive failure counts
Recovery notifications
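A minimal version of that transport and alert email, assuming Gmail credentials are supplied via environment variables (the variable names and template are placeholders):
const nodemailer = require("nodemailer");

const transporter = nodemailer.createTransport({
  service: "gmail",
  auth: { user: process.env.ALERT_EMAIL, pass: process.env.ALERT_EMAIL_PASSWORD },
});

async function sendDownAlert(monitoredUrl, result) {
  await transporter.sendMail({
    from: process.env.ALERT_EMAIL,
    to: monitoredUrl.alertEmails.join(","),
    subject: `ALERT: ${monitoredUrl.url} is down`,
    html: `
      <p><strong>${monitoredUrl.url}</strong> failed its health check.</p>
      <p>HTTP status: ${result.httpCode || "no response"} | Response time: ${result.durationMs} ms</p>
      <p>Consecutive failures: ${monitoredUrl.consecutiveFailures}</p>
    `,
  });
}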
Docker Compose for Easy Deployment
The entire stack runs with a single docker compose up command.
services:
  redis:
    image: redis:6
  prometheus:
    image: prom/prometheus
  grafana:
    image: grafana/grafana
  backend:
    build: .
This setup includes everything needed for production deployment: the application, Redis for data storage, Prometheus for metrics collection, and Grafana for visualization.
CI/CD Integration
The GitLab CI pipeline ensures code quality and reliability.
Linting: ESLint enforces consistent code style
Testing: Jest runs comprehensive test suites
Coverage: Tracks test coverage for quality assurance
Deployment: Automated deployment on successful builds
Lessons Learned
Error Handling is Critical
URL monitoring involves dealing with numerous failure modes: network timeouts, DNS resolution failures, server errors, and certificate issues. Robust error handling and logging made debugging production issues much easier.
Alert Fatigue is Real
My first implementation sent an email for every failure. Within a day of monitoring a flaky staging environment, I had hundreds of emails. The progressive alerting system was a game-changer.
Observability from Day One
Adding Prometheus metrics early paid dividends. Being able to visualize queue depth, processing times, and failure rates in Grafana helped optimize the system before performance became an issue.
Have you built similar monitoring solutions or faced production outage challenges? I'd love to hear about your experiences in the comments. The complete source code is available in the repository - contributions and feedback are always welcome.