Why I Stopped Relying on Pingdom and Built My Own Monitoring Stack


As developers, we've all been there. Your application is running smoothly in production, users are happy, and then suddenly everything breaks. The worst part? You find out hours later when angry support tickets start flooding in. This exact scenario motivated me to build a comprehensive URL health monitoring system that could prevent such disasters.
The Problem
Most monitoring solutions are either too expensive for small teams or too simplistic for real-world needs. I wanted something that could:
Monitor multiple URLs continuously without manual intervention
Send intelligent alerts that don't spam my inbox
Provide historical data to identify patterns
Scale efficiently as we add more services
Integrate seamlessly with existing DevOps workflows
After evaluating existing solutions, I decided to build my own using Node.js, Redis, and a modern observability stack.
Architecture Decisions
Why BullMQ for Job Processing
The heart of any monitoring system is reliable job processing. I chose BullMQ over alternatives like Agenda or simple cron jobs for several reasons:
Persistence: Jobs survive server restarts
Observability: Built-in UI for monitoring job queues
Scalability: Easy horizontal scaling with multiple workers
Error Handling: Automatic retries and dead letter queues
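To make this concrete, here is a minimal sketch of how a health-check job might be enqueued with BullMQ; the queue name, connection settings, and retry policy are illustrative placeholders rather than the project's exact configuration.
const { Queue } = require("bullmq");

// Queue backed by Redis; jobs persist across restarts.
const healthCheckQueue = new Queue("health-checks", {
  connection: { host: "127.0.0.1", port: 6379 },
});

async function enqueueCheck(url) {
  // Failed checks are retried automatically with exponential backoff.
  await healthCheckQueue.add(
    "check-url",
    { url },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 }, removeOnComplete: true }
  );
}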
Redis as the Primary Datastore
While many might reach for PostgreSQL or MongoDB, Redis made perfect sense for this use case:
Speed: Sub-millisecond data retrieval
Simple Data Model: URL statuses fit perfectly in Redis data structures
Built-in Expiration: Automatic cleanup of old data
Queue Backend: BullMQ requires Redis anyway
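As a rough sketch of the data model, the latest result for each URL can live in a Redis hash with a TTL for automatic cleanup; the key layout and field names below are my own assumptions, not the project's actual schema.
const Redis = require("ioredis");
const redis = new Redis(); // defaults to 127.0.0.1:6379

async function saveResult(urlId, result) {
  const key = `url:${urlId}:latest`;
  await redis.hset(
    key,
    "status", result.status,           // e.g. "up" or "down"
    "httpCode", result.httpCode,       // last HTTP status code
    "responseTime", result.durationMs, // in milliseconds
    "checkedAt", Date.now()
  );
  // Built-in expiration: old entries clean themselves up.
  await redis.expire(key, 60 * 60 * 24 * 7); // keep for 7 days
}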
Implementation Highlights
Intelligent Alert System
One of the biggest challenges in monitoring is alert fatigue. Nobody wants to receive 50 emails when a service goes down for 10 minutes. My solution implements smart escalation:
const shouldAlert = (
  monitoredUrl.consecutiveFailures === 1 ||   // First failure
  monitoredUrl.consecutiveFailures === 3 ||   // After 3 consecutive
  monitoredUrl.consecutiveFailures % 10 === 0 // Every 10 failures
);
This approach ensures you're notified immediately when something breaks and then receive periodic reminders for ongoing issues without flooding your inbox.
Asynchronous Architecture
The system uses a clear separation between API requests and actual health checks:
API Layer: Handles user requests and configuration
Queue System: Manages job distribution and retry logic
Worker Processes: Execute actual HTTP checks
Scheduler: Ensures continuous monitoring via cron jobs
This design allows the system to handle hundreds of URLs without blocking user interactions.
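A stripped-down sketch of that split might look like the following, with a BullMQ repeatable job standing in for the scheduler and a worker performing the HTTP check; the names, intervals, and use of Node's built-in fetch are assumptions on my part.
const { Queue, Worker } = require("bullmq");

const connection = { host: "127.0.0.1", port: 6379 };
const healthCheckQueue = new Queue("health-checks", { connection });

// Scheduler: a repeatable job re-enqueues the check on a fixed interval.
async function scheduleUrl(url, intervalMs) {
  await healthCheckQueue.add("check-url", { url }, { repeat: { every: intervalMs } });
}

// Worker: runs the actual HTTP check, completely outside the API request path.
new Worker(
  "health-checks",
  async (job) => {
    const started = Date.now();
    const res = await fetch(job.data.url, { signal: AbortSignal.timeout(10000) }); // Node 18+
    return { httpCode: res.status, durationMs: Date.now() - started };
  },
  { connection }
);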
Express with Prometheus Integration
For the API layer, I went with Express.js enhanced with Prometheus metrics. This combination provides:
Familiar API: Standard REST endpoints for easy integration
Metrics Collection: Custom metrics for response times and failure rates
Grafana Integration: Beautiful dashboards out of the box
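Wiring this up is mostly boilerplate; a minimal sketch with the prom-client package might look like this (the metric names and port are illustrative):
const express = require("express");
const client = require("prom-client");

client.collectDefaultMetrics(); // standard Node.js process metrics

// Custom histogram for check response times.
const checkDuration = new client.Histogram({
  name: "url_check_duration_seconds",
  help: "Duration of URL health checks",
  labelNames: ["url", "outcome"],
});

const app = express();
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);
Prometheus then scrapes the /metrics endpoint on a schedule, and Grafana builds its dashboards on top of that data.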
Real-time Data Visualization
The frontend dashboard uses Chart.js to display response time trends. The implementation refreshes data automatically and provides immediate visual feedback:
window.myChart = new Chart(ctx, {
  type: "line",
  data: {
    labels: labels,
    datasets: [{
      label: "Response Time (ms)",
      data: durations,
      borderColor: "#3b82f6",
      tension: 0.3,
      pointBackgroundColor: durations.map((d) =>
        d > 0 ? "#3b82f6" : "#ef4444"
      ),
    }],
  },
});
Key Features Breakdown
Flexible Monitoring Configuration
Each monitored URL can be configured independently:
Check Intervals: From 2 to 30 minutes
Expected Status Codes: Define what "healthy" means
Custom Alert Emails: Route alerts to the right team
Tagging System: Organize URLs by service or environment
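The post doesn't show the exact schema, but a configuration entry along these lines captures the idea (the field names here are hypothetical):
const monitoredUrl = {
  url: "https://api.example.com/health",
  intervalMinutes: 5,              // anywhere from 2 to 30 minutes
  expectedStatusCodes: [200, 204], // defines what "healthy" means for this endpoint
  alertEmails: ["oncall@example.com"],
  tags: ["payments", "production"],
};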
Comprehensive Observability
The system exposes Prometheus metrics for integration with existing monitoring infrastructure:
Response time histograms
Success/failure counters
Queue processing statistics
Standard Node.js metrics
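Beyond the histogram shown earlier, the counters and queue statistics could be exposed along these lines (again, metric names are illustrative, and the queue-depth helper assumes the BullMQ queue from the worker sketch):
const client = require("prom-client");

const checkResults = new client.Counter({
  name: "url_check_total",
  help: "Total URL checks, labelled by outcome",
  labelNames: ["outcome"], // "success" or "failure"
});

const queueDepth = new client.Gauge({
  name: "health_check_queue_depth",
  help: "Jobs currently waiting in the health-check queue",
});

// Called from the worker after each check, and periodically for queue stats.
function recordOutcome(ok) {
  checkResults.inc({ outcome: ok ? "success" : "failure" });
}

async function updateQueueMetrics(queue) {
  queueDepth.set(await queue.getWaitingCount()); // BullMQ Queue#getWaitingCount()
}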
Email Alert System
Built on Nodemailer with Gmail integration, the alert system sends rich HTML emails containing:
URL status and error details
Response times and HTTP status codes
Consecutive failure counts
Recovery notifications
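A minimal version of that transport and alert email, assuming Gmail credentials are supplied via environment variables (the variable names and template are placeholders):
const nodemailer = require("nodemailer");

const transporter = nodemailer.createTransport({
  service: "gmail",
  auth: { user: process.env.ALERT_EMAIL, pass: process.env.ALERT_EMAIL_PASSWORD },
});

async function sendDownAlert(monitoredUrl, result) {
  await transporter.sendMail({
    from: process.env.ALERT_EMAIL,
    to: monitoredUrl.alertEmails.join(","),
    subject: `ALERT: ${monitoredUrl.url} is down`,
    html: `
      <p><strong>${monitoredUrl.url}</strong> failed its health check.</p>
      <p>HTTP status: ${result.httpCode || "no response"} | Response time: ${result.durationMs} ms</p>
      <p>Consecutive failures: ${monitoredUrl.consecutiveFailures}</p>
    `,
  });
}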
Docker Compose for Easy Deployment
The entire stack runs with a single docker compose up command.
services:
  redis:
    image: redis:6
  prometheus:
    image: prom/prometheus
  grafana:
    image: grafana/grafana
  backend:
    build: .
This setup includes everything needed for production deployment: the application, Redis for data storage, Prometheus for metrics collection, and Grafana for visualization.
CI/CD Integration
The GitLab CI pipeline ensures code quality and reliability.
Linting: ESLint enforces consistent code style
Testing: Jest runs comprehensive test suites
Coverage: Tracks test coverage for quality assurance
Deployment: Automated deployment on successful builds
Lessons Learned
Error Handling is Critical
URL monitoring involves dealing with numerous failure modes: network timeouts, DNS resolution failures, server errors, and certificate issues. Robust error handling and logging made debugging production issues much easier.
Alert Fatigue is Real
My first implementation sent an email for every failure. Within a day of monitoring a flaky staging environment, I had hundreds of emails. The progressive alerting system was a game-changer.
Observability from Day One
Adding Prometheus metrics early paid dividends. Being able to visualize queue depth, processing times, and failure rates in Grafana helped optimize the system before performance became an issue.
Have you built similar monitoring solutions or faced production outage challenges? I'd love to hear about your experiences in the comments. The complete source code is available in the repository - contributions and feedback are always welcome.