Smart Adaptive Proxmox Node Exporter: Intelligent Infrastructure Monitoring

A comprehensive monitoring solution for Proxmox VE environments that automatically detects and monitors available system components.
The Problem We Solved
Managing monitoring in diverse Proxmox homelab environments typically requires juggling multiple tools: node_exporter for basic system stats, nvidia_gpu_exporter for NVIDIA cards, custom scripts for AMD GPUs, zfs_exporter for storage metrics, manual smartctl parsing for disk health, and temperature monitoring via lm-sensors. Each tool requires separate configuration, installation, and maintenance.
What We Built
The Smart Adaptive Proxmox Node Exporter is a single Python application that intelligently detects available hardware and software components, automatically adapting its monitoring capabilities without manual configuration. It provides comprehensive metrics for Proxmox environments, scaling from single-node homelabs to multi-node clusters.
How It Works
Intelligent Hardware Detection
The exporter performs automatic discovery on startup:
```python
import shutil
import subprocess

def _detect_features(self):
    """Detect available system features"""
    # GPU detection: multi-vendor support
    if shutil.which('nvidia-smi'):
        result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True, timeout=2)
        if result.returncode == 0 and 'GPU' in result.stdout:
            self.features['nvidia_gpu'] = True

    # AMD GPU via ROCm or sysfs
    if self._detect_amd_gpu_sysfs():
        self.features['amd_gpu'] = True

    # ZFS pools
    if shutil.which('zpool'):
        result = subprocess.run(['zpool', 'list'], capture_output=True, text=True, timeout=2)
        if result.returncode == 0:
            self.features['zfs'] = True
```
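The detection method above delegates AMD discovery to `_detect_amd_gpu_sysfs()`, whose body is not shown. A minimal sketch of how such a check could work, assuming it scans the DRM devices in sysfs for AMD's PCI vendor ID (`0x1002`); the function name `detect_amd_gpu_sysfs` and the `sysfs_root` parameter are hypothetical, added so the check can be tested against a fake directory tree.

```python
import glob
import os

AMD_PCI_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD/ATI

def detect_amd_gpu_sysfs(sysfs_root="/sys"):
    """Return True if any DRM card under sysfs_root reports the AMD vendor ID.

    Hypothetical helper: the exporter's real _detect_amd_gpu_sysfs() may differ.
    """
    pattern = os.path.join(sysfs_root, "class", "drm", "card*", "device", "vendor")
    for vendor_file in glob.glob(pattern):
        try:
            with open(vendor_file) as fh:
                if fh.read().strip().lower() == AMD_PCI_VENDOR_ID:
                    return True
        except OSError:
            # Unreadable entries are skipped rather than treated as fatal
            continue
    return False
```

Reading sysfs avoids depending on ROCm tools being installed, which is why a sysfs path is a reasonable fallback on plain Proxmox hosts.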
Zero Configuration Deployment
Installation requires no configuration files:
```bash
# Download and run the installer
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | bash
# The exporter is now running on port 9101 with auto-detected features
```
Multi-Vendor GPU Support
The exporter supports all major GPU vendors:
NVIDIA GPUs (via nvidia-smi): Temperature, utilization, memory usage, power draw, clock speeds, fan speeds, PCIe information
AMD GPUs (via ROCm and sysfs): Temperature, utilization, VRAM usage, power consumption
Intel GPUs (via sysfs): Basic metrics for discrete Intel graphics
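For NVIDIA cards, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format=csv,noheader,nounits` options. The sketch below shows one way to turn that output into per-GPU values; the field selection and the helper names `parse_nvidia_csv` and `query_nvidia_gpus` are assumptions for illustration, not the exporter's actual code.

```python
import subprocess

# Fields requested from nvidia-smi, in output column order
QUERY_FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
                "memory.used", "power.draw"]

def parse_nvidia_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip malformed lines rather than crash
        row = dict(zip(QUERY_FIELDS, values))
        for key in QUERY_FIELDS[2:]:
            # Unsupported readings come back as text such as "[N/A]"
            try:
                row[key] = float(row[key])
            except ValueError:
                row[key] = None
        gpus.append(row)
    return gpus

def query_nvidia_gpus(timeout=5):
    """Run nvidia-smi once and return parsed rows (empty list on failure)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return parse_nvidia_csv(result.stdout) if result.returncode == 0 else []
```

Mapping unsupported readings to `None` lets a collector skip a single gauge (for example power draw on a card that does not report it) while still exporting the others.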
Proxmox-Native Integration
Unlike generic exporters, the system understands Proxmox VE components:
```python
import subprocess

def collect_vm_metrics(self):
    """Collect VM and container metrics"""
    # QEMU VMs (LXC containers are handled separately and omitted here)
    if self.features.get('qemu_vms'):
        result = subprocess.run(['qm', 'list'], capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            # Skip the header row (VMID NAME STATUS ...)
            for line in result.stdout.split('\n')[1:]:
                parts = line.split()
                if len(parts) < 3:
                    continue  # blank or malformed line
                vmid, name, status = parts[0], parts[1], parts[2]
                is_running = 1 if status == 'running' else 0
                self.vm_status.labels(vmid=vmid, name=name, type='qemu').set(is_running)
```
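Separating parsing from command execution makes the `qm list` handling testable without a Proxmox host. A sketch, assuming the default column order `VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID` and VM names without whitespace; `parse_qm_list` and the `mem_mb` field are hypothetical additions, not part of the exporter as described.

```python
def parse_qm_list(text):
    """Parse `qm list` output into dicts of vmid, name, status, running, mem_mb.

    Assumes the default column order VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
    and that VM names contain no whitespace.
    """
    vms = []
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split()
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # blank, header-like, or malformed line
        vms.append({
            "vmid": parts[0],
            "name": parts[1],
            "status": parts[2],
            "running": 1 if parts[2] == "running" else 0,
            "mem_mb": int(parts[3]) if parts[3].isdigit() else None,
        })
    return vms
```

A collector method could then call `subprocess.run(['qm', 'list'], ...)` and feed `result.stdout` to this function before setting gauges.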
Performance Characteristics
The exporter operates efficiently across different environment sizes:
| Environment Type | Memory Usage | CPU Usage | Collection Time |
|---|---|---|---|
| Minimal Setup (4C/8GB/2 disks) | 42 MB | 0.3% | 150 ms |
| Typical Homelab (16C/32GB/8 disks/1 GPU) | 58 MB | 0.8% | 400 ms |
| Large Setup (32C/128GB/20 disks/4 GPUs/20 VMs) | 95 MB | 2.1% | 1.2 s |
Smart Collection Strategies
The exporter adapts collection frequency based on metric types:
```python
def collect_all_metrics(self):
    # Fast metrics collected every cycle (15s)
    self.collect_base_metrics()          # CPU, memory, network
    self.collect_temperature_metrics()   # Temperature sensors
    self.collect_zfs_metrics()           # ZFS stats

    # Slower metrics with reduced frequency
    if self._should_collect_smart():     # Every 5 minutes
        self.collect_smart_metrics()
    if self._should_collect_detailed():  # Every 30 seconds
        self.collect_gpu_metrics()
        self.collect_vm_metrics()
```
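The `_should_collect_smart()` and `_should_collect_detailed()` checks are not shown. One plausible way to implement them is a small time-based gate per tier; the `IntervalGate` class below is a hypothetical sketch, with an injectable clock so the behavior can be tested without waiting.

```python
import time

class IntervalGate:
    """Return True from ready() at most once per `interval` seconds.

    Hypothetical helper: e.g. IntervalGate(300) for SMART, IntervalGate(30)
    for GPU and VM metrics.
    """

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self._last = None  # None means "never collected": fire on first call

    def ready(self):
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

Using `time.monotonic` rather than wall-clock time keeps the intervals stable if NTP adjusts the system clock. With a 15-second cycle, a 300-second gate fires on roughly every 20th cycle.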
Current Capabilities
Hardware Monitoring
Multi-vendor GPU support: NVIDIA (via nvidia-smi), AMD (ROCm/sysfs), Intel GPUs
Temperature sensors: CPU, GPU, motherboard sensors via lm-sensors
Fan speeds and power consumption: Available through hardware sensors
SMART disk health: Comprehensive disk health monitoring for SSDs and HDDs
IPMI integration: Server hardware sensors where supported
Storage and Virtualization
ZFS native support: Pool health, ARC statistics, fragmentation metrics
Proxmox integration: Native QEMU VM and LXC container monitoring
Filesystem metrics: Comprehensive disk usage and I/O statistics
VM resource tracking: CPU, memory, and status monitoring per VM/container
System Intelligence
Automatic feature detection: Discovers available hardware and software
Performance optimization: Caching of expensive results and tiered collection intervals
Error resilience: Graceful handling of hardware failures and timeouts
Self-monitoring: Tracks its own performance and collection efficiency
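Self-monitoring and error resilience fit together naturally: each collector can be wrapped so its duration is recorded and any exception is counted instead of aborting the cycle. The `CollectorStats` class is a hypothetical sketch that keeps the numbers in memory; the exporter itself presumably exposes such values as Prometheus metrics.

```python
import time
from collections import defaultdict

class CollectorStats:
    """Record how long each collector takes and how often it fails (hypothetical)."""

    def __init__(self):
        self.durations = defaultdict(list)  # collector name -> list of seconds
        self.errors = defaultdict(int)      # collector name -> failure count

    def run(self, name, func, *args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Count the failure and keep going so one collector cannot stop the cycle
            self.errors[name] += 1
            return None
        finally:
            # Runs on success and failure alike
            self.durations[name].append(time.perf_counter() - start)
```

Usage inside a collection cycle might look like `stats.run("zfs", self.collect_zfs_metrics)`.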
Installation and Usage
Quick Installation
```bash
# On your Proxmox node
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | sudo bash
```
Integration with Prometheus
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - 'your-proxmox-host:9101'
    scrape_interval: 30s
```
Available Metrics
The exporter provides comprehensive metrics across multiple categories:
Always Available:
node_cpu_*: CPU metrics (usage, frequency, load)
node_memory_*: Memory metrics (total, free, available, swap)
node_filesystem_*: Filesystem metrics
node_disk_*: Disk I/O metrics
node_network_*: Network I/O and errors
Auto-Detected Features:
node_hwmon_temp_celsius: Temperature readings
node_gpu_*: Multi-vendor GPU metrics
node_zfs_*: ZFS ARC and pool metrics
pve_vm_*: VM/Container metrics
node_disk_smart_*: SMART disk health
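Because the auto-detected families only appear when the matching hardware or software is found, it can be useful to check which ones a given node actually exposes. A sketch, assuming the exporter serves the Prometheus text format at `/metrics` on port 9101 (the default path Prometheus scrapes); `detected_families` and `fetch_metrics` are hypothetical helper names.

```python
import re
import urllib.request

# Prefixes of the auto-detected families listed above
AUTO_DETECTED_PREFIXES = ("node_hwmon_temp_celsius", "node_gpu_", "node_zfs_",
                          "pve_vm_", "node_disk_smart_")

def detected_families(exposition_text):
    """Return the auto-detected prefixes present in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return sorted(p for p in AUTO_DETECTED_PREFIXES
                  if any(n.startswith(p) for n in names))

def fetch_metrics(host, port=9101, timeout=5):
    """Download the raw metrics page from an exporter instance."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `detected_families(fetch_metrics("your-proxmox-host"))` on a node with one NVIDIA card and some VMs would be expected to include `node_gpu_` and `pve_vm_`.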
Grafana Dashboard Integration
The project includes pre-built Grafana dashboards optimized for different monitoring approaches:
Comprehensive Overview Dashboard: System overview, hardware health, GPU monitoring, storage, virtualization, and performance metrics
Focused Views Dashboard: Specialized panels for specific use cases with streamlined layouts
Both dashboards provide complete visualization with panels covering all detected system components.
Architecture and Design
Feature Detection Engine
The exporter runs a detection step once at startup to identify available hardware and software components, so that collectors run only for components that actually exist.
Prometheus-Native Design
All metrics follow Prometheus best practices with consistent labeling and naming conventions:
```text
# Well-structured metric names
node_gpu_temp_celsius{gpu="0",name="GeForce RTX 4090",vendor="nvidia"}
node_zfs_arc_hit_ratio{pool="rpool"}
pve_vm_cpu_usage_percent{vmid="100",name="gitlab",type="qemu"}
```
Error Handling and Resilience
The system includes comprehensive error handling:
```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def _safe_subprocess(self, cmd, timeout=5, **kwargs):
    """Safe subprocess execution with timeout and error handling"""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout, **kwargs)
    except subprocess.TimeoutExpired:
        logger.warning(f"Command timeout: {' '.join(cmd)}")
        return None
    except Exception as e:
        # Also covers missing binaries (FileNotFoundError)
        logger.debug(f"Command failed: {' '.join(cmd)}: {e}")
        return None
```
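Returning `None` instead of raising puts the burden on callers to degrade gracefully. The sketch below shows that pattern for ZFS pool health, using a module-level variant of the helper; `safe_run`, `parse_zpool_health`, and `collect_zpool_health` are hypothetical names. It relies on `zpool list -H -o name,health`, where `-H` prints tab-separated rows without a header.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def safe_run(cmd, timeout=5):
    """Module-level variant of _safe_subprocess: None on timeout or OS error."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        logger.warning("Command timeout: %s", " ".join(cmd))
    except OSError as exc:  # e.g. the binary is not installed
        logger.debug("Command failed: %s: %s", " ".join(cmd), exc)
    return None

def parse_zpool_health(text):
    """Map pool name -> 1 if ONLINE else 0, from tab-separated `zpool list -H` rows."""
    health = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 2:
            health[fields[0]] = 1 if fields[1].strip() == "ONLINE" else 0
    return health

def collect_zpool_health():
    result = safe_run(["zpool", "list", "-H", "-o", "name,health"])
    if result is None or result.returncode != 0:
        return {}  # degrade gracefully: no pool health this cycle
    return parse_zpool_health(result.stdout)
```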
Technical Implementation
Dependency Management
The installer intelligently handles dependencies:
```bash
# Try apt packages first (cleaner, more secure)
if apt-cache show python3-prometheus-client >/dev/null 2>&1; then
    apt-get install -y python3-prometheus-client python3-psutil
    PYBIN="/usr/bin/python3"
else
    # Fall back to a virtual environment (Debian needs python3-venv for this)
    apt-get install -y python3-venv
    python3 -m venv /opt/proxmox-exporter/.venv
    /opt/proxmox-exporter/.venv/bin/pip install prometheus-client psutil
    PYBIN="/opt/proxmox-exporter/.venv/bin/python3"
fi
```
Performance Optimizations
Intelligent caching: Expensive operations (like SMART queries) are cached with appropriate TTL
Parallel collection: I/O-bound operations run in parallel to reduce collection time
Tiered collection frequency: Different metrics collected at different intervals based on volatility
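The caching idea can be illustrated with a small time-to-live cache keyed by, for example, device path. This `TTLCache` is a hypothetical sketch rather than the exporter's implementation; the 300-second TTL in the usage note simply mirrors the five-minute SMART interval mentioned earlier.

```python
import time

class TTLCache:
    """Cache expensive results (e.g. per-disk SMART queries) for `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value
```

A collector could wrap a call such as `smartctl -H /dev/sda` in `cache.get_or_compute("/dev/sda", ...)` with `TTLCache(300)`, so scrapes between refreshes reuse the last result instead of waking the disk.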
Open Source
The Smart Adaptive Proxmox Node Exporter is open-sourced under the BSD 3-Clause license. The complete source code, installation scripts, and Grafana dashboards are available at https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter for inspection, modification, and contribution.
Project Structure
Core exporter: Single Python file with comprehensive monitoring capabilities
Installation script: Intelligent installer with automatic dependency management
Grafana dashboards: Multiple pre-built visualization options
Documentation: Setup guides and troubleshooting information
Use Cases
The exporter serves various deployment scenarios:
Homelab Environments benefit from zero-configuration monitoring across diverse hardware setups, eliminating the complexity of managing multiple monitoring tools.
Development Infrastructure uses the VM monitoring integration to track resource usage across development pipelines and container workloads.
Small to Medium Enterprise deployments leverage the multi-node cluster support for unified monitoring without complex configuration management.
The Smart Adaptive Proxmox Node Exporter provides comprehensive, zero-configuration monitoring for Proxmox VE environments, automatically adapting to available hardware and software components while maintaining high performance and reliability.
Project Repository: https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter
Written by Till Lazarev