Smart Proxmox Node Exporter Monitoring

A comprehensive monitoring solution for Proxmox VE environments that automatically detects and monitors available system components.

The Problem We Solved

Managing monitoring in diverse Proxmox homelab environments typically requires juggling multiple tools: node_exporter for basic system stats, nvidia_gpu_exporter for NVIDIA cards, custom scripts for AMD GPUs, zfs_exporter for storage metrics, manual smartctl parsing for disk health, and temperature monitoring via lm-sensors. Each tool requires separate configuration, installation, and maintenance.

What We Built

The Smart Adaptive Proxmox Node Exporter is a single Python application that detects which hardware and software components are present on a node and enables the matching collectors without manual configuration. It provides metrics for Proxmox environments ranging from single-node homelabs to multi-node clusters.

How It Works

Intelligent Hardware Detection

The exporter performs automatic discovery on startup:

import shutil
import subprocess

def _detect_features(self):
    """Detect available system features"""
    # GPU Detection - Multi-vendor support
    if shutil.which('nvidia-smi'):
        try:
            result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True, timeout=2)
            if result.returncode == 0 and 'GPU' in result.stdout:
                self.features['nvidia_gpu'] = True
        except (subprocess.TimeoutExpired, OSError):
            pass  # a hung or broken nvidia-smi counts as "no NVIDIA GPU"

    # AMD GPU via ROCm or sysfs
    if self._detect_amd_gpu_sysfs():
        self.features['amd_gpu'] = True

    # ZFS pools
    if shutil.which('zpool'):
        try:
            result = subprocess.run(['zpool', 'list'], capture_output=True, text=True, timeout=2)
            if result.returncode == 0:
                self.features['zfs'] = True
        except (subprocess.TimeoutExpired, OSError):
            pass  # a hung or broken zpool counts as "no ZFS"
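
The _detect_amd_gpu_sysfs helper is not shown above. As a rough sketch of what such a check could look like, assuming it scans the DRM card entries in sysfs for AMD's PCI vendor ID (0x1002): the function name, the drm_root parameter, and the glob pattern here are illustrative assumptions, not the exporter's actual code.

```python
from pathlib import Path

AMD_PCI_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD/ATI


def detect_amd_gpu_sysfs(drm_root="/sys/class/drm"):
    """Return True if any DRM card under drm_root reports the AMD vendor ID.

    Hypothetical helper: drm_root is a parameter only so the function can be
    pointed at a fake directory tree in tests.
    """
    root = Path(drm_root)
    if not root.is_dir():
        return False
    for vendor_file in root.glob("card*/device/vendor"):
        try:
            if vendor_file.read_text().strip().lower() == AMD_PCI_VENDOR_ID:
                return True
        except OSError:
            # Unreadable entries are treated as "not detected"
            continue
    return False


if __name__ == "__main__":
    print("AMD GPU detected:", detect_amd_gpu_sysfs())
```

Reading sysfs avoids a dependency on ROCm tooling, at the cost of knowing only that a card exists, not what it can report.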

Zero Configuration Deployment

Installation requires no configuration files:

# Download and run the installer
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | bash
# The exporter is now running on port 9101 with auto-detected features

Multi-Vendor GPU Support

The exporter supports NVIDIA, AMD, and Intel GPUs:

NVIDIA GPUs (via nvidia-smi): Temperature, utilization, memory usage, power draw, clock speeds, fan speeds, PCIe information
AMD GPUs (via ROCm and sysfs): Temperature, utilization, VRAM usage, power consumption
Intel GPUs (via sysfs): Basic metrics for discrete Intel graphics

Proxmox-Native Integration

Unlike generic exporters, the system understands Proxmox VE components:

def collect_vm_metrics(self):
    """Collect VM and container metrics"""
    # QEMU VMs
    if self.features.get('qemu_vms'):
        result = subprocess.run(['qm', 'list'], capture_output=True, text=True, timeout=5)
        # Ignore the output entirely if qm reported an error
        lines = result.stdout.splitlines()[1:] if result.returncode == 0 else []
        for line in lines:  # the header row has already been skipped
            parts = line.split()
            if len(parts) < 3:
                continue  # blank or malformed line
            vmid, name, status = parts[0], parts[1], parts[2]
            is_running = 1 if status == 'running' else 0
            self.vm_status.labels(vmid=vmid, name=name, type='qemu').set(is_running)
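
LXC containers can be handled in the same way as the QEMU loop above. The sketch below parses the text output of pct list into tuples. It assumes the column layout VMID, Status, Lock, Name with an often-empty Lock column, which is why the name is taken from the last field. The function name and the sample output are illustrative, not taken from the exporter.

```python
def parse_pct_list(output):
    """Parse `pct list` text into (vmid, name, is_running) tuples.

    Assumed column layout: VMID, Status, Lock, Name. The Lock column is
    usually empty, so the name is read from the last whitespace-separated
    field rather than from a fixed index.
    """
    containers = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # blank or malformed line
        vmid, status, name = parts[0], parts[1], parts[-1]
        containers.append((vmid, name, 1 if status == "running" else 0))
    return containers


# Made-up sample output for illustration
SAMPLE = """\
VMID       Status     Lock         Name
101        running                 pihole
102        stopped    backup       build-agent
"""

if __name__ == "__main__":
    for vmid, name, is_running in parse_pct_list(SAMPLE):
        print(vmid, name, is_running)
```

Keeping the parser separate from the subprocess call makes it possible to test it against captured output on a machine without Proxmox installed.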

Performance Characteristics

The exporter operates efficiently across different environment sizes:

Environment Type	Memory Usage	CPU Usage	Collection Time
Minimal Setup (4C/8GB/2 disks)	42 MB	0.3%	150ms
Typical Homelab (16C/32GB/8 disks/1 GPU)	58 MB	0.8%	400ms
Large Setup (32C/128GB/20 disks/4 GPUs/20 VMs)	95 MB	2.1%	1.2s

Smart Collection Strategies

The exporter adapts collection frequency based on metric types:

def collect_all_metrics(self):
    # Fast metrics collected every cycle (15s)
    self.collect_base_metrics()          # CPU, memory, network
    self.collect_temperature_metrics()   # Temperature sensors
    self.collect_zfs_metrics()          # ZFS stats

    # Slower metrics with reduced frequency
    if self._should_collect_smart():     # Every 5 minutes
        self.collect_smart_metrics()

    if self._should_collect_detailed():  # Every 30 seconds
        self.collect_gpu_metrics()
        self.collect_vm_metrics()
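
The _should_collect_smart and _should_collect_detailed checks are not shown. One plausible way to implement such a gate is to remember when the collector last ran and compare against a monotonic clock. The IntervalGate class below is a hypothetical sketch under that assumption, not the exporter's implementation.

```python
import time


class IntervalGate:
    """Answer "has at least interval_seconds passed since the last run?".

    Hypothetical helper. The clock is injectable so the gate can be tested
    without sleeping; time.monotonic is used by default because it is not
    affected by wall-clock adjustments.
    """

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval = interval_seconds
        self._clock = clock
        self._last_run = None

    def ready(self):
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self.interval:
            self._last_run = now  # record this run so later calls wait
            return True
        return False


if __name__ == "__main__":
    smart_gate = IntervalGate(300)  # SMART: every 5 minutes
    print(smart_gate.ready())  # first call always runs
    print(smart_gate.ready())  # an immediate second call is skipped
```

With one gate per slow collector, the main loop stays a flat sequence of "if gate.ready(): collect" checks.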

Current Capabilities

Hardware Monitoring

Multi-vendor GPU support: NVIDIA (via nvidia-smi), AMD (ROCm/sysfs), Intel GPUs
Temperature sensors: CPU, GPU, motherboard sensors via lm-sensors
Fan speeds and power consumption: Available through hardware sensors
SMART disk health: Comprehensive disk health monitoring for SSDs and HDDs
IPMI integration: Server hardware sensors where supported
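
Temperature readings ultimately come from the kernel's hwmon interface in sysfs, which lm-sensors also reads. As a sketch of reading it directly, assuming the standard layout of one directory per chip with tempN_input files in millidegrees Celsius: the function name and the hwmon_root parameter are illustrative, and the exporter itself may rely on lm-sensors instead.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {(chip, sensor): degrees_celsius} from hwmon sysfs files.

    Hypothetical helper: hwmon_root is a parameter only so a fake tree can
    be used in tests. tempN_input values are in millidegrees Celsius.
    """
    readings = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings
    for temp_file in sorted(root.glob("hwmon*/temp*_input")):
        chip_dir = temp_file.parent
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to the directory name
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not reporting a number
        sensor = temp_file.name[: -len("_input")]
        readings[(chip, sensor)] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for (chip, sensor), celsius in read_hwmon_temps().items():
        print(f"{chip}/{sensor}: {celsius:.1f} C")
```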

Storage and Virtualization

ZFS native support: Pool health, ARC statistics, fragmentation metrics
Proxmox integration: Native QEMU VM and LXC container monitoring
Filesystem metrics: Comprehensive disk usage and I/O statistics
VM resource tracking: CPU, memory, and status monitoring per VM/container
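
To make the ARC statistics concrete: on Linux, OpenZFS publishes ARC counters in /proc/spl/kstat/zfs/arcstats as name, type, and data columns after two header lines. The sketch below parses that text and derives a hit ratio from the hits and misses counters. The function names and the sample text are illustrative, and the exporter may compute its ratio differently.

```python
def parse_arcstats(text):
    """Parse arcstats kstat text into a {name: int} dictionary."""
    stats = {}
    for line in text.splitlines()[2:]:  # first two lines are kstat headers
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats):
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


# Made-up sample in the arcstats layout, for illustration only
SAMPLE_ARCSTATS = """\
13 1 0x01 2 544 1000 2000
name                            type data
hits                            4    900
misses                          4    100
size                            4    4294967296
"""

if __name__ == "__main__":
    try:
        with open("/proc/spl/kstat/zfs/arcstats") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE_ARCSTATS  # ZFS not loaded; fall back to the sample
    print(f"ARC hit ratio: {arc_hit_ratio(parse_arcstats(text)):.3f}")
```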

System Intelligence

Automatic feature detection: Discovers available hardware and software
Performance optimization: Intelligent caching and collection strategies
Error resilience: Graceful handling of hardware failures and timeouts
Self-monitoring: Tracks its own performance and collection efficiency

Installation and Usage

Quick Installation

# On your Proxmox node
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | sudo bash

Integration with Prometheus

# prometheus.yml
scrape_configs:
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
        - 'your-proxmox-host:9101'
    scrape_interval: 30s

Available Metrics

The exporter exposes two groups of metrics: a base set that is always present, and feature-specific sets that appear only when the matching component is detected.

Always Available:

node_cpu_* - CPU metrics (usage, frequency, load)
node_memory_* - Memory metrics (total, free, available, swap)
node_filesystem_* - Filesystem metrics
node_disk_* - Disk I/O metrics
node_network_* - Network I/O and errors

Auto-Detected Features:

node_hwmon_temp_celsius - Temperature readings
node_gpu_* - Multi-vendor GPU metrics
node_zfs_* - ZFS ARC and pool metrics
pve_vm_* - VM/Container metrics
node_disk_smart_* - SMART disk health
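
A quick way to see which of these groups a given node actually exposes is to summarize the metric name prefixes in the /metrics output. The sketch below works on Prometheus text exposition format and uses only the standard library. The function name and the depth parameter are illustrative, and the sample lines reuse metric names mentioned in this article with made-up values.

```python
def metric_prefixes(exposition_text, depth=2):
    """Return the sorted set of leading name segments of all sample lines.

    With depth=2, node_gpu_temp_celsius is reduced to node_gpu.
    Comment lines (# HELP, # TYPE) and blank lines are ignored.
    """
    prefixes = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or at the first space
        name_part = line.split("{", 1)[0].split()
        if not name_part:
            continue
        prefixes.add("_".join(name_part[0].split("_")[:depth]))
    return sorted(prefixes)


# Sample values are invented; only the metric names come from the article
SAMPLE_METRICS = """\
# HELP node_gpu_temp_celsius GPU temperature
# TYPE node_gpu_temp_celsius gauge
node_gpu_temp_celsius{gpu="0",vendor="nvidia"} 54
node_hwmon_temp_celsius{chip="k10temp"} 45.5
node_zfs_arc_hit_ratio 0.93
pve_vm_cpu_usage_percent{vmid="100",type="qemu"} 12.5
"""

if __name__ == "__main__":
    print(metric_prefixes(SAMPLE_METRICS))
```

To run it against a live exporter, fetch http://your-proxmox-host:9101/metrics with urllib.request.urlopen and pass the decoded body to metric_prefixes.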

Grafana Dashboard Integration

The project includes pre-built Grafana dashboards optimized for different monitoring approaches:

Comprehensive Overview Dashboard: System overview, hardware health, GPU monitoring, storage, virtualization, and performance metrics
Focused Views Dashboard: Specialized panels for specific use cases with streamlined layouts

Both dashboards provide complete visualization with panels covering all detected system components.

Architecture and Design

Feature Detection Engine

Feature detection runs once at startup and records which hardware and software components are present, so that metrics are collected only for components that actually exist.

Prometheus-Native Design

All metrics follow Prometheus best practices with consistent labeling and naming conventions:

# Well-structured metric names
node_gpu_temp_celsius{gpu="0",name="GeForce RTX 4090",vendor="nvidia"}
node_zfs_arc_hit_ratio{pool="rpool"}
pve_vm_cpu_usage_percent{vmid="100",name="gitlab",type="qemu"}

Error Handling and Resilience

External commands run through a wrapper that enforces a timeout and returns None on failure instead of raising:

def _safe_subprocess(self, cmd, timeout=5, **kwargs):
    """Safe subprocess execution with timeout and error handling"""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, 
                               timeout=timeout, **kwargs)
        return result
    except subprocess.TimeoutExpired:
        logger.warning(f"Command timeout: {' '.join(cmd)}")
        return None
    except Exception as e:
        logger.debug(f"Command failed: {' '.join(cmd)}: {e}")
        return None
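
Because the wrapper returns None on failure, every caller has to check for None as well as for a non-zero exit code. The sketch below shows that calling pattern with a standalone variant of the wrapper. The safe_run and zpool_names names are illustrative and not part of the exporter; zpool list -H -o name prints one pool name per line without a header.

```python
import logging
import subprocess
import sys

logger = logging.getLogger("exporter-sketch")


def safe_run(cmd, timeout=5):
    """Run cmd; return CompletedProcess, or None on timeout or launch error."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        logger.warning("Command timeout: %s", " ".join(cmd))
        return None
    except OSError as exc:  # includes FileNotFoundError for missing binaries
        logger.debug("Command failed: %s: %s", " ".join(cmd), exc)
        return None


def zpool_names():
    """Return ZFS pool names, or an empty list if zpool is unavailable."""
    result = safe_run(["zpool", "list", "-H", "-o", "name"])
    if result is None or result.returncode != 0:
        return []  # treat any failure as "no pools"
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(safe_run([sys.executable, "-c", "print('ok')"]).stdout.strip())
    print(zpool_names())
```

Collapsing every failure to an empty result keeps the collector loop simple, but it also hides the difference between "tool missing" and "tool broken", so logging inside the wrapper matters.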

Technical Implementation

Dependency Management

The installer prefers distribution packages and falls back to a virtual environment when they are not available:

# Try apt packages first (cleaner, more secure)
if apt-cache show python3-prometheus-client >/dev/null 2>&1; then
    # apt-get is used here because its command-line interface is stable for scripts
    apt-get install -y python3-prometheus-client python3-psutil
    PYBIN="/usr/bin/python3"
else
    # Fall back to virtual environment
    python3 -m venv /opt/proxmox-exporter/.venv
    /opt/proxmox-exporter/.venv/bin/pip install prometheus-client psutil
    PYBIN="/opt/proxmox-exporter/.venv/bin/python3"
fi

Performance Optimizations

Intelligent caching: Expensive operations (like SMART queries) are cached with appropriate TTL
Parallel collection: I/O-bound operations run in parallel to reduce collection time
Tiered collection frequency: Different metrics collected at different intervals based on volatility
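
As an illustration of the parallel collection point, I/O-bound collectors that mostly wait on subprocesses or files can be run in a thread pool so that the total time approaches that of the slowest collector rather than the sum. This is a minimal sketch using the standard library; run_collectors and the collector names are hypothetical, and the exporter's own scheduling may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_collectors(collectors, max_workers=4):
    """Run each named collector in a thread.

    Returns {name: result}. If a collector raises, its exception object is
    stored as the result so one failure does not affect the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results


def slow_collector():
    time.sleep(0.2)  # stands in for a slow external command
    return "done"


def failing_collector():
    raise RuntimeError("sensor offline")


if __name__ == "__main__":
    started = time.monotonic()
    outcome = run_collectors(
        {"smart": slow_collector, "gpu": slow_collector, "ipmi": failing_collector}
    )
    print(outcome)
    print(f"elapsed: {time.monotonic() - started:.2f}s")
```

Threads suit this workload because the collectors spend their time blocked on I/O; CPU-bound parsing would not benefit in the same way.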

Open Source

The Smart Adaptive Proxmox Node Exporter is open-sourced under the BSD 3-Clause license. The complete source code, installation scripts, and Grafana dashboards are available at https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter for inspection, modification, and contribution.

Project Structure

Core exporter: Single Python file with comprehensive monitoring capabilities
Installation script: Intelligent installer with automatic dependency management
Grafana dashboards: Multiple pre-built visualization options
Documentation: Setup guides and troubleshooting information

Use Cases

The exporter serves various deployment scenarios:

Homelab Environments benefit from zero-configuration monitoring across diverse hardware setups, eliminating the complexity of managing multiple monitoring tools.

Development Infrastructure uses the VM monitoring integration to track resource usage across development pipelines and container workloads.

Small to Medium Enterprise deployments leverage the multi-node cluster support for unified monitoring without complex configuration management.

The Smart Adaptive Proxmox Node Exporter provides comprehensive, zero-configuration monitoring for Proxmox VE environments, automatically adapting to available hardware and software components while maintaining high performance and reliability.

Project Repository: https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter
