Smart Adaptive Proxmox Node Exporter: Intelligent Infrastructure Monitoring

Till Lazarev
6 min read

A comprehensive monitoring solution for Proxmox VE environments that automatically detects and monitors available system components.


The Problem We Solved

Managing monitoring in diverse Proxmox homelab environments typically requires juggling multiple tools: node_exporter for basic system stats, nvidia_gpu_exporter for NVIDIA cards, custom scripts for AMD GPUs, zfs_exporter for storage metrics, manual smartctl parsing for disk health, and temperature monitoring via lm-sensors. Each tool requires separate configuration, installation, and maintenance.

What We Built

The Smart Adaptive Proxmox Node Exporter is a single Python application that intelligently detects available hardware and software components, automatically adapting its monitoring capabilities without manual configuration. It provides comprehensive metrics for Proxmox environments, scaling from single-node homelabs to multi-node clusters.

How It Works

Intelligent Hardware Detection

The exporter performs automatic discovery on startup:

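import shutil
import subprocess

def _detect_features(self):
    """Detect available system features"""
    # GPU Detection - Multi-vendor support
    # nvidia-smi -L prints one "GPU n: ..." line per device
    if shutil.which('nvidia-smi'):
        try:
            result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True, timeout=2)
            if result.returncode == 0 and 'GPU' in result.stdout:
                self.features['nvidia_gpu'] = True
        except (subprocess.TimeoutExpired, OSError):
            pass  # a hung or broken nvidia-smi counts as "no NVIDIA GPU"

    # AMD GPU via ROCm or sysfs
    if self._detect_amd_gpu_sysfs():
        self.features['amd_gpu'] = True

    # ZFS pools
    if shutil.which('zpool'):
        try:
            result = subprocess.run(['zpool', 'list'], capture_output=True, text=True, timeout=2)
            if result.returncode == 0:
                self.features['zfs'] = True
        except (subprocess.TimeoutExpired, OSError):
            pass  # a hung or broken zpool counts as "no ZFS"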
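
The body of _detect_amd_gpu_sysfs is not shown above. As a rough sketch of how such a check could work, the function below scans the DRM entries in sysfs for the AMD PCI vendor ID (0x1002). The function name, the drm_root parameter, and the constant are illustrative assumptions, not the exporter's actual code.

```python
from pathlib import Path

AMD_PCI_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD/ATI


def detect_amd_gpu_sysfs(drm_root="/sys/class/drm"):
    """Return True if any DRM card under drm_root reports the AMD vendor ID.

    Hypothetical helper: drm_root is a parameter only so the check can be
    pointed at a test directory instead of the real sysfs tree.
    """
    root = Path(drm_root)
    if not root.is_dir():
        return False
    for vendor_file in root.glob("card*/device/vendor"):
        try:
            if vendor_file.read_text().strip().lower() == AMD_PCI_VENDOR_ID:
                return True
        except OSError:
            continue  # unreadable entry: skip it rather than fail detection
    return False
```

Because the check only reads small text files, it needs no subprocess call and no timeout handling.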

Zero Configuration Deployment

Installation requires no configuration files:

# Download and run the installer (as root, since it installs packages)
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | bash
# The exporter is now running on port 9101 with auto-detected features

Multi-Vendor GPU Support

The exporter supports GPUs from NVIDIA, AMD, and Intel:

  • NVIDIA GPUs (via nvidia-smi): Temperature, utilization, memory usage, power draw, clock speeds, fan speeds, PCIe information

  • AMD GPUs (via ROCm and sysfs): Temperature, utilization, VRAM usage, power consumption

  • Intel GPUs (via sysfs): Basic metrics for discrete Intel graphics
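
To illustrate the NVIDIA path, here is a minimal sketch of parsing nvidia-smi's CSV query output. The field selection, the names FIELDS, QUERY_CMD, and parse_nvidia_smi_csv, and the sample row are assumptions made for illustration; the exporter may query different fields or parse them differently.

```python
import csv
import io

# Fields requested from nvidia-smi; the order must match the parsed columns.
FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
          "memory.used", "power.draw"]
QUERY_CMD = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
             "--format=csv,noheader,nounits"]


def parse_nvidia_smi_csv(output):
    """Parse the CSV text produced by QUERY_CMD into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # blank or unexpected line
        gpu = {"index": row[0].strip(), "name": row[1].strip()}
        for field, raw in zip(FIELDS[2:], row[2:]):
            try:
                gpu[field] = float(raw)
            except ValueError:
                gpu[field] = None  # non-numeric placeholder such as "[N/A]"
        gpus.append(gpu)
    return gpus


# Illustrative sample row, not real measurements
sample = "0, NVIDIA GeForce RTX 4090, 45, 12, 2048, 65.30\n"
print(parse_nvidia_smi_csv(sample))
```

Returning None for values the driver does not report lets the caller skip that gauge instead of exporting a misleading zero.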

Proxmox-Native Integration

Unlike generic exporters, the system understands Proxmox VE components:

def collect_vm_metrics(self):
    """Collect VM and container metrics"""
    # QEMU VMs
    if self.features.get('qemu_vms'):
        result = subprocess.run(['qm', 'list'], capture_output=True, text=True, timeout=5)
        # Skip the header row, then read VMID, NAME, STATUS from each line
        for line in result.stdout.splitlines()[1:]:
            parts = line.split()
            if len(parts) < 3:
                continue  # blank or malformed line
            vmid, name, status = parts[0], parts[1], parts[2]
            is_running = 1 if status == 'running' else 0
            self.vm_status.labels(vmid=vmid, name=name, type='qemu').set(is_running)
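
The excerpt covers QEMU VMs only. LXC containers could be handled along the same lines by parsing `pct list`. The sketch below assumes the column order VMID, Status, Lock, Name, with the Lock column often empty; parse_pct_list is a hypothetical helper, not a function from the exporter.

```python
def parse_pct_list(output):
    """Parse `pct list` text into (vmid, name, is_running) tuples.

    Assumes columns VMID, Status, Lock, Name. The Lock column may be empty,
    so the name is taken from the last field rather than a fixed position.
    """
    containers = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # blank or malformed line
        vmid, status, name = parts[0], parts[1], parts[-1]
        containers.append((vmid, name, 1 if status == "running" else 0))
    return containers


# Illustrative sample output
sample = (
    "VMID       Status     Lock         Name\n"
    "100        running                 webserver\n"
    "101        stopped    backup       db\n"
)
print(parse_pct_list(sample))
```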

Performance Characteristics

The exporter operates efficiently across different environment sizes:

| Environment Type | Memory Usage | CPU Usage | Collection Time |
|---|---|---|---|
| Minimal Setup (4C/8GB/2 disks) | 42 MB | 0.3% | 150 ms |
| Typical Homelab (16C/32GB/8 disks/1 GPU) | 58 MB | 0.8% | 400 ms |
| Large Setup (32C/128GB/20 disks/4 GPUs/20 VMs) | 95 MB | 2.1% | 1.2 s |

Smart Collection Strategies

The exporter adapts collection frequency based on metric types:

def collect_all_metrics(self):
    # Fast metrics collected every cycle (15s)
    self.collect_base_metrics()          # CPU, memory, network
    self.collect_temperature_metrics()   # Temperature sensors
    self.collect_zfs_metrics()          # ZFS stats

    # Slower metrics with reduced frequency
    if self._should_collect_smart():     # Every 5 minutes
        self.collect_smart_metrics()

    if self._should_collect_detailed():  # Every 30 seconds
        self.collect_gpu_metrics()
        self.collect_vm_metrics()
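
The helpers _should_collect_smart and _should_collect_detailed are not shown. One simple way to implement that kind of gate is a small object that remembers when it last fired. IntervalGate and its clock parameter are hypothetical names used for this sketch only.

```python
import time


class IntervalGate:
    """Report True at most once per `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._last = None   # time of the last True result, None before the first

    def ready(self):
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False


# Example wiring, mirroring the intervals in the comments above
smart_gate = IntervalGate(300)    # SMART: every 5 minutes
detailed_gate = IntervalGate(30)  # GPU and VM metrics: every 30 seconds
```

Using a monotonic clock keeps the intervals stable if the system time is adjusted while the exporter is running.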

Current Capabilities

Hardware Monitoring

  • Multi-vendor GPU support: NVIDIA (via nvidia-smi), AMD (ROCm/sysfs), Intel GPUs

  • Temperature sensors: CPU, GPU, motherboard sensors via lm-sensors

  • Fan speeds and power consumption: Available through hardware sensors

  • SMART disk health: Comprehensive disk health monitoring for SSDs and HDDs

  • IPMI integration: Server hardware sensors where supported
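
For the SMART item, a minimal sketch of reading disk health from smartctl's JSON output follows. It assumes smartmontools 7.0 or later (which added `--json`) and the `smart_status.passed` and `temperature.current` keys. The function name and the returned dictionary shape are illustrative, not the exporter's own.

```python
import json


def parse_smart_health(raw_json):
    """Extract a health flag and temperature from `smartctl --json -H -A <dev>`.

    Returns None when the input is not a JSON object. Individual values are
    None when the drive does not report them.
    """
    try:
        data = json.loads(raw_json)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    passed = (data.get("smart_status") or {}).get("passed")
    temp = (data.get("temperature") or {}).get("current")
    return {
        "healthy": None if passed is None else int(bool(passed)),
        "temperature_celsius": temp,
    }


# Illustrative sample, trimmed to the two keys used above
sample = '{"smart_status": {"passed": true}, "temperature": {"current": 34}}'
print(parse_smart_health(sample))
```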

Storage and Virtualization

  • ZFS native support: Pool health, ARC statistics, fragmentation metrics

  • Proxmox integration: Native QEMU VM and LXC container monitoring

  • Filesystem metrics: Comprehensive disk usage and I/O statistics

  • VM resource tracking: CPU, memory, and status monitoring per VM/container
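
As an illustration of the ARC statistics item, the hit ratio can be derived from the `hits` and `misses` counters that OpenZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats. The sketch below is an assumption about how such a value could be computed, with hypothetical function names and made-up sample numbers. It is not taken from the exporter.

```python
def parse_arcstats(text):
    """Parse arcstats text (rows of "name type data") into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Header lines do not match the three-column, numeric-data shape
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats):
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative sample with made-up numbers
sample = (
    "13 1 0x01 96 26112 8523432 1234567\n"
    "name                            type data\n"
    "hits                            4    900\n"
    "misses                          4    100\n"
)
print(arc_hit_ratio(parse_arcstats(sample)))
```

Both counters are cumulative since module load, so this gives a lifetime ratio. A per-interval ratio would need the difference between two readings.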

System Intelligence

  • Automatic feature detection: Discovers available hardware and software

  • Performance optimization: Intelligent caching and collection strategies

  • Error resilience: Graceful handling of hardware failures and timeouts

  • Self-monitoring: Tracks its own performance and collection efficiency

Installation and Usage

Quick Installation

# On your Proxmox node
curl -fsSL https://data.lazarev.cloud/install-proxmox-exporter.sh | sudo bash

Integration with Prometheus

# prometheus.yml
scrape_configs:
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
        - 'your-proxmox-host:9101'
    scrape_interval: 30s

Available Metrics

The exporter provides comprehensive metrics across multiple categories:

Always Available:

  • node_cpu_* - CPU metrics (usage, frequency, load)

  • node_memory_* - Memory metrics (total, free, available, swap)

  • node_filesystem_* - Filesystem metrics

  • node_disk_* - Disk I/O metrics

  • node_network_* - Network I/O and errors

Auto-Detected Features:

  • node_hwmon_temp_celsius - Temperature readings

  • node_gpu_* - Multi-vendor GPU metrics

  • node_zfs_* - ZFS ARC and pool metrics

  • pve_vm_* - VM/Container metrics

  • node_disk_smart_* - SMART disk health

Grafana Dashboard Integration

The project includes pre-built Grafana dashboards optimized for different monitoring approaches:

  • Comprehensive Overview Dashboard: System overview, hardware health, GPU monitoring, storage, virtualization, and performance metrics

  • Focused Views Dashboard: Specialized panels for specific use cases with streamlined layouts

Both dashboards provide complete visualization with panels covering all detected system components.

Architecture and Design

Feature Detection Engine

On startup, the exporter probes for the hardware and software components present on the node and collects metrics only for the components it finds.

Prometheus-Native Design

All metrics follow Prometheus best practices with consistent labeling and naming conventions:

# Well-structured metric names
node_gpu_temp_celsius{gpu="0",name="GeForce RTX 4090",vendor="nvidia"}
node_zfs_arc_hit_ratio{pool="rpool"}
pve_vm_cpu_usage_percent{vmid="100",name="gitlab",type="qemu"}
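
To make the structure of these lines concrete, the sketch below renders a labeled sample in the Prometheus text exposition format using only the standard library. It is illustrative only: format_sample is a hypothetical helper, and the exporter itself presumably relies on the prometheus-client package that the installer sets up.

```python
def format_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""

    def escape(label_value):
        # Label values escape backslash, double quote, and newline
        return (str(label_value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))

    if labels:
        body = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


print(format_sample(
    "node_gpu_temp_celsius",
    {"gpu": "0", "name": "GeForce RTX 4090", "vendor": "nvidia"},
    45.0,
))
```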

Error Handling and Resilience

The system includes comprehensive error handling:

import logging
import subprocess

logger = logging.getLogger(__name__)

def _safe_subprocess(self, cmd, timeout=5, **kwargs):
    """Safe subprocess execution with timeout and error handling"""
    try:
        # Returns the CompletedProcess; callers still need to check returncode
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout, **kwargs)
    except subprocess.TimeoutExpired:
        logger.warning(f"Command timeout: {' '.join(cmd)}")
        return None
    except Exception as e:
        # Covers a missing binary, permission errors, and similar failures
        logger.debug(f"Command failed: {' '.join(cmd)}: {e}")
        return None

Technical Implementation

Dependency Management

The installer intelligently handles dependencies:

# Try apt packages first (cleaner, more secure)
if apt-cache show python3-prometheus-client >/dev/null 2>&1; then
    # apt-get is used here because apt's command-line interface is not meant for scripts
    apt-get install -y python3-prometheus-client python3-psutil
    PYBIN="/usr/bin/python3"
else
    # Fall back to virtual environment
    python3 -m venv /opt/proxmox-exporter/.venv
    /opt/proxmox-exporter/.venv/bin/pip install prometheus-client psutil
    PYBIN="/opt/proxmox-exporter/.venv/bin/python3"
fi

Performance Optimizations

  • Intelligent caching: Expensive operations (like SMART queries) are cached with appropriate TTL

  • Parallel collection: I/O-bound operations run in parallel to reduce collection time

  • Tiered collection frequency: Different metrics collected at different intervals based on volatility
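
The parallel collection idea can be sketched with a thread pool from the standard library. This is a minimal illustration under assumed names (collect_parallel, the collectors mapping), not the exporter's actual implementation. Each collector is expected to enforce its own timeout, for example through the subprocess timeout shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_parallel(collectors, max_workers=4):
    """Run collector callables concurrently.

    `collectors` maps a name to a zero-argument callable. Returns a dict of
    name -> result, with None for any collector that raised an exception.
    """
    results = {name: None for name in collectors}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in collectors.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                results[name] = None  # one failing collector must not break the rest
    return results
```

Threads suit this workload because the collectors mostly wait on subprocesses and file reads, so the interpreter lock is not the limiting factor.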

Open Source

The Smart Adaptive Proxmox Node Exporter is open-sourced under the BSD 3-Clause license. The complete source code, installation scripts, and Grafana dashboards are available at https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter for inspection, modification, and contribution.

Project Structure

  • Core exporter: Single Python file with comprehensive monitoring capabilities

  • Installation script: Intelligent installer with automatic dependency management

  • Grafana dashboards: Multiple pre-built visualization options

  • Documentation: Setup guides and troubleshooting information

Use Cases

The exporter serves various deployment scenarios:

Homelab Environments benefit from zero-configuration monitoring across diverse hardware setups, eliminating the complexity of managing multiple monitoring tools.

Development Infrastructure uses the VM monitoring integration to track resource usage across development pipelines and container workloads.

Small to Medium Enterprise deployments leverage the multi-node cluster support for unified monitoring without complex configuration management.


The Smart Adaptive Proxmox Node Exporter provides comprehensive, zero-configuration monitoring for Proxmox VE environments, automatically adapting to available hardware and software components while maintaining high performance and reliability.

Project Repository: https://github.com/Lazarev-Cloud/proxmox-prometheus-exporter
