Server Monitoring From First Principles: Building a Custom Server-Stats Tool

Nikhil Mishra

Introduction

In today's complex IT environments, understanding server performance is not just a convenience; it's a necessity. This post explores the fundamental principles behind server monitoring and dives into how we built a comprehensive server statistics tool from scratch.

As engineers, we often rely on sophisticated monitoring tools without understanding their inner workings. By breaking the problem down to first principles, we gain insight not just into how to monitor servers, but into why specific metrics matter and how they interrelate.

Understanding the Fundamentals of Server Monitoring

Why Monitor Servers?

At its core, server monitoring solves several critical problems:

  1. Proactive Issue Detection: Identifying problems before they impact users
  2. Performance Optimization: Finding bottlenecks that limit system capability
  3. Capacity Planning: Understanding resource utilization trends
  4. Security Oversight: Detecting unusual patterns that may indicate breaches

The foundation of effective monitoring lies in knowing which metrics truly matter. Let's break this down by examining the fundamental resources every server manages.

The Four Pillars of Server Resources

From first principles, every server manages four essential resources:

graph TD
    A[Server Resources] --> B[CPU]
    A --> C[Memory]
    A --> D[Disk]
    A --> E[Network]

    B --> B1[Processing Power]
    B --> B2[Task Scheduling]

    C --> C1[Data Storage]
    C --> C2[Application State]

    D --> D1[Persistent Storage]
    D --> D2[I/O Operations]

    E --> E1[Data Transfer]
    E --> E2[Communication]

Understanding how these resources interact and depend on each other is crucial. For example, insufficient memory can lead to excessive disk swapping, creating an I/O bottleneck that appears as a disk problem but originates from memory constraints.
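A quick way to confirm which resource is really under pressure is to watch swap traffic directly. A minimal sketch using vmstat from procps (the si and so columns report memory swapped in and out per second; sustained non-zero values point to memory pressure rather than a disk fault):

function swap_activity {
    echo "=== Swap Activity ==="
    # Take 3 one-second samples and skip the first, which averages since boot.
    # Fields 7 and 8 of vmstat's output are si (swap in) and so (swap out).
    vmstat 1 3 | tail -n 2 | awk '{print "si: " $7 "  so: " $8}'
    echo ""
}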

Architecture of Our Monitoring Solution

Our server-stats tool follows a modular design pattern where each component focuses on monitoring a specific aspect of the system:

flowchart TD
    A[server-stats.sh] --> B[System Information]
    A --> C[Resource Monitoring]
    A --> D[Security Metrics]

    B --> B1[OS Version]
    B --> B2[System Uptime]
    B --> B3[Load Average]
    B --> B4[Current Date/Time]

    C --> C1[CPU Usage]
    C --> C2[Memory Utilization]
    C --> C3[Disk Space Analysis]
    C --> C4[Top Processes]

    D --> D1[Active User Sessions]
    D --> D2[Failed Login Attempts]
    D --> D3[Auth Log Analysis]

This architecture allows for:

  • Independent development of each module
  • Easy extensibility to add new metrics
  • Clear separation of concerns
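In script form, this pattern amounts to one function per module plus a thin entry point that calls them in order. A minimal skeleton (the placeholder bodies below are illustrative; the real implementations appear in the next section):

function system_info {
    echo "=== System Information ==="
    date
    uname -sr
    uptime
    echo ""
}

function cpu_usage { :; }        # implemented below
function memory_usage { :; }     # implemented below
function security_metrics { :; } # implemented below

# Adding a new metric means adding one function and one call here.
system_info
cpu_usage
memory_usage
security_metrics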

Deep Dive: Implementation From First Principles

CPU Monitoring

From first principles, CPU usage is fundamentally about time allocation. When we measure CPU percentage, we're asking: "Of the total available CPU time, how much was spent on actual work versus waiting?"

The Linux kernel tracks CPU time in multiple categories:

  • user: Time spent running user space processes
  • nice: Time spent running niced processes (with adjusted priority)
  • system: Time spent in kernel operations
  • idle: Time when CPU had nothing to process
  • iowait: Time spent waiting for I/O operations
  • irq/softirq: Time handling interrupts

Our implementation extracts this data directly from /proc/stat:

function cpu_usage {
    echo "=== Total CPU Usage ==="
    # Fields 2-5 of the aggregate "cpu" line are user, nice, system, and idle,
    # in jiffies accumulated since boot.
    cpu_info=$(grep '^cpu ' /proc/stat)
    user=$(echo "$cpu_info" | awk '{print $2}')
    nice=$(echo "$cpu_info" | awk '{print $3}')
    system=$(echo "$cpu_info" | awk '{print $4}')
    idle=$(echo "$cpu_info" | awk '{print $5}')

    total=$((user + nice + system + idle))
    used=$((user + nice + system))

    cpu_percentage=$((100 * used / total))

    echo "CPU Usage: $cpu_percentage%"
    echo ""
}

While simplified (it omits iowait, irq, and steal time from the total), this function captures the essence of CPU monitoring: the ratio of active time (user + nice + system) to total time.
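Because these counters accumulate from boot, a single read yields the lifetime average rather than the current load. A sketch of a more accurate variant, sampling the same four fields twice and computing usage over a one-second delta:

function cpu_usage_delta {
    # Read the first line of /proc/stat: the "cpu" label, then user, nice, system, idle.
    read -r _ u1 n1 s1 i1 _ < /proc/stat
    sleep 1
    read -r _ u2 n2 s2 i2 _ < /proc/stat

    used=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
    total=$(( used + (i2 - i1) ))

    echo "CPU Usage (last 1s): $(( 100 * used / total ))%"
}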

Memory Monitoring From First Principles

Memory is fundamentally about the allocation of a finite resource. The key insight is understanding the difference between used and available memory, and how the system manages memory pressure.

In Linux, the free command provides this information in an accessible format:

function memory_usage {
    echo "=== Total Memory Usage ==="
    mem_info=$(free -m)

    # Row 2 (NR==2) is the "Mem:" line; columns 2 and 3 are total and used, in MB.
    total_mem=$(echo "$mem_info" | awk 'NR==2{print $2}')
    used_mem=$(echo "$mem_info" | awk 'NR==2{print $3}')

    # Note: total - used counts buffers/cache as free, so this figure
    # is larger than the kernel's own "free" column.
    free_mem=$((total_mem - used_mem))

    mem_percentage=$((100 * used_mem / total_mem))

    echo "Total Memory: ${total_mem}MB"
    echo "Used Memory: ${used_mem}MB"
    echo "Free Memory: ${free_mem}MB"
    echo "Memory Usage Percentage: ${mem_percentage}%"
    echo ""
}

It's worth noting that modern Linux kernels have sophisticated memory management that includes caching frequently used data. A more comprehensive analysis would distinguish between memory used by applications and memory used for cache, which can be released if needed.
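On reasonably recent kernels (3.14 and later), this distinction is already computed for us: the available column of free, backed by MemAvailable in /proc/meminfo, estimates how much memory new applications can claim without swapping. A minimal sketch of reading it:

function memory_available {
    # Column 7 of the "Mem:" row is "available": usable memory including
    # cache the kernel can reclaim, which "free" alone understates.
    free -m | awk 'NR==2 {printf "Available Memory: %sMB of %sMB\n", $7, $2}'
    echo ""
}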

Execution Flow

The overall execution flow of our server monitoring tool follows a sequential pattern:

sequenceDiagram
    participant User
    participant Script
    participant System

    User->>Script: Execute ./server-stats.sh
    Script->>System: Request date information
    System-->>Script: Return current date
    Script->>User: Display date

    Script->>System: Request OS version
    System-->>Script: Return OS details
    Script->>User: Display OS version

    Script->>System: Request uptime data
    System-->>Script: Return uptime
    Script->>User: Display uptime

    Script->>System: Check load average
    System-->>Script: Return load statistics
    Script->>User: Display load average

    Script->>System: Query user sessions
    System-->>Script: Return active sessions
    Script->>User: Display logged in users

    Script->>System: Check security logs
    System-->>Script: Return failed login count
    Script->>User: Display failed attempts

    Script->>System: Request CPU statistics
    System-->>Script: Return CPU data
    Script->>User: Display CPU usage

    Script->>System: Query memory allocation
    System-->>Script: Return memory statistics
    Script->>User: Display memory usage

    Script->>System: Request disk information
    System-->>Script: Return disk statistics
    Script->>User: Display disk usage

    Script->>System: Query resource-intensive processes
    System-->>Script: Return top processes
    Script->>User: Display top processes
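Two steps in this flow, disk usage and top processes, haven't appeared in code yet. A minimal sketch of how those modules might look, assuming GNU df and procps ps:

function disk_usage {
    echo "=== Disk Usage ==="
    # Exclude pseudo-filesystems so only real storage is reported.
    df -h -x tmpfs -x devtmpfs
    echo ""
}

function top_processes {
    echo "=== Top 5 Processes by CPU ==="
    # Header line plus the five heaviest CPU consumers.
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 6
    echo ""
}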

Security Considerations

From first principles, system security involves detecting anomalies and unauthorized access attempts. Our tool incorporates basic security monitoring:

  1. Active Sessions: Shows who is currently logged in, allowing administrators to identify unexpected users
  2. Failed Login Attempts: A sudden increase in failed logins often indicates a brute force attack

function failed_login_attempts {
    echo "=== Failed Login Attempts ==="
    # grep -c counts matching lines directly (no cat | grep | wc pipeline needed).
    # /var/log/auth.log is the Debian/Ubuntu path; RHEL-family systems log to /var/log/secure.
    sudo grep -c 'Failed password' /var/log/auth.log
    echo ""
}
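The first item, active sessions, needs little more than the standard who command; a minimal sketch:

function active_sessions {
    echo "=== Active User Sessions ==="
    who
    echo "Total sessions: $(who | wc -l)"
    echo ""
}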

A more comprehensive solution would include:

  • Tracking login attempts by IP address
  • Monitoring for privilege escalation
  • Detecting unusual file system access patterns
  • Checking for modifications to critical system files
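As a sketch of the first item, the same auth log records the source address of every failure, so grouping by IP is one awk pass away (this assumes the Debian/Ubuntu sshd log format, where the address follows the word "from"):

function failed_logins_by_ip {
    echo "=== Failed Logins by Source IP ==="
    # Print the token after "from" on each failure line, then count per address.
    sudo grep 'Failed password' /var/log/auth.log \
        | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i + 1)}' \
        | sort | uniq -c | sort -rn | head -n 10
    echo ""
}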

Extending The System: Future Directions

From our first principles approach, several enhancements naturally emerge:

mindmap
  root((Server Monitoring))
    Real-time Monitoring
      Continuous data collection
      Time-series visualization
    Alerting System
      Threshold-based alerts
      Anomaly detection
    Historical Analysis
      Performance trending
      Capacity forecasting
    Network Monitoring
      Bandwidth utilization
      Connection tracking
    Service Monitoring
      API health checks
      Database performance
    Distributed Systems
      Cross-server correlation
      Service mesh analysis
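Of these, threshold-based alerting is the most direct extension of the existing script: reuse a metric function and act when it crosses a limit. A minimal sketch, with the notification step left as a placeholder for mail, a webhook, or whatever pager you use:

CPU_ALERT_THRESHOLD=90   # percent; tune for your workload

function check_cpu_alert {
    # Same two-sample technique as cpu_usage_delta, yielding a bare number.
    read -r _ u1 n1 s1 i1 _ < /proc/stat
    sleep 1
    read -r _ u2 n2 s2 i2 _ < /proc/stat
    used=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
    pct=$(( 100 * used / (used + i2 - i1) ))

    if (( pct > CPU_ALERT_THRESHOLD )); then
        echo "ALERT: CPU at ${pct}% (threshold ${CPU_ALERT_THRESHOLD}%)"
        # Replace with mail, notify-send, or a webhook call.
    fi
}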

Conclusion

Building a server monitoring system from first principles reveals the fundamental relationships between computing resources and provides deeper insights into system behavior. Our server-stats tool, while simple, demonstrates the core concepts behind effective monitoring.

By understanding the why behind each metric, not just the how, we develop more intuitive insights into server performance and can make better-informed decisions about optimization, scaling, and troubleshooting.

The journey from raw system data to actionable intelligence begins with these fundamentals. As we build more sophisticated monitoring solutions, these principles remain the foundation upon which all effective observability is built.

About The Author

I'm a system engineer and DevOps enthusiast passionate about understanding complex systems from first principles. This project is part of the roadmap.sh learning path for server administration and monitoring.


For more information about server monitoring and best practices, visit roadmap.sh/projects/server-stats

For the code, visit the GitHub repo: https://github.com/kaalpanikh/server-stats
