Server Monitoring From First Principles: Building a Custom Server-Stats Tool

Table of contents

Introduction
In today's complex IT infrastructure, understanding server performance is not just a convenience—it's a necessity. This blog post explores the fundamental principles behind server monitoring and dives deep into how we built a comprehensive server statistics tool from scratch.
As engineers, we often rely on sophisticated monitoring tools without understanding their inner workings. By breaking down our approach to first principles, we'll gain insights into not just how to monitor servers, but why specific metrics matter and how they interrelate.
Understanding the Fundamentals of Server Monitoring
Why Monitor Servers?
At its core, server monitoring solves several critical problems:
- Proactive Issue Detection: Identifying problems before they impact users
- Performance Optimization: Finding bottlenecks that limit system capability
- Capacity Planning: Understanding resource utilization trends
- Security Oversight: Detecting unusual patterns that may indicate breaches
The foundation of effective monitoring lies in knowing which metrics truly matter. Let's break this down by examining the fundamental resources every server manages.
The Four Pillars of Server Resources
From first principles, every server manages four essential resources:
graph TD
A[Server Resources] --> B[CPU]
A --> C[Memory]
A --> D[Disk]
A --> E[Network]
B --> B1[Processing Power]
B --> B2[Task Scheduling]
C --> C1[Data Storage]
C --> C2[Application State]
D --> D1[Persistent Storage]
D --> D2[I/O Operations]
E --> E1[Data Transfer]
E --> E2[Communication]
Understanding how these resources interact and depend on each other is crucial. For example, insufficient memory can lead to excessive disk swapping, creating an I/O bottleneck that appears as a disk problem but originates from memory constraints.
Architecture of Our Monitoring Solution
Our server-stats tool follows a modular design pattern where each component focuses on monitoring a specific aspect of the system:
flowchart TD
A[server-stats.sh] --> B[System Information]
A --> C[Resource Monitoring]
A --> D[Security Metrics]
B --> B1[OS Version]
B --> B2[System Uptime]
B --> B3[Load Average]
B --> B4[Current Date/Time]
C --> C1[CPU Usage]
C --> C2[Memory Utilization]
C --> C3[Disk Space Analysis]
C --> C4[Top Processes]
D --> D1[Active User Sessions]
D --> D2[Failed Login Attempts]
D --> D3[Auth Log Analysis]
This architecture allows for:
- Independent development of each module
- Easy extensibility to add new metrics
- Clear separation of concerns
Deep Dive: Implementation From First Principles
CPU Monitoring
From first principles, CPU usage is fundamentally about time allocation. When we measure CPU percentage, we're asking: "Of the total available CPU time, how much was spent on actual work versus waiting?"
The Linux kernel tracks CPU time in multiple categories:
- user: Time spent running user space processes
- nice: Time spent running niced processes (with adjusted priority)
- system: Time spent in kernel operations
- idle: Time when CPU had nothing to process
- iowait: Time spent waiting for I/O operations
- irq/softirq: Time handling interrupts
Our implementation extracts this data directly from /proc/stat
:
function cpu_usage {
echo "=== Total CPU Usage ==="
cpu_info=$(grep 'cpu ' /proc/stat)
user=$(echo $cpu_info | awk '{print $2}')
nice=$(echo $cpu_info | awk '{print $3}')
system=$(echo $cpu_info | awk '{print $4}')
idle=$(echo $cpu_info | awk '{print $5}')
total=$((user + nice + system + idle))
used=$((user + nice + system))
cpu_percentage=$((100 * used / total))
echo "CPU Usage: $cpu_percentage%"
echo ""
}
While simplified, this function captures the essence of CPU monitoring by calculating the ratio of active time (user + nice + system) to total time.
Memory Monitoring From First Principles
Memory is fundamentally about allocation of finite storage space. The key insight is understanding the difference between available memory, used memory, and how the system manages memory pressure.
In Linux, the free
command provides this information in an accessible format:
function memory_usage {
echo "=== Total Memory Usage ==="
mem_info=$(free -m)
total_mem=$(echo "$mem_info" | awk 'NR==2{print $2}')
used_mem=$(echo "$mem_info" | awk 'NR==2{print $3}')
free_mem=$((total_mem - used_mem))
mem_percentage=$((100 * used_mem / total_mem))
echo "Total Memory: ${total_mem}MB"
echo "Used Memory: ${used_mem}MB"
echo "Free Memory: ${free_mem}MB"
echo "Memory Usage Percentage: ${mem_percentage}%"
echo ""
}
It's worth noting that modern Linux kernels have sophisticated memory management that includes caching frequently used data. A more comprehensive analysis would distinguish between memory used by applications and memory used for cache, which can be released if needed.
Execution Flow
The overall execution flow of our server monitoring tool follows a sequential pattern:
sequenceDiagram
participant User
participant Script
participant System
User->>Script: Execute ./server-stats.sh
Script->>System: Request date information
System-->>Script: Return current date
Script->>User: Display date
Script->>System: Request OS version
System-->>Script: Return OS details
Script->>User: Display OS version
Script->>System: Request uptime data
System-->>Script: Return uptime
Script->>User: Display uptime
Script->>System: Check load average
System-->>Script: Return load statistics
Script->>User: Display load average
Script->>System: Query user sessions
System-->>Script: Return active sessions
Script->>User: Display logged in users
Script->>System: Check security logs
System-->>Script: Return failed login count
Script->>User: Display failed attempts
Script->>System: Request CPU statistics
System-->>Script: Return CPU data
Script->>User: Display CPU usage
Script->>System: Query memory allocation
System-->>Script: Return memory statistics
Script->>User: Display memory usage
Script->>System: Request disk information
System-->>Script: Return disk statistics
Script->>User: Display disk usage
Script->>System: Query resource-intensive processes
System-->>Script: Return top processes
Script->>User: Display top processes
Security Considerations
From first principles, system security involves detecting anomalies and unauthorized access attempts. Our tool incorporates basic security monitoring:
- Active Sessions: Shows who is currently logged in, allowing administrators to identify unexpected users
- Failed Login Attempts: A sudden increase in failed logins often indicates a brute force attack
function failed_login_attempts {
echo "=== Failed Login Attempts ==="
sudo cat /var/log/auth.log | grep 'Failed password' | wc -l
echo ""
}
A more comprehensive solution would include:
- Tracking login attempts by IP address
- Monitoring for privilege escalation
- Detecting unusual file system access patterns
- Checking for modifications to critical system files
Extending The System: Future Directions
From our first principles approach, several enhancements naturally emerge:
mindmap
root((Server Monitoring))
Real-time Monitoring
Continuous data collection
Time-series visualization
Alerting System
Threshold-based alerts
Anomaly detection
Historical Analysis
Performance trending
Capacity forecasting
Network Monitoring
Bandwidth utilization
Connection tracking
Service Monitoring
API health checks
Database performance
Distributed Systems
Cross-server correlation
Service mesh analysis
Conclusion
Building a server monitoring system from first principles reveals the fundamental relationships between computing resources and provides deeper insights into system behavior. Our server-stats tool, while simple, demonstrates the core concepts behind effective monitoring.
By understanding the why behind each metric, not just the how, we develop more intuitive insights into server performance and can make better-informed decisions about optimization, scaling, and troubleshooting.
The journey from raw system data to actionable intelligence begins with these fundamentals. As we build more sophisticated monitoring solutions, these principles remain the foundation upon which all effective observability is built.
About The Author
I'm a system engineer and DevOps enthusiast passionate about understanding complex systems from first principles. This project is part of the roadmap.sh learning path for server administration and monitoring.
For more information about server monitoring and best practices, visit roadmap.sh/projects/server-stats
Visit Github repo for the code, visit (https://github.com/kaalpanikh/server-stats)
Subscribe to my newsletter
Read articles from Nikhil Mishra directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Nikhil Mishra
Nikhil Mishra
I am a student studying in Mumbai University, learning DevOps, looking for opportunities to learn more things by gaining experience at prestigious institutions