Lightweight Linux System Health Check Script


In today's fast-paced DevOps and SRE environments, it's essential to have quick and reliable tools to assess the health of your servers — especially when you're dealing with performance issues or preparing for deployments. While there are powerful monitoring tools like Nagios, Prometheus, and Datadog, sometimes all you need is a simple shell script that gives you the basics, fast.
In this blog, I’ll walk you through a lightweight bash script that checks five critical aspects of any Linux system:
CPU load (1-minute average)
Memory usage
Disk usage on the root filesystem
Network connectivity (ping test)
Status of a critical service like sshd
This script is easy to use, depends only on common utilities (uptime, free, df, ping, bc, and systemctl), and is automation-friendly, making it a good fit for cron jobs, pre-deployment hooks, or quick diagnostics during incident response.
1. Check CPU Load:
This shell script snippet checks the 1-minute average CPU load on a system and compares it against a defined threshold. If the load exceeds the threshold, it flags it as a failure; otherwise, it's considered OK.
# 1. Check CPU Load (only 1-minute average)
cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs) # Get 1-minute load average
cpu_threshold=1.5 # Set CPU load threshold value
cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l) # Compare using bc
print_status "$cpu_check" "CPU Load: $cpu_load" # Print status based on comparison
Line-by-Line Explanation:
cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)
Purpose: Extracts the 1-minute load average from the uptime command output.
How:
- uptime reports the system load averages (1, 5, and 15 minutes).
- awk isolates the part after "load average:".
- cut keeps only the first value (the 1-minute load).
- xargs trims any leading/trailing whitespace.
cpu_threshold=1.5
Purpose: Defines a threshold value for CPU load.
Explanation: If the CPU load is above this value (1.5), it's considered high.
cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l)
Purpose: Compares the actual CPU load against the threshold using the bc calculator.
Returns:
- 1 if the load is greater than the threshold.
- 0 if the load is less than or equal to the threshold.
print_status "$cpu_check" "CPU Load: $cpu_load"
Purpose: Calls the print_status function to output a status message.
Behavior: Depending on cpu_check:
- If 1, it prints [FAIL] CPU Load: x.xx
- If 0, it prints [OK] CPU Load: x.xx
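The bc comparison assumes bc is installed, which is not always true on minimal images. As a hedged alternative, the sketch below reads the 1-minute figure from /proc/loadavg and does the floating-point comparison in awk. The helper names read_load_1m and load_exceeds are hypothetical and not part of the original script; the 1.5 threshold is carried over unchanged.

```shell
#!/usr/bin/env bash
# Sketch: compare the 1-minute load average to a threshold without bc.
# read_load_1m and load_exceeds are hypothetical helper names.

# Print the 1-minute load average (first field of /proc/loadavg on Linux).
read_load_1m() {
    cut -d' ' -f1 /proc/loadavg
}

# Print 1 if the load ($1) is greater than the threshold ($2), otherwise 0.
load_exceeds() {
    awk -v cur="$1" -v lim="$2" 'BEGIN { print ((cur + 0 > lim + 0) ? 1 : 0) }'
}

cpu_threshold=1.5
if [ -r /proc/loadavg ]; then
    cpu_load=$(read_load_1m)
    cpu_check=$(load_exceeds "$cpu_load" "$cpu_threshold")
    echo "load=$cpu_load check=$cpu_check"
fi
```

A load of 1.5 means something very different on 2 cores than on 32, so one option is to scale the threshold by the core count reported by nproc before comparing.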
2. Check Memory Usage:
This shell script snippet checks system memory usage and evaluates whether it crosses a specified threshold (90% usage in this case). If usage is below the threshold, it reports OK; otherwise, it indicates a failure.
# 2. Check Memory Usage (in percentage)
mem_free=$(free -m | awk '/^Mem:/ {print $4}') # Free memory in MB
mem_total=$(free -m | awk '/^Mem:/ {print $2}') # Total memory in MB
mem_percent=$((100 - mem_free * 100 / mem_total)) # Memory used in percentage
if [ "$mem_percent" -lt 90 ]; then
    mem_status=0 # OK status if usage is below threshold
else
    mem_status=1 # FAIL status otherwise
fi
print_status "$mem_status" "Memory usage: ${mem_percent}% used" # Print memory status
Line-by-Line Explanation:
mem_free=$(free -m | awk '/^Mem:/ {print $4}')
Purpose: Gets the amount of free memory in MB.
Command breakdown:
- free -m shows memory in megabytes.
- awk '/^Mem:/ {print $4}' filters the line starting with Mem: and prints the 4th column, which represents free memory.
mem_total=$(free -m | awk '/^Mem:/ {print $2}')
Purpose: Gets the total memory in MB.
Explanation: Again uses awk on the Mem: line but retrieves the 2nd column for total physical memory.
mem_percent=$((100 - mem_free * 100 / mem_total))
Purpose: Calculates the percentage of memory used.
Formula:
used = 100 - (free * 100 / total)
- This gives the percentage of memory currently in use, rounded by integer arithmetic. Because the free column does not count buffers and cache that the kernel can reclaim, the figure tends to overstate real memory pressure.
if [ "$mem_percent" -lt 90 ]; then
    mem_status=0
else
    mem_status=1
fi
Purpose: Compares the memory usage against the threshold (90%).
Result:
- If usage is under 90%, status is 0 (OK).
- If usage is 90% or higher, status is 1 (FAIL).
print_status "$mem_status" "Memory usage: ${mem_percent}% used"
Purpose: Calls a function to print a status message.
Output:
- [OK] Memory usage: xx% used (if under the threshold)
- [FAIL] Memory usage: xx% used (if at or above the threshold)
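One way to account for reclaimable cache is to base the percentage on the kernel's MemAvailable estimate in /proc/meminfo (present since Linux 3.14) instead of the free column. This is a minimal sketch under that assumption; mem_used_percent is a hypothetical helper name, and the formula is the same as in the snippet above with "available" substituted for "free".

```shell
#!/usr/bin/env bash
# Sketch: memory usage based on MemAvailable instead of the "free" column.
# mem_used_percent is a hypothetical helper name.

# Print the used-memory percentage given total ($1) and available ($2),
# both in the same unit. Uses integer arithmetic, like the original.
mem_used_percent() {
    echo $(( 100 - $2 * 100 / $1 ))
}

if [ -r /proc/meminfo ]; then
    mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    if [ -n "$mem_total_kb" ] && [ -n "$mem_avail_kb" ]; then
        echo "Memory usage: $(mem_used_percent "$mem_total_kb" "$mem_avail_kb")% used"
    fi
fi
```

For example, mem_used_percent 8000 2000 prints 75.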
3. Check Disk Usage on Root Filesystem (/):
This shell script snippet checks the percentage of disk space used on the root (/) filesystem and compares it to a 90% usage threshold. If usage is below the threshold, it returns OK; otherwise, it flags a failure.
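# 3. Check Disk Usage on Root Filesystem (/)
disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%') # Used percentage without the % sign
if [ "$disk_usage" -lt 90 ]; then
    disk_status=0 # OK status if usage is below threshold
else
    disk_status=1 # FAIL status otherwise
fi
print_status "$disk_status" "Disk usage on /: ${disk_usage}% used" # Print disk status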
Line-by-Line Explanation:
disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
Purpose: Retrieves the disk usage percentage (without the % symbol) for the root filesystem.
How it works:
- df / shows disk usage stats for the root mount point.
- awk 'NR==2 {print $5}' selects the second line (skipping the header) and the fifth column, which is the usage percentage.
- tr -d '%' removes the % symbol to make it a pure number for comparison.
if [ "$disk_usage" -lt 90 ]; then
    disk_status=0
else
    disk_status=1
fi
Purpose: Compares the disk usage value with the 90% threshold.
Result:
- If usage is less than 90%, disk_status is set to 0 (OK).
- If usage is 90% or more, it's set to 1 (FAIL).
print_status "$disk_status" "Disk usage on /: ${disk_usage}% used"
Purpose: Outputs the disk usage status using the print_status function.
Output Example:
[OK] Disk usage on /: 65% used
[FAIL] Disk usage on /: 91% used
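The same check can be generalized to any mount point. The sketch below is one possible shape, not the original script: parse_df_percent and disk_usage_percent are hypothetical helper names, and it passes -P to df so the output uses the POSIX format with one line per filesystem, which avoids the wrapped lines that older df versions produce for long device names.

```shell
#!/usr/bin/env bash
# Sketch: disk usage check for an arbitrary mount point.
# parse_df_percent and disk_usage_percent are hypothetical helper names.

# Read `df -P` output on stdin and print the Capacity column of the
# first data row without the trailing % sign.
parse_df_percent() {
    awk 'NR==2 {print $5}' | tr -d '%'
}

# Print the usage percentage for the mount point given as $1 (default: /).
disk_usage_percent() {
    df -P "${1:-/}" | parse_df_percent
}

disk_threshold=90   # same 90% threshold as the original snippet
disk_usage=$(disk_usage_percent /)
if [ -n "$disk_usage" ] && [ "$disk_usage" -lt "$disk_threshold" ]; then
    echo "[OK] Disk usage on /: ${disk_usage}% used"
else
    echo "[FAIL] Disk usage on /: ${disk_usage}% used"
fi
```

Calling disk_usage_percent /var (or any other mount point) reuses the same parsing without duplicating the pipeline.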
4. Ping Test – Check Network Reachability
This shell script snippet tests whether the system has basic network connectivity by pinging a reliable public IP address — Google's DNS (8.8.8.8). If the ping is successful, it reports OK; otherwise, it flags a failure.
# 4. Ping Test - Check if the network is reachable by pinging Google's DNS (8.8.8.8)
ping -c 1 8.8.8.8 > /dev/null 2>&1 # Ping Google DNS to verify network connection
ping_status=$? # 0 = success, non-zero = failure
print_status "$ping_status" "Network: Ping to 8.8.8.8" # Print network status
Line-by-Line Explanation:
ping -c 1 8.8.8.8 > /dev/null 2>&1
Purpose: Sends 1 ICMP packet to 8.8.8.8 to check if the network is reachable.
Redirection:
- > /dev/null suppresses the standard output.
- 2>&1 also suppresses standard error.
Why 8.8.8.8? It's a globally reachable, highly reliable IP, which makes it a reasonable target for basic connectivity checks. Keep in mind that a network that blocks outbound ICMP will report a failure even when other traffic works.
ping_status=$?
Purpose: Captures the exit status of the ping command.
Value:
- 0 → ping was successful (network is reachable).
- Non-zero → ping failed (network issue or no connectivity).
print_status "$ping_status" "Network: Ping to 8.8.8.8"
Purpose: Prints the result of the ping test using the print_status function.
Output Example:
[OK] Network: Ping to 8.8.8.8
[FAIL] Network: Ping to 8.8.8.8
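A single ping to a single address can hang on a slow link or fail for reasons unrelated to your server. As a hedged variation, the sketch below adds a reply timeout and tries more than one target. The helper names probe and any_reachable are hypothetical, 1.1.1.1 is an assumed second target that the original script does not use, and the -W option is taken to mean "timeout in seconds" as on Linux iputils ping.

```shell
#!/usr/bin/env bash
# Sketch: connectivity check with a timeout and more than one target.
# probe and any_reachable are hypothetical helper names; 1.1.1.1 is an
# assumed second target, not something the original script uses.

# Return 0 if a single ping to host $1 gets a reply within 2 seconds.
# -W is the reply timeout in seconds on Linux iputils ping.
probe() {
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# Return 0 as soon as any of the given hosts answers, 1 if none do.
any_reachable() {
    local host
    for host in "$@"; do
        if probe "$host"; then
            return 0
        fi
    done
    return 1
}

if any_reachable 8.8.8.8 1.1.1.1; then
    ping_status=0
else
    ping_status=1
fi
echo "ping_status=$ping_status"
```

The resulting ping_status can be passed to print_status exactly as in the original snippet.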
5. Check if a Critical Service (sshd) is Running
This shell script snippet checks the status of a critical system service, in this case sshd (the SSH daemon), to ensure it's actively running. It's useful for validating that key services remain operational.
# 5. Check if critical service (e.g., sshd) is running
# Using 'systemctl' to check the status of the 'sshd' service
if systemctl is-active --quiet sshd; then
    print_status 0 "Service 'sshd' is running"
else
    print_status 1 "Service 'sshd' is NOT running"
fi
Line-by-Line Explanation:
if systemctl is-active --quiet sshd; then
Purpose: Checks if the sshd service is active using systemctl.
Details:
- systemctl is-active sshd prints active if the service is running.
- --quiet suppresses that output, so only the return code is used.
- If the command succeeds (exit code 0), it means the service is running.
print_status 0 "Service 'sshd' is running"
Executed when: sshd is running.
Purpose: Uses the print_status function to report success with an [OK] status.
else
    print_status 1 "Service 'sshd' is NOT running"
fi
Executed when: sshd is not running.
Purpose: Reports a [FAIL] status via the same print_status function, alerting the user/admin.
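If more than one service matters, the same systemctl is-active test can be wrapped in a loop. The following is a minimal sketch, assuming a systemd host; check_services is a hypothetical helper name, and the unit names passed to it are examples only (on some distributions the SSH unit is called ssh rather than sshd, and the cron unit may be crond).

```shell
#!/usr/bin/env bash
# Sketch: check several services with one helper.
# check_services is a hypothetical helper name; the unit names passed to it
# are examples and may differ per distribution.

# Print one [OK]/[FAIL] line per unit and return the number of failures.
check_services() {
    local unit failed=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit" 2>/dev/null; then
            echo "[OK] Service '$unit' is running"
        else
            echo "[FAIL] Service '$unit' is NOT running"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

if command -v systemctl > /dev/null 2>&1; then
    check_services sshd cron || echo "One or more services are not running"
else
    echo "systemctl not found; skipping service checks"
fi
```

The command -v guard keeps the script from failing noisily on hosts without systemd, such as many containers.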
Print Color-Coded Status Messages
The print_status function is a utility that displays status messages in a clear, color-coded format, typically used in health checks or monitoring scripts. It prints [OK] in green when the status is success (0) and [FAIL] in red when it fails (non-zero).
Line-by-Line Explanation:
print_status() {
    local status=$1 # $1 is the exit/status code passed to the function (0 = success)
    local message=$2 # $2 is the message to be printed (context of the check)
Declares a function named print_status. It accepts two arguments:
- status: exit code (0 for success, non-zero for failure).
- message: descriptive message about what was tested.
    if [ "$status" -eq 0 ]; then
        echo -e "${GREEN} [OK]${RESET} $message"
If status == 0, it's considered a successful check. Uses echo -e to:
- Enable interpretation of escape sequences (for colors).
- Print [OK] in green followed by the message.
    else
        echo -e "${RED} [FAIL]${RESET} $message"
    fi
}
If status != 0, it's a failure. Prints [FAIL] in red with the associated message.
Note:
For this function to work properly, define the color codes and the function itself near the top of your script, before the first check calls print_status:
GREEN='\033[0;32m'
RED='\033[0;31m'
RESET='\033[0m'
These define ANSI escape sequences to colorize terminal output.
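When the output goes to a log file or a pipe rather than a terminal, the raw escape sequences end up in the text. The sketch below is an assumed refinement, not the original function: it only emits color when stdout is a terminal (tested with [ -t 1 ]) and uses printf instead of echo -e. The lowercase variable names are introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print_status variant that only emits color when stdout is a terminal.
# This is an assumed refinement, not the original function.

print_status() {
    local status=$1 message=$2
    local green='' red='' reset=''
    # Enable ANSI colors only when stdout is attached to a terminal.
    if [ -t 1 ]; then
        green='\033[0;32m'; red='\033[0;31m'; reset='\033[0m'
    fi
    if [ "$status" -eq 0 ]; then
        printf '%b[OK]%b %s\n' "$green" "$reset" "$message"
    else
        printf '%b[FAIL]%b %s\n' "$red" "$reset" "$message"
    fi
}

print_status 0 "CPU Load: 0.45"
print_status 1 "Service 'sshd' is NOT running"
```

Redirected to a file, the two calls produce plain [OK] and [FAIL] lines that are easy to grep.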
✅ Sample Output: System Health Check
[OK] CPU Load: 0.45
[OK] Memory usage: 61% used
[OK] Disk usage on /: 47% used
[OK] Network: Ping to 8.8.8.8
[OK] Service 'sshd' is running
✅ Key Uses of the Script:
Proactive Monitoring:
- Helps detect early signs of system resource exhaustion (e.g., high CPU, memory, or disk usage).
Automation-Friendly:
- Can be run on a schedule (e.g., via cron) to generate periodic health reports or trigger alerts.
Troubleshooting Aid:
Quickly pinpoints common problems like:
High load on CPU
Low available memory
Nearly full disk
Network outages
Critical services not running (e.g., sshd)
Custom Health Dashboards or Scripts:
- Outputs one line per check with a consistent [OK] or [FAIL] prefix, which logging and monitoring tools (such as Nagios checks, Prometheus exporters, or CI/CD pipelines) can parse once the color codes are stripped.
Lightweight Alternative to Heavy Monitoring Tools:
- Ideal for minimal setups or containerized environments where you don’t want to install full-fledged monitoring agents.
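For cron jobs and pipelines, the script's exit code matters more than its colors. One way to support that, sketched here under the assumption that every check reports through a single function, is to count failures as they are printed. The names record_check, overall_status, and failures are hypothetical, and the two sample calls use made-up values rather than real readings.

```shell
#!/usr/bin/env bash
# Sketch: count failed checks so a scheduler or pipeline can react.
# record_check, overall_status, and failures are hypothetical names.

failures=0

# Print an [OK]/[FAIL] line for status $1 and message $2, counting failures.
record_check() {
    if [ "$1" -eq 0 ]; then
        echo "[OK] $2"
    else
        echo "[FAIL] $2"
        failures=$((failures + 1))
    fi
}

# Return 0 only if no check has failed so far.
overall_status() {
    [ "$failures" -eq 0 ]
}

record_check 0 "Disk usage on /: 47% used"        # example values, not real readings
record_check 1 "Service 'sshd' is NOT running"

echo "Failed checks: $failures"
```

Ending the real script with overall_status as its last command makes the script exit non-zero when any check failed, so a cron entry or CI step can alert on the exit code alone.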
Conclusion
This script serves as a simple yet powerful system health check utility that provides immediate visibility into critical aspects of a Linux server's performance and availability. By evaluating CPU load, memory usage, disk space, network connectivity, and service status, it enables:
Quick diagnostics during troubleshooting
Proactive monitoring to prevent outages
Lightweight integration into cron jobs, CI/CD pipelines, or custom dashboards
It’s an essential tool for DevOps, SysAdmins, and SREs who need fast, scriptable insights without relying on full-scale monitoring suites.
Written by Kishore
