Lightweight Linux System Health Check Script

When a server is misbehaving or a deployment is about to start, you need a quick and reliable way to assess its health. Monitoring tools like Nagios, Prometheus, and Datadog are powerful, but sometimes all you need is a simple shell script that gives you the basics, fast.

In this blog, I’ll walk you through a lightweight bash script that checks five critical aspects of any Linux system:

- CPU load (1-minute average)
- Memory usage
- Disk usage on the root filesystem
- Network connectivity (ping test)
- Status of a critical service like sshd

This script is easy to use, automation-friendly, and portable across most systemd-based Linux distributions, making it a good fit for cron jobs, pre-deployment hooks, or quick diagnostics during incident response.

1. Check CPU Load:

This shell script snippet reads the system's 1-minute load average and compares it against a defined threshold. If the load exceeds the threshold, it flags it as a failure; otherwise, it's considered OK.

# 1. Check CPU Load (only 1-minute average)

cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)  # Get 1-minute load average
cpu_threshold=1.5  # Set CPU load threshold value

cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l)  # Compare using bc

print_status "$cpu_check" "CPU Load: $cpu_load"  # Print status based on comparison

Line-by-Line Explanation:

cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)

Purpose: Extracts the 1-minute CPU load average from the uptime command output.
How:
- uptime gives system load averages (1, 5, and 15 minutes).
- awk isolates the part after "load average:".
- cut gets only the first value (1-minute load).
- xargs trims any leading/trailing whitespace.
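On Linux, the kernel exposes the same numbers in /proc/loadavg, which avoids parsing the human-readable uptime line. The sketch below shows one possible way to read it. The function name read_load1 and its optional file argument are illustrative assumptions, not part of the original script.

```shell
# Hypothetical helper: print the 1-minute load average.
# The optional argument (default: /proc/loadavg) only exists to make testing easy.
read_load1() {
    local file="${1:-/proc/loadavg}"
    local load1 rest
    # /proc/loadavg looks like: "0.45 0.30 0.25 1/234 5678"
    read -r load1 rest < "$file" || true
    # Fail if nothing could be read (for example, the file is missing).
    [ -n "$load1" ] || return 1
    echo "$load1"
}

# Typical use:
#   cpu_load=$(read_load1)
```

The first field is the 1-minute average, so no cut or xargs step is needed.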

cpu_threshold=1.5

Purpose: Defines a threshold value for CPU load.
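Explanation: A 1-minute load average above this value (1.5) is considered high. Load average is a system-wide figure, not a per-core percentage, so tune the threshold to the number of CPUs on the host.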
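A fixed value like 1.5 means something different on a single-core VM than on a 16-core server. One way to adapt it is to scale a per-core limit by the core count. This is only a sketch: the helper name scaled_threshold and the 0.75 per-core figure are assumptions chosen for illustration, not recommendations from the original script.

```shell
# Hypothetical helper: multiply the core count by a per-core load limit.
# Prints the result with two decimals so it can be compared with the load average.
scaled_threshold() {
    awk -v cores="$1" -v per_core="$2" 'BEGIN { printf "%.2f\n", cores * per_core }'
}

# Typical use (nproc prints the number of available processing units):
#   cpu_threshold=$(scaled_threshold "$(nproc)" 0.75)

scaled_threshold 4 0.75   # sample values; prints 3.00
```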

cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l)

Purpose: Compares the actual CPU load against the threshold using the bc calculator.
Returns:
- 1 if the load is greater than the threshold.
- 0 if the load is less than or equal to the threshold.
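bc is not always installed on minimal images or containers. If that is a concern, the same floating-point comparison can be sketched with awk, which is already used elsewhere in the script. The function name load_exceeds is an assumption made for this example.

```shell
# Hypothetical helper: print 1 if the first number is greater than the second,
# otherwise 0 (the same convention as the bc comparison).
load_exceeds() {
    awk -v cur="$1" -v max="$2" 'BEGIN { r = (cur > max) ? 1 : 0; print r }'
}

cpu_check=$(load_exceeds "0.45" "1.5")   # sample values, not live data
echo "cpu_check=$cpu_check"              # prints cpu_check=0
```

Because awk treats numeric-looking values passed with -v as numbers, 10.0 is correctly reported as greater than 9.5, which a plain string comparison would get wrong.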

print_status "$cpu_check" "CPU Load: $cpu_load"

Purpose: Calls the print_status function to output a status message.
Behavior: Depending on cpu_check:
- If 1, it prints [FAIL] CPU Load: x.xx
- If 0, it prints [OK] CPU Load: x.xx

2. Check Memory Usage:

This shell script snippet checks system memory usage and evaluates whether it crosses a specified threshold (90% usage in this case). If usage is below the threshold, it reports OK; otherwise, it indicates a failure.

# 2. Check Memory Usage (in percentage)
mem_free=$(free -m | awk '/^Mem:/ {print $4}') # Free memory in MB
mem_total=$(free -m | awk '/^Mem:/ {print $2}') # Total memory in MB

mem_percent=$((100 - mem_free * 100 / mem_total))  # Memory used in percentage

if [ "$mem_percent" -lt 90 ]; then
    mem_status=0   # OK status if usage is below threshold
else
    mem_status=1  # FAIL status otherwise
fi

print_status "$mem_status" "Memory usage: ${mem_percent}% used"  # Print memory status

Line-by-Line Explanation:

mem_free=$(free -m | awk '/^Mem:/ {print $4}')

Purpose: Gets the amount of free memory in MB.
Command breakdown:
- free -m shows memory in megabytes.
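- awk '/^Mem:/ {print $4}' filters the line starting with Mem: and prints the 4th column, which represents free memory. This column does not count memory held in buffers and cache that the kernel can reclaim, so the resulting usage figure tends to look higher than the real memory pressure.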

mem_total=$(free -m | awk '/^Mem:/ {print $2}')

Purpose: Gets the total memory in MB.
Explanation: Again uses awk on the Mem: line but retrieves the 2nd column for total physical memory.
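If you would rather base the check on memory that applications can actually obtain, recent versions of free print an "available" column as the 7th field of the Mem: line. The sketch below reads that column. The helper name mem_available_mb is an assumption, and older versions of free may not print this column at all, so verify the output format on your hosts first.

```shell
# Hypothetical helper: read `free -m` output on stdin and print the
# "available" column (7th field of the Mem: line) in MB.
mem_available_mb() {
    awk '/^Mem:/ {print $7}'
}

# Typical use on a live system:
#   mem_avail=$(free -m | mem_available_mb)
```

Reading from stdin keeps the parsing separate from the command, which makes it easy to try the helper against saved output.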

mem_percent=$((100 - mem_free * 100 / mem_total))

Purpose: Calculates percentage of memory used.
Formula: used = 100 - (free * 100 / total)
- This gives a percentage value of memory currently utilized.
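Shell arithmetic works on integers only, so the division truncates before the subtraction. The worked example below wraps the same formula in a function to show the effect. The function name mem_used_percent and the sample numbers are assumptions for illustration.

```shell
# Hypothetical helper: used-memory percentage from free and total memory (both in MB).
mem_used_percent() {
    local free_mb="$1"
    local total_mb="$2"
    # Guard against division by zero if parsing failed upstream.
    [ "$total_mb" -gt 0 ] || return 1
    echo $((100 - free_mb * 100 / total_mb))
}

# 3000 * 100 / 8000 = 37 (truncated from 37.5), and 100 - 37 = 63
mem_used_percent 3000 8000   # prints 63
```

Since the free percentage is rounded down, the used percentage is rounded up by at most one point, which errs on the side of caution for a threshold check.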

if [ "$mem_percent" -lt 90 ]; then
    mem_status=0
else
    mem_status=1
fi

Purpose: Compares the memory usage against the threshold (90%).
Result:
- If usage is under 90%, status is 0 (OK).
- If usage is 90% or higher, status is 1 (FAIL).

print_status "$mem_status" "Memory usage: ${mem_percent}% used"

Purpose: Calls a function to print a status message.
Output:
- [OK] Memory usage: xx% used (if under threshold)
- [FAIL] Memory usage: xx% used (if at or over threshold)

3. Check Disk Usage on Root Filesystem (`/`):

This shell script snippet checks the percentage of disk space used on the root (/) filesystem, and compares it to a 90% usage threshold. If usage is below the threshold, it returns OK; otherwise, it flags a failure.

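# 3. Check Disk Usage on Root Filesystem (/)
disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')  # Usage percentage without the % sign

if [ "$disk_usage" -lt 90 ]; then
    disk_status=0  # OK status if usage is below threshold
else
    disk_status=1  # FAIL status otherwise
fi

print_status "$disk_status" "Disk usage on /: ${disk_usage}% used"  # Print disk status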

Line-by-Line Explanation:

disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')

Purpose: Retrieves the disk usage percentage (without the % symbol) for the root filesystem.
How it works:
- df / shows disk usage stats for the root mount point.
- awk 'NR==2 {print $5}' selects the second line (skipping the header) and the fifth column, which is the usage percentage.
- tr -d '%' removes the % symbol to make it a pure number for comparison.
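The NR==2 approach assumes that df prints each filesystem on a single line. With long device names, some df versions wrap the entry onto two lines, which shifts the columns. Passing -P (the POSIX output format) keeps each filesystem on one line. The sketch below separates the parsing into a helper; the name usage_percent_from_df is an assumption made for this example.

```shell
# Hypothetical helper: read `df -P <mount>` output on stdin and print the
# usage percentage of the first filesystem as a bare number.
usage_percent_from_df() {
    awk 'NR==2 {print $5}' | tr -d '%'
}

# Typical use on a live system:
#   disk_usage=$(df -P / | usage_percent_from_df)
```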

if [ "$disk_usage" -lt 90 ]; then
    disk_status=0
else
    disk_status=1
fi

Purpose: Compares the disk usage value with the 90% threshold.
Result:
- If usage is less than 90%, disk_status is set to 0 (OK).
- If usage is 90% or more, it's set to 1 (FAIL).

print_status "$disk_status" "Disk usage on /: ${disk_usage}% used"

Purpose: Outputs the disk usage status using the print_status function.
Output Example:
- [OK] Disk usage on /: 65% used
- [FAIL] Disk usage on /: 91% used

4. Ping Test – Check Network Reachability

This shell script snippet tests whether the system has basic network connectivity by pinging a reliable public IP address — Google's DNS (8.8.8.8). If the ping is successful, it reports OK; otherwise, it flags a failure.

# 4. Ping Test - Check if the network is reachable by pinging Google's DNS (8.8.8.8)

ping -c 1 8.8.8.8 > /dev/null 2>&1 # Ping Google DNS to verify network connection
ping_status=$?    # 0 = success, non-zero = failure

print_status "$ping_status" "Network: Ping to 8.8.8.8"  # Print network status

Line-by-Line Explanation:

ping -c 1 8.8.8.8 > /dev/null 2>&1

Purpose: Sends 1 ICMP packet to 8.8.8.8 to check if the network is reachable.
Redirection:
- > /dev/null suppresses the standard output.
- 2>&1 also suppresses standard error.
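Why 8.8.8.8? It's Google's public DNS address, which is widely reachable and highly available, so it works well for a basic connectivity check. Keep in mind that some networks block outbound ICMP, in which case the ping fails even though the network is up.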
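A single target without a timeout can hang the check or report a false failure when only that one address is unreachable. One possible hardening is to try a short list of hosts with a per-ping timeout and stop at the first reply. The helper name ping_any and the 2-second timeout are assumptions, and -W is the reply timeout in seconds on Linux ping.

```shell
# Hypothetical helper: ping each host once with a 2-second timeout and
# print the first one that answers. Returns 1 if none of them reply.
ping_any() {
    local host
    for host in "$@"; do
        if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
            echo "$host"
            return 0
        fi
    done
    return 1
}

# Typical use (8.8.8.8 is Google DNS, 1.1.1.1 is Cloudflare DNS):
#   if host=$(ping_any 8.8.8.8 1.1.1.1); then
#       print_status 0 "Network: Ping to $host"
#   else
#       print_status 1 "Network: No ping target reachable"
#   fi
```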

ping_status=$?

Purpose: Captures the exit status of the ping command.
Value:
- 0 → ping was successful (network is reachable).
- Non-zero → ping failed (network issue or no connectivity).

print_status "$ping_status" "Network: Ping to 8.8.8.8"

Purpose: Prints the result of the ping test using the print_status function.
Output Example:
- [OK] Network: Ping to 8.8.8.8
- [FAIL] Network: Ping to 8.8.8.8

5. Check if a Critical Service (`sshd`) is Running

This shell script snippet checks the status of a critical system service — in this case, sshd (the SSH daemon) — to ensure it's actively running. It's useful for validating that key services remain operational.

# 5. Check if critical service (e.g., sshd) is running
# Using 'systemctl' to check the status of the 'sshd' service
if systemctl is-active --quiet sshd; then
    print_status 0 "Service 'sshd' is running"
else
    print_status 1 "Service 'sshd' is NOT running"
fi

Line-by-Line Explanation:

if systemctl is-active --quiet sshd; then

Purpose: Checks if the sshd service is active using systemctl.
Details:
- systemctl is-active sshd prints active and exits with code 0 if the service is running.
- --quiet suppresses output — only the return code is used.
- If the command succeeds (exit code 0), it means the service is running.
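The unit name for the SSH daemon is not the same everywhere: it is typically sshd on RHEL and Fedora, and ssh on Debian and Ubuntu. If the script has to run on both families, one option is to check a list of candidate names. The helper name first_active_unit is an assumption made for this sketch.

```shell
# Hypothetical helper: print the first unit from the argument list that
# systemctl reports as active. Returns 1 if none of them are active.
first_active_unit() {
    local unit
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit"
            return 0
        fi
    done
    return 1
}

# Typical use:
#   if unit=$(first_active_unit sshd ssh); then
#       print_status 0 "Service '$unit' is running"
#   else
#       print_status 1 "No SSH service unit is active"
#   fi
```

The same helper works for any other critical service by passing different unit names.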

print_status 0 "Service 'sshd' is running"

Executed when: sshd is running.
Purpose: Uses the print_status function to report success with an [OK] status.

else
    print_status 1 "Service 'sshd' is NOT running"
fi

Executed when: sshd is not running.
Purpose: Reports a [FAIL] status via the same print_status function, alerting the user/admin.

Print Color-Coded Status Messages

The print_status function is a utility designed to display status messages in a clear, color-coded format — typically used in health checks or monitoring scripts. It prints [OK] in green when the status is success (0) and [FAIL] in red when it fails (non-zero).

Line-by-Line Explanation:

print_status() {
    local status=$1  # $1 is the exit/status code passed to the function (0 = success)
    local message=$2 # $2 is the message to be printed (context of the check)

Declares a function named print_status.
Accepts two arguments:
- status: exit code (0 for success, non-zero for failure).
- message: descriptive message about what was tested.

    if [ "$status" -eq 0 ]; then
        echo -e "${GREEN} [OK]${RESET} $message"

If status == 0, it's considered a successful check.
Uses echo -e to:
- Enable interpretation of escape sequences (for colors).
- Print [OK] in green followed by the message.

    else
        echo -e "${RED} [FAIL]${RESET} $message"
    fi
}

If status != 0, it's a failure.
Prints [FAIL] in red with the associated message.

Note:

For this function to print colors, define the color codes near the top of your script, before the first call to print_status:

GREEN='\033[0;32m'
RED='\033[0;31m'
RESET='\033[0m'

These define ANSI escape sequences to colorize terminal output.
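When the script runs from cron or its output is redirected to a file, the escape sequences end up as literal characters in the log. A possible variation of the function, sketched below, only emits colors when standard output is a terminal. It uses printf instead of echo -e; that choice and the terminal check are assumptions of this sketch, not part of the original function.

```shell
# Variation of print_status (a sketch): colorize only when stdout is a terminal.
print_status() {
    local status="$1"
    local message="$2"
    local label color
    if [ "$status" -eq 0 ]; then
        label="OK";   color='\033[0;32m'   # green
    else
        label="FAIL"; color='\033[0;31m'   # red
    fi
    if [ -t 1 ]; then
        # Interactive terminal: wrap the label in ANSI color codes.
        printf "${color} [%s]\033[0m %s\n" "$label" "$message"
    else
        # Pipe, file, or cron: plain text only.
        printf ' [%s] %s\n' "$label" "$message"
    fi
}

print_status 0 "CPU Load: 0.45"   # sample call with made-up values
```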

✅ Sample Output: System Health Check

 [OK] CPU Load: 0.45
 [OK] Memory usage: 61% used
 [OK] Disk usage on /: 47% used
 [OK] Network: Ping to 8.8.8.8
 [OK] Service 'sshd' is running

✅ Key Uses of the Script:

Proactive Monitoring:
- Helps detect early signs of system resource exhaustion (e.g., high CPU, memory, or disk usage).
Automation-Friendly:
- Can be run on a schedule (e.g., via cron) to generate periodic health reports or trigger alerts.
Troubleshooting Aid:
- Quickly pinpoints common problems like:
  - High load on CPU
  - Low available memory
  - Nearly full disk
  - Network outages
  - Critical services not running (e.g., sshd)
Custom Health Dashboards or Scripts:
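- Outputs one status line per check with a consistent [OK] or [FAIL] prefix, which is easy to grep or feed into other tooling (Nagios checks, Prometheus exporters, or CI/CD pipelines). Turn the colors off when the output is parsed, since the ANSI escape codes end up in the captured text.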
Lightweight Alternative to Heavy Monitoring Tools:
- Ideal for minimal setups or containerized environments where you don’t want to install full-fledged monitoring agents.
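For cron jobs and pre-deployment hooks, the most useful signal is usually the exit code of the whole script rather than the printed text. The snippets above only print their results, so the sketch below shows one way to count failures and exit non-zero at the end. The names failures and record_status are hypothetical and do not appear in the original script.

```shell
failures=0   # hypothetical counter, not part of the original script

# Hypothetical wrapper: print a plain status line and count failures.
record_status() {
    local status="$1"
    local message="$2"
    if [ "$status" -eq 0 ]; then
        echo "[OK] $message"
    else
        echo "[FAIL] $message"
        failures=$((failures + 1))
    fi
}

record_status 0 "CPU Load: 0.45"              # sample values
record_status 1 "Disk usage on /: 91% used"   # sample values

# At the very end of the script:
#   [ "$failures" -eq 0 ] || exit 1
```

With a non-zero exit code, a cron wrapper or deployment pipeline can react with an ordinary if statement or || operator. Call the wrapper directly rather than inside $(...), because a command substitution runs in a subshell and the counter update would be lost.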

Conclusion

This script serves as a simple yet powerful system health check utility that provides immediate visibility into critical aspects of a Linux server's performance and availability. By evaluating CPU load, memory usage, disk space, network connectivity, and service status, it enables:

- Quick diagnostics during troubleshooting
- Proactive monitoring to prevent outages
- Lightweight integration into cron jobs, CI/CD pipelines, or custom dashboards

It's a handy tool for DevOps engineers, sysadmins, and SREs who need fast, scriptable insights without relying on full-scale monitoring suites.

Kishore
