Lightweight Linux System Health Check Script

Kishore

In today's fast-paced DevOps and SRE environments, it's essential to have quick and reliable tools to assess the health of your servers — especially when you're dealing with performance issues or preparing for deployments. While there are powerful monitoring tools like Nagios, Prometheus, and Datadog, sometimes all you need is a simple shell script that gives you the basics, fast.

In this blog, I’ll walk you through a lightweight bash script that checks five critical aspects of any Linux system:

  • CPU load (1-minute average)

  • Memory usage

  • Disk usage on the root filesystem

  • Network connectivity (ping test)

  • Status of a critical service like sshd

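This script is easy to use and automation-friendly, and it runs on most systemd-based Linux distributions that have bash and a handful of common utilities (bc, free, df, ping) installed. That makes it a good fit for cron jobs, pre-deployment hooks, or quick diagnostics during incident response.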

1. Check CPU Load:

This shell script snippet checks the 1-minute average CPU load on a system and compares it against a defined threshold. If the load exceeds the threshold, it flags it as a failure; otherwise, it's considered OK.

# 1. Check CPU Load (only 1-minute average)

cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)  # Get 1-minute load average
cpu_threshold=1.5  # Set CPU load threshold value

cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l)  # Compare using bc

print_status "$cpu_check" "CPU Load: $cpu_load"  # Print status based on comparison

Line-by-Line Explanation:

cpu_load=$(uptime | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)
  • Purpose: Extracts the 1-minute CPU load average from the uptime command output.

  • How:

    • uptime gives system load averages (1, 5, and 15 minutes).

    • awk isolates the part after "load average:".

    • cut gets only the first value (1-minute load).

    • xargs trims any leading/trailing whitespace.
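To see what each stage of the pipeline contributes, here is a minimal sketch that runs it against a fixed sample line instead of live uptime output. The sample string and the variable names load1 and load1_proc are my own illustrative assumptions, not part of the original script. On Linux, the same figure is also available as the first field of /proc/loadavg, which avoids parsing uptime's human-oriented output.

```shell
#!/usr/bin/env bash
# Hypothetical sample line standing in for real `uptime` output.
sample=' 10:15:01 up 12 days,  3:42,  2 users,  load average: 0.45, 0.60, 0.72'

# Same pipeline as the script, applied to the sample line:
#   awk   -> keeps everything after "load average:"
#   cut   -> keeps the first comma-separated value
#   xargs -> trims surrounding whitespace
load1=$(echo "$sample" | awk -F'load average:' '{print $2}' | cut -d',' -f1 | xargs)
echo "1-minute load (parsed from sample): $load1"

# On Linux, the first field of /proc/loadavg is the live 1-minute load average.
if [ -r /proc/loadavg ]; then
    read -r load1_proc _ < /proc/loadavg
    echo "1-minute load (from /proc/loadavg): $load1_proc"
fi
```

The /proc/loadavg variant is Linux-specific, so treat it as an optional alternative rather than a drop-in replacement on other systems.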


cpu_threshold=1.5
  • Purpose: Defines a threshold value for CPU load.

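  • Explanation: If the 1-minute load is above this value (1.5), it's considered high. Load average counts tasks that are running or waiting to run, so a sensible threshold depends on how many CPU cores the machine has.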
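A fixed value like 1.5 means very different things on a 1-core VM and a 32-core server. One possible way to handle this is to scale the threshold by core count. The sketch below is an assumption on my part, not part of the original script: scaled_threshold is a hypothetical helper, and the per-core factor of 1.5 is just an example value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: multiply a per-core limit by the number of cores.
scaled_threshold() {
    local cores=$1 per_core=$2
    awk -v c="$cores" -v p="$per_core" 'BEGIN { printf "%.2f\n", c * p }'
}

# nproc (coreutils) reports the number of available processing units;
# fall back to 1 if it is not installed.
cores=$(nproc 2>/dev/null || echo 1)

# 1.5 per core is an assumed example value, tune it for your workload.
cpu_threshold=$(scaled_threshold "$cores" 1.5)
echo "cores=$cores cpu_threshold=$cpu_threshold"
```

With this approach a 4-core machine would get a threshold of 6.00, while a single-core machine keeps 1.50.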


cpu_check=$(echo "$cpu_load > $cpu_threshold" | bc -l)
  • Purpose: Compares the actual CPU load against the threshold using the bc calculator.

  • Returns:

    • 1 if the load is greater than the threshold.

    • 0 if the load is less than or equal to the threshold.
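bc is not always installed on minimal images or containers. If that is a concern, the same floating-point comparison can be sketched with awk, which is present on practically every Linux system. float_gt is a hypothetical helper name I'm introducing here; it follows the same convention as the bc version (1 means "greater than", 0 means "not greater than").

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints 1 if $1 > $2, otherwise 0.
# Adding 0 forces awk to compare the values as numbers, not strings.
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { if (a + 0 > b + 0) print 1; else print 0 }'
}

float_gt 2.10 1.5   # prints 1 (load above threshold)
float_gt 0.45 1.5   # prints 0 (load below threshold)
float_gt 1.5 1.5    # prints 0 (equal is not "greater than")
```

In the script, this would replace the bc call like so: cpu_check=$(float_gt "$cpu_load" "$cpu_threshold").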


print_status "$cpu_check" "CPU Load: $cpu_load"
  • Purpose: Calls the print_status function to output a status message.

  • Behavior: Depending on cpu_check:

    • If 1, it prints [FAIL] CPU Load: x.xx

    • If 0, it prints [OK] CPU Load: x.xx


2. Check Memory Usage:

This shell script snippet checks system memory usage and evaluates whether it crosses a specified threshold (90% usage in this case). If usage is below the threshold, it reports OK; otherwise, it indicates a failure.

# 2. Check Memory Usage (in percentage)
mem_free=$(free -m | awk '/^Mem:/ {print $4}') # Free memory in MB
mem_total=$(free -m | awk '/^Mem:/ {print $2}') # Total memory in MB

mem_percent=$((100 - mem_free * 100 / mem_total))  # Memory used in percentage

if [ "$mem_percent" -lt 90 ]; then
    mem_status=0   # OK status if usage is below threshold
else
    mem_status=1  # FAIL status otherwise
fi

print_status "$mem_status" "Memory usage: ${mem_percent}% used"  # Print memory status

Line-by-Line Explanation:

mem_free=$(free -m | awk '/^Mem:/ {print $4}')
  • Purpose: Gets the amount of free memory in MB.

  • Command breakdown:

    • free -m shows memory in megabytes.

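    • awk '/^Mem:/ {print $4}' filters the line starting with Mem: and prints the 4th column, which is the "free" column. Note that this counts only memory that is completely unused; memory the kernel is using for buffers and cache is not included, even though much of it can be reclaimed.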


mem_total=$(free -m | awk '/^Mem:/ {print $2}')
  • Purpose: Gets the total memory in MB.

  • Explanation: Again uses awk on the Mem: line but retrieves the 2nd column for total physical memory.


mem_percent=$((100 - mem_free * 100 / mem_total))
  • Purpose: Calculates percentage of memory used.

  • Formula: used = 100 - (free * 100 / total)

    • This gives a percentage value of memory currently utilized.
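Here is the formula worked through with fixed numbers, so the integer arithmetic is easy to follow. The free -m output below is a made-up sample, and used_by_free / used_by_avail are names I'm using only for illustration. The second calculation is an assumption on my part: it applies the same formula to the "available" column (column 7 in recent procps versions of free), which usually gives a more realistic picture because it accounts for reclaimable cache.

```shell
#!/usr/bin/env bash
# Hypothetical `free -m` output, used so the arithmetic is reproducible.
sample_free='               total        used        free      shared  buff/cache   available
Mem:            8000        2600        3000         100        2400        5000
Swap:           2047           0        2047'

mem_total=$(echo "$sample_free" | awk '/^Mem:/ {print $2}')
mem_free=$(echo "$sample_free"  | awk '/^Mem:/ {print $4}')
mem_avail=$(echo "$sample_free" | awk '/^Mem:/ {print $7}')

# Formula from the script. Bash uses integer arithmetic, so
# 3000 * 100 / 8000 = 37 (truncated from 37.5), and 100 - 37 = 63.
used_by_free=$((100 - mem_free * 100 / mem_total))

# Same formula with the "available" column:
# 5000 * 100 / 8000 = 62 (truncated from 62.5), and 100 - 62 = 38.
used_by_avail=$((100 - mem_avail * 100 / mem_total))

echo "Based on free column:      ${used_by_free}% used"
echo "Based on available column: ${used_by_avail}% used"
```

On this sample the two approaches differ by 25 percentage points, which can be the difference between [OK] and [FAIL] on a server with a large page cache.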

if [ "$mem_percent" -lt 90 ]; then
    mem_status=0
else
    mem_status=1
fi
  • Purpose: Compares the memory usage against the threshold (90%).

  • Result:

    • If usage is under 90%, status is 0 (OK).

    • If usage is 90% or higher, status is 1 (FAIL).


print_status "$mem_status" "Memory usage: ${mem_percent}% used"
  • Purpose: Calls a function to print a status message.

  • Output:

    • [OK] Memory usage: xx% used (if under threshold)

    • [FAIL] Memory usage: xx% used (if at or over threshold)


3. Check Disk Usage on Root Filesystem (/):

This shell script snippet checks the percentage of disk space used on the root (/) filesystem, and compares it to a 90% usage threshold. If usage is below the threshold, it returns OK; otherwise, it flags a failure.

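# 3. Check Disk Usage on Root Filesystem (/)

disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')  # Usage percentage without the % sign

if [ "$disk_usage" -lt 90 ]; then
    disk_status=0  # OK status if usage is below threshold
else
    disk_status=1  # FAIL status otherwise
fi

print_status "$disk_status" "Disk usage on /: ${disk_usage}% used"  # Print disk status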

Line-by-Line Explanation:

disk_usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
  • Purpose: Retrieves the disk usage percentage (without the % symbol) for the root filesystem.

  • How it works:

    • df / shows disk usage stats for the root mount point.

    • awk 'NR==2 {print $5}' selects the second line (skipping the header) and the fifth column, which is the usage percentage.

    • tr -d '%' removes the % symbol to make it a pure number for comparison.
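To make the parsing concrete, this small sketch runs the same awk and tr steps over a fixed sample of df output. The sample text is invented for illustration. In a real script you may also want df -P /, which asks for the POSIX output format and keeps each filesystem on a single line even when the device name is long.

```shell
#!/usr/bin/env bash
# Hypothetical `df /` output, used so the result is reproducible.
sample_df='Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1       41152736 19342112  19697216  50% /'

# NR==2 skips the header line; $5 is the "Use%" column; tr strips the % sign.
disk_usage=$(echo "$sample_df" | awk 'NR==2 {print $5}' | tr -d '%')
echo "Disk usage on /: ${disk_usage}% used"
```

The result is a plain integer (50 here), which is what the -lt comparison in the next step needs.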


if [ "$disk_usage" -lt 90 ]; then
    disk_status=0
else
    disk_status=1
fi
  • Purpose: Compares the disk usage value with the 90% threshold.

  • Result:

    • If usage is less than 90%, disk_status is set to 0 (OK).

    • If usage is 90% or more, it's set to 1 (FAIL).
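The memory and disk checks use the same if/else pattern, so one option is to fold it into a small helper. This is only a sketch: threshold_status is a hypothetical function name, and it assumes both arguments are whole numbers, as they are in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints 0 (OK) if value is below limit, otherwise 1 (FAIL).
threshold_status() {
    local value=$1 limit=$2
    if [ "$value" -lt "$limit" ]; then
        echo 0
    else
        echo 1
    fi
}

threshold_status 65 90   # prints 0
threshold_status 90 90   # prints 1 (exactly at the threshold counts as FAIL)
threshold_status 91 90   # prints 1
```

In the script this could be used as disk_status=$(threshold_status "$disk_usage" 90), and likewise for mem_percent.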


print_status "$disk_status" "Disk usage on /: ${disk_usage}% used"
  • Purpose: Outputs the disk usage status using the print_status function.

  • Output Example:

    • [OK] Disk usage on /: 65% used

    • [FAIL] Disk usage on /: 91% used


4. Ping Test – Check Network Reachability

This shell script snippet tests whether the system has basic network connectivity by pinging a reliable public IP address — Google's DNS (8.8.8.8). If the ping is successful, it reports OK; otherwise, it flags a failure.

# 4. Ping Test - Check if the network is reachable by pinging Google's DNS (8.8.8.8)

ping -c 1 8.8.8.8 > /dev/null 2>&1 # Ping Google DNS to verify network connection
ping_status=$?    # 0 = success, non-zero = failure

print_status "$ping_status" "Network: Ping to 8.8.8.8"  # Print network status

Line-by-Line Explanation:

ping -c 1 8.8.8.8 > /dev/null 2>&1
  • Purpose: Sends 1 ICMP packet to 8.8.8.8 to check if the network is reachable.

  • Redirection:

    • > /dev/null suppresses the standard output.

    • 2>&1 redirects standard error to the same place as standard output, so error messages are suppressed too.

  • Why 8.8.8.8? It’s a globally reachable, highly reliable IP — ideal for basic connectivity checks.


ping_status=$?
  • Purpose: Captures the exit status of the ping command.

  • Value:

    • 0 → ping was successful (network is reachable).

    • Non-zero → ping failed (network issue or no connectivity).
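Without a timeout, a single ping can wait several seconds for a reply when the network is down, which slows the whole health check. The sketch below adds a reply timeout and makes the target configurable. check_ping and the PING_TARGET variable are hypothetical names of my own. The -W flag is the reply timeout in seconds on Linux (iputils) ping; other ping implementations may treat it differently, so check your man page.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: send one ping and wait at most 2 seconds for a reply.
check_ping() {
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# PING_TARGET is an assumed override variable; the default matches the script.
target="${PING_TARGET:-8.8.8.8}"

if check_ping "$target"; then
    ping_status=0
else
    ping_status=$?   # exit status of the failed ping
fi

echo "Ping to $target exited with status $ping_status"
```

Keep in mind that some networks block ICMP, so a failed ping does not always mean there is no connectivity.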


print_status "$ping_status" "Network: Ping to 8.8.8.8"
  • Purpose: Prints the result of the ping test using the print_status function.

  • Output Example:

    • [OK] Network: Ping to 8.8.8.8

    • [FAIL] Network: Ping to 8.8.8.8


5. Check if a Critical Service (sshd) is Running

This shell script snippet checks the status of a critical system service — in this case, sshd (the SSH daemon) — to ensure it's actively running. It's useful for validating that key services remain operational.

# 5. Check if critical service (e.g., sshd) is running
# Using 'systemctl' to check the status of the 'sshd' service
if systemctl is-active --quiet sshd; then
    print_status 0 "Service 'sshd' is running"
else
    print_status 1 "Service 'sshd' is NOT running"
fi

Line-by-Line Explanation:

if systemctl is-active --quiet sshd; then
  • Purpose: Checks if the sshd service is active using systemctl.

  • Details:

    • systemctl is-active sshd prints active and exits with status 0 if the service is running; otherwise it exits with a non-zero status.

    • --quiet suppresses output — only the return code is used.

    • If the command succeeds (exit code 0), it means the service is running.


print_status 0 "Service 'sshd' is running"
  • Executed when: sshd is running.

  • Purpose: Uses the print_status function to report success with an [OK] status.


else
    print_status 1 "Service 'sshd' is NOT running"
fi
  • Executed when: sshd is not running.

  • Purpose: Reports a [FAIL] status via the same print_status function, alerting the user/admin.
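In practice you often care about more than one service, and the unit name is not the same everywhere (on Debian and Ubuntu, for example, the SSH unit is typically named ssh). Here is a minimal sketch of a loop over a list of units. service_status and the SERVICES list are hypothetical names I'm introducing, and the helper also guards against systemctl being absent, as it usually is inside containers.

```shell
#!/usr/bin/env bash
# Hypothetical helper:
#   returns 0 if the unit is active,
#           1 if it is not active,
#           2 if systemctl is not installed.
service_status() {
    local unit=$1
    if ! command -v systemctl > /dev/null 2>&1; then
        return 2
    fi
    if systemctl is-active --quiet "$unit" 2>/dev/null; then
        return 0
    fi
    return 1
}

# SERVICES is an assumed list; adjust the unit names to your distribution.
SERVICES="sshd cron"

for unit in $SERVICES; do
    if service_status "$unit"; then
        echo "[OK] Service '$unit' is running"
    else
        echo "[FAIL] Service '$unit' is NOT running (or systemctl is unavailable)"
    fi
done
```

In the full script you would call print_status instead of echo inside the loop.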


Print Color-Coded Status Messages

The print_status function is a utility designed to display status messages in a clear, color-coded format — typically used in health checks or monitoring scripts. It prints [OK] in green when the status is success (0) and [FAIL] in red when it fails (non-zero).


Line-by-Line Explanation:

print_status() {
    local status=$1  # $1 is the exit/status code passed to the function (0 = success)
    local message=$2 # $2 is the message to be printed (context of the check)
  • Declares a function named print_status.

  • Accepts two arguments:

    • status: exit code (0 for success, non-zero for failure).

    • message: descriptive message about what was tested.


    if [ "$status" -eq 0 ]; then
        echo -e "${GREEN} [OK]${RESET} $message"
  • If status == 0, it's considered a successful check.

  • Uses echo -e to:

    • Enable interpretation of escape sequences (for colors).

    • Print [OK] in green followed by the message.


    else
        echo -e "${RED} [FAIL]${RESET} $message"
    fi
}
  • If status != 0, it's a failure.

  • Prints [FAIL] in red with the associated message.


Note:

For this function to work properly, you should define color codes in your script, like:

GREEN='\033[0;32m'
RED='\033[0;31m'
RESET='\033[0m'

These define ANSI escape sequences to colorize terminal output.
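When the script runs from cron or its output is redirected to a log file, the raw escape sequences end up in the log as noise. One possible variation, sketched below, only uses colors when standard output is a terminal. This is my own assumption about how you might want it to behave, not the original function: it keeps the same name and arguments, but uses printf and local color variables instead of echo -e and the global GREEN/RED/RESET.

```shell
#!/usr/bin/env bash
# Variant of print_status that disables colors when stdout is not a terminal.
print_status() {
    local status=$1    # 0 = success, non-zero = failure
    local message=$2   # description of the check
    local green='' red='' reset=''

    # [ -t 1 ] is true only when file descriptor 1 (stdout) is a terminal.
    if [ -t 1 ]; then
        green='\033[0;32m'
        red='\033[0;31m'
        reset='\033[0m'
    fi

    # %b expands the escape sequences; %s prints the message literally.
    if [ "$status" -eq 0 ]; then
        printf '%b [OK]%b %s\n' "$green" "$reset" "$message"
    else
        printf '%b [FAIL]%b %s\n' "$red" "$reset" "$message"
    fi
}

print_status 0 "CPU Load: 0.45"
print_status 1 "Disk usage on /: 93% used"
```

Run interactively, this prints colored tags; piped through cat or written to a file, it prints plain [OK] and [FAIL] lines.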

Sample Output: System Health Check

 [OK] CPU Load: 0.45
 [OK] Memory usage: 61% used
 [OK] Disk usage on /: 47% used
 [OK] Network: Ping to 8.8.8.8
 [OK] Service 'sshd' is running

Key Uses of the Script:

  1. Proactive Monitoring:

    • Helps detect early signs of system resource exhaustion (e.g., high CPU, memory, or disk usage).
  2. Automation-Friendly:

    • Can be run on a schedule (e.g., via cron) to generate periodic health reports or trigger alerts.
  3. Troubleshooting Aid:

    • Quickly pinpoints common problems like:

      • High load on CPU

      • Low available memory

      • Nearly full disk

      • Network outages

      • Critical services not running (e.g., sshd)

  4. Custom Health Dashboards or Scripts:

    • Outputs consistent [OK]/[FAIL] lines that are easy to grep or feed into other tooling (Nagios-style checks, Prometheus exporters, or CI/CD pipeline steps). If a program will parse the output, strip or disable the color codes first.
  5. Lightweight Alternative to Heavy Monitoring Tools:

    • Ideal for minimal setups or containerized environments where you don’t want to install full-fledged monitoring agents.
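For cron jobs and CI/CD pipelines, the printed lines are only half the story: the caller usually decides what to do based on the script's exit code. A minimal sketch of that idea is to count failed checks and exit non-zero if there were any. The record function and the failures counter are hypothetical names for illustration, and the three calls below use made-up results in place of real checks.

```shell
#!/usr/bin/env bash
failures=0

# Hypothetical wrapper: prints the status line and counts failures.
record() {
    local status=$1 message=$2
    if [ "$status" -eq 0 ]; then
        echo "[OK] $message"
    else
        echo "[FAIL] $message"
        failures=$((failures + 1))
    fi
}

# Made-up results standing in for the real checks.
record 0 "CPU Load: 0.45"
record 1 "Disk usage on /: 93% used"
record 0 "Service 'sshd' is running"

echo "Failed checks: $failures"

# At the very end of a real script you would turn this into the exit code:
#   [ "$failures" -eq 0 ] || exit 1
```

With that last line enabled, a pre-deployment hook can simply run the script and abort the deployment when it returns a non-zero status.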

Conclusion

This script serves as a simple yet powerful system health check utility that provides immediate visibility into critical aspects of a Linux server's performance and availability. By evaluating CPU load, memory usage, disk space, network connectivity, and service status, it enables:

  • Quick diagnostics during troubleshooting

  • Proactive monitoring to prevent outages

  • Lightweight integration into cron jobs, CI/CD pipelines, or custom dashboards

It’s an essential tool for DevOps, SysAdmins, and SREs who need fast, scriptable insights without relying on full-scale monitoring suites.
