I’m an Ops guy at heart - I tend to learn by doing.

I’ve been using the shell to get sharper at understanding when, where, and why to write certain commands - think of this as a living, breathing wiki.

I like to believe anything can be tackled with a bit of practice and the 80/20 rule - most problems have been solved before, which is great! That just means you have to put in a little effort to internalize things.

I think there are some classic tasks anyone working with the shell should know how to do:

Verify identity & host — whoami && hostname && uptime Why: confirm you’re on the right machine and check load. Parallel: check your badge & dashboard before touching equipment.
Current working dir & disk free — pwd; df -h . Why: know where you are and filesystem free space for current path. Parallel: check your workstation table and fuel gauge.
Show top CPU/mem processes — ps -eo pid,cmd,%cpu,%mem --sort=-%cpu | head -n 12 Why: find runaway processes quickly. Parallel: spot the machine hog in a rack.
Live interactive monitor — top or htop Why: live view of processes, CPU, memory, load. Parallel: a live camera on the assembly line.
Check system logs (journal) — journalctl -n 200 --no-pager Why: recent system-level events (boots, crashes). Parallel: read the last few pages of the maintenance logbook.
Tail an app log — tail -F /var/log/myapp.log Why: watch logs in real time during deploys or tests. Parallel: listen to the machine’s heartbeat.
Count recent ERRORs — tail -n 1000 /var/log/syslog | grep -cE "ERROR|CRITICAL" Why: quick metric for severity spike. Parallel: count how many failed items on the line.
Find top erroring services — tail -n 2000 /var/log/syslog | grep -Eo '([A-Za-z0-9_.+-]+)\[[0-9]+\]' | sed 's/\[.*//' | sort | uniq -c | sort -nr | head Why: identify the service that’s spamming errors. Parallel: which machine pulls the fire alarm most.
Disk usage by directory — du -sh /var/* 2>/dev/null | sort -hr | head -n 20 Why: find directories using most space. Parallel: which storeroom has the most pallets.
Largest files on filesystem — sudo find / -xdev -type f -printf '%s\t%p\n' 2>/dev/null | sort -nr | head -n 20 | awk -F'\t' '{printf "%.1f MB\t%s\n",$1/1024/1024,$2}' Why: locate big files before the disk fills up (the tab separator keeps paths with spaces intact). Parallel: find the oversized boxes blocking the aisle.
Recently modified files — find /var -type f -mtime -1 -ls 2>/dev/null | head -n 50 Why: detect recent changes to config/logs. Parallel: check which machines were serviced in last 24h.
Inode usage — df -i Why: avoid inode exhaustion (many small files). Parallel: number of slots on the shelf, not total volume.
Find directories with many small files — find /path -maxdepth 2 -type d -exec sh -c 'printf "%s\t%s\n" "$(find "$1" -type f | wc -l)" "$1"' _ {} \; | sort -nr | head Why: identify dirs with too many files (backup pain); the count comes first so sorting still works when paths contain spaces. Parallel: a warehouse full of tiny screws.
Check open deleted files — sudo lsof +L1 Why: processes holding deleted files still use space. Parallel: boxes tossed out but still attached to machines.
Who’s listening (sockets) — ss -tuln Why: verify expected services are bound to ports. Parallel: which doors are open in the building.
Check specific port/service health — curl -fsS -m 5 http://127.0.0.1:8080/health || echo "failed" Why: quick HTTP health probe. Parallel: take the pulse of a service.
DNS resolution test — dig +short example.com A Why: confirm DNS answers correctly. Parallel: check the address book returns the right warehouse address.
Traceroute / network path — tracepath 8.8.8.8 Why: see hops/latency to a remote host. Parallel: follow the delivery route to troubleshoot delays.
Check firewall rules — sudo iptables -L -n or sudo nft list ruleset Why: ensure traffic allowed or blocked as intended. Parallel: confirm which doors are locked.
Test TLS expiry — echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates Why: check certificate expiry to avoid downtime. Parallel: check the passport expiration before travel.
Check service status — systemctl status <service> Why: health and start logs for a systemd unit. Parallel: pull up the service’s maintenance log.
Tail unit logs — journalctl -u <service> -n 200 --no-pager Why: focus on a single service’s history. Parallel: read the operator’s notes for that machine.
Restart service (safe) — sudo systemctl restart <service> && sudo systemctl status <service> Why: controlled restart then verify. Parallel: power-cycle a misbehaving piece of equipment and verify.
Check cron jobs — crontab -l; ls -l /etc/cron.* Why: ensure scheduled jobs exist and are running. Parallel: check the scheduled maintenance roster.
Check package versions — dpkg -l | head or rpm -qa | head Why: confirm software versions for troubleshooting. Parallel: check model numbers of components.
Check for available updates — sudo apt update && apt list --upgradable Why: plan security/patch windows. Parallel: list pending firmware updates.
Disk SMART health — sudo smartctl -a /dev/sda Why: detect failing disks early. Parallel: check the drive’s heartbeat for impending failure.
LVM status — sudo lvdisplay && sudo vgdisplay Why: verify logical volumes and free space. Parallel: check virtual storage partitions in the warehouse.
Mount points & fstab — mount | column -t and cat /etc/fstab Why: ensure expected mounts are present and persistent. Parallel: check whether storage racks are attached correctly.
Check NFS mounts & stats — mount | grep nfs and df -h /path/to/nfs Why: verify network storage availability. Parallel: ensure shared shelving is accessible.
Find broken symlinks — find / -type l ! -exec test -e {} \; -print 2>/dev/null | head -n 50 Why: stale symlinks can break services. Parallel: map pointing to a shelf that no longer exists.
World-writable files (audit) — sudo find / -perm -0002 -type f -printf '%M %u %g %p\n' 2>/dev/null | head -n 50 Why: security hygiene (unexpected writable files). Parallel: doors unlocked in restricted rooms.
Setuid/setgid binaries — sudo find / -perm /6000 -type f -printf '%M %u %g %p\n' 2>/dev/null Why: audit privileged binaries that can escalate access. Parallel: machines with master keys attached.
Orphaned files (no owner/group) — sudo find / \( -nouser -o -nogroup \) -print 2>/dev/null | head Why: leftover files after user deletion (security/cleanup); the parentheses matter, because without them -print only applies to the -nogroup branch. Parallel: orphaned crates with no owner sticker.
Find core dumps — sudo find / -type f \( -iname 'core*' -o -iname '*.core' \) 2>/dev/null | head Why: crash artifacts and debugging clues (the parentheses make -type f apply to both name patterns). Parallel: broken parts left after a machine fails.
Search file contents for secrets (careful) — sudo grep -RIn --exclude-dir={proc,sys,dev} 'BEGIN RSA PRIVATE KEY\|AKIA' / 2>/dev/null | head Why: detect accidentally committed secrets (rotate if found). Note that --exclude-dir matches directory base names, so this skips any directory named proc, sys, or dev. Parallel: detect leaked keys to the safe.
Duplicate files (potential reclaim) — fdupes -r /path || echo "install fdupes" Why: reclaim space by deduping (inspect before deleting). Parallel: two identical pallets stored twice.
Check recently changed files — find /var/log -type f -ctime -1 -ls Why: spot new or changed logs/configs after a deploy (ctime is the inode change time, not the creation time). Parallel: note new shipment arrivals.
Count files in directory — find /var/log -maxdepth 1 -type f | wc -l Why: detect directories with explosion of files. Parallel: one shelf suddenly overflowing.
Find files >100MB — find / -xdev -type f -size +100M -printf '%s %p\n' 2>/dev/null | sort -nr | head Why: identify unusually large files quickly. Parallel: oversized crates creating storage issues.
Check open ports & connections — ss -s && ss -tupan Why: summarize socket usage and active connections. Parallel: overall traffic and open doors.
View process FD usage — ls -l /proc/<PID>/fd (or lsof -p <PID>) Why: find files a process is using (logs, sockets). Parallel: what tools an operator currently has open.
Check file descriptors per process (leak detection) — for p in $(ps -e -o pid=); do echo -n "$p "; ls /proc/$p/fd 2>/dev/null | wc -l; done | sort -n -k2 -r | head Why: spot FD leaks. Parallel: see who left too many doors open.
Restart vs reload — sudo systemctl reload <service> (if supported) or restart Why: prefer reload for config changes when service supports it. Parallel: adjust settings without powering full cycle.
Reopen logs (logrotate safe) — kill -HUP <pid> (daemon that supports reopen) Why: make daemons reopen log files after rotation. Parallel: tell a machine to switch to a fresh logbook.
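Rotate logs manually — mv /var/log/myapp.log /var/log/myapp.log.1 && kill -USR1 <pid> Why: emergency rotation if the disk is filling fast. This only works if the daemon reopens its logs on USR1 (nginx does); the signal varies by daemon, so check its docs first. Parallel: archive the current ledger and start a new one.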
Atomic write pattern — echo '{"ok":true}' | (tmp=$(mktemp /tmp/out.tmp.XXXX) && cat >"$tmp" && mv "$tmp" /tmp/out.json) Why: avoid half-written files visible to readers. Parallel: stage a crate behind the curtain, then roll it into place.
Create a temporary directory for safe ops — tmpdir=$(mktemp -d) && echo "$tmpdir" Why: ephemeral workspace that’s unique and secure. Parallel: use a temporary workbench rather than the production table.
Check crontab health / last run — grep -R "cron" /var/log/* 2>/dev/null | tail -n 50 Why: see cron jobs’ recent outputs and failures. Parallel: confirm scheduled maintenance actually ran.
Minimal incident runbook template (capture quickly) — `cat > ~/incident-$(date +%FT%T).md <<'MD' && sed -n '1,20p' ~/incident-*.md
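
Here is a rough sketch of how a few of the one-liners above (identity, disk free, error count, temp dir, atomic write) could be stitched into one tiny triage snapshot. Treat it as practice material, not a finished tool: the function names (count_errors, write_atomic, snapshot), the report layout, and the default log path /var/log/syslog are my own placeholders, and that log file may not exist on your distro.

```shell
#!/usr/bin/env bash
# Triage snapshot sketch: stitches a few one-liners into a single report.
# All function names and the report layout are illustrative placeholders.
set -euo pipefail

# count_errors FILE [LINES]: count ERROR/CRITICAL lines in the tail of a log.
count_errors() {
  local file=$1 lines=${2:-1000}
  # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps set -e happy.
  tail -n "$lines" "$file" 2>/dev/null | grep -cE 'ERROR|CRITICAL' || true
}

# write_atomic DEST: copy stdin to a temp file next to DEST, then rename it.
write_atomic() {
  local dest=$1 tmp
  # Same directory as DEST, so mv is a rename on one filesystem.
  tmp=$(mktemp "$(dirname "$dest")/.snapshot.XXXXXX")
  cat >"$tmp"
  mv "$tmp" "$dest"
}

# snapshot [LOG]: print identity, location, free disk, and a recent error count.
snapshot() {
  local log=${1:-/var/log/syslog}   # assumed default; adjust for your distro
  echo "when:   $(date +%FT%T)"
  echo "who:    $(whoami 2>/dev/null)@$(uname -n)"
  echo "cwd:    $(pwd)"
  echo "disk:   $(df -Ph . | awk 'NR==2 {print $4 " free of " $2}')"
  echo "errors: $(count_errors "$log") in the last 1000 lines of $log"
}

workdir=$(mktemp -d)              # unique scratch directory
out="$workdir/snapshot.txt"
snapshot | write_atomic "$out"    # readers never see a half-written report
cat "$out"
```

Save it under any name you like and run it with bash. The scratch directory is left in place so you can look at the report afterwards; remove it once you are done.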

Rubber Ducking

I think a lot of the paranoia around shell scripting - yours truly included - comes from simply never having written these commands, so you have no mental model. No matter how smart you are, you need some degree of muscle memory: I’d hate to have a surgeon who had never used a scalpel decide to use one on me today…

Keeping that in mind, I intend, little by little, to work up from practicing the basics to learning how to write classic triage scripts, and then keep working through classic issues. Perhaps this blog would be better off as a book?

Saptarsi Guha
